Re: Index size vs. number of documents

2008-08-14 Thread Chris Hostetter
: > I'm surprised, as you are, by the non-linearity. Out of curiosity, what is Unless the data in "stored" fields is significantly greater than "indexed" fields the index size almost never grows linearly with the number of documents -- it's the number of unique terms that tends to primarily in
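
To illustrate the point about unique terms, here is a minimal sketch (not Lucene code) that tracks how many distinct terms have been seen as documents are added. The function name `vocabulary_growth`, the whitespace tokenization, and the sample documents are all assumptions made for illustration; a real analyzer would tokenize differently.

```python
def vocabulary_growth(documents):
    """Return the cumulative number of unique terms after each document.

    Assumption: terms are lowercased, whitespace-separated tokens.
    """
    seen = set()
    growth = []
    for text in documents:
        seen.update(text.lower().split())
        growth.append(len(seen))
    return growth


# Hypothetical sample documents, not taken from the thread.
docs = [
    "solr index size",
    "index size grows",
    "solr index grows",
    "size of the solr index",
]
print(vocabulary_growth(docs))  # → [3, 4, 4, 6]
```

The third document adds no new terms, so the term count stays flat even though the document count went up.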

Re: IndexOutOfBoundsException

2008-08-14 Thread Yonik Seeley
Since this looks like more of a lucene issue, I've replied in [EMAIL PROTECTED] -Yonik On Thu, Aug 14, 2008 at 10:18 PM, Ian Connor <[EMAIL PROTECTED]> wrote: > I seem to be able to reproduce this very easily and the data is > medline (so I am sure I can share it if needed with a quick email to >

Re: IndexOutOfBoundsException

2008-08-14 Thread Ian Connor
I seem to be able to reproduce this very easily and the data is medline (so I am sure I can share it if needed with a quick email to check). - I am using fedora: %uname -a Linux ghetto5.projectlounge.com 2.6.23.1-42.fc8 #1 SMP Tue Oct 30 13:18:33 EDT 2007 x86_64 x86_64 x86_64 GNU/Linux %java -vers

Re: Simple Searching Question

2008-08-14 Thread Jake Conk
Rob, Actually I am copying *_facet to text. I have the following for copyField in my schema: This is my field configuration in my schema: Thanks, - Jake On Thu, Aug 14, 2008 at 5:49 PM, Rob Casson <[EMAI

Re: Simple Searching Question

2008-08-14 Thread Rob Casson
you're likely not copyField-ing *_facet to text, and we'd need to see what type of field it is to see how it will be analyzed at both search/index time. the default schema.xml file is pretty well documented, so you might want to spend some time looking thru it, and reading the comments... lots of

Re: Best way to index without diacritics

2008-08-14 Thread Norberto Meijome
On Thu, 14 Aug 2008 11:34:47 -0400 "Steven A Rowe" <[EMAIL PROTECTED]> wrote: [...] > The kind of filter Walter is talking about - a generalized language-aware > character normalization Solr/Lucene filter - does not yet exist. My guess is > that if/when it does materialize, both the Solr and th

Re: More files in index directory than expected

2008-08-14 Thread Yonik Seeley
On Thu, Aug 14, 2008 at 6:31 PM, Chris Harris <[EMAIL PROTECTED]> wrote: > (The only time a > segment will be modified is if you delete files from it, and that will > only alter the segment's .del file, leaving .tis and friends alone.) Actually, these days .del files are even versioned. > I don't

Re: Simple Searching Question

2008-08-14 Thread Jake Conk
Hi Shalin, "foobar_facet" is a dynamic field. It's defined in my schema like this: I have the default search field set to text. Can I use more than one default search field? text Thanks, - Jake On Thu, Aug 14, 2008 at 2:48 PM, Shalin Shekhar Mangar <[EMAIL PROTECTED]> wrote: > Hi Jake, > > W

Re: More files in index directory than expected

2008-08-14 Thread Mark Miller
The main thing that bugs me about this index now is that the latest version of Luke (0.8.1) won't open it. ("Unknown format version: -6") The Solr Luke handler works fine with it, though. Luke comes with a released version of Lucene probably, while solr is using a later version. You have to

Re: More files in index directory than expected

2008-08-14 Thread Chris Harris
On Thu, Aug 14, 2008 at 2:01 PM, Michael McCandless <[EMAIL PROTECTED]> wrote: > Chris Harris <[EMAIL PROTECTED]> wrote: >> It's my understanding that if my mergeFactor is 10, then there >> shouldn't be more than 11 segments in my index directory (10 segments, >> plus an additional segment if a mer

Re: Highlighting returns incorrect text on some results?

2008-08-14 Thread Mark Miller
A question mark huh? You sure there are no character encoding issues going on? Otis Gospodnetic wrote: Paul, we had many highlighter-related changes since 1.2, so I suggest you try the nightly. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message

Re: Simple Searching Question

2008-08-14 Thread Shalin Shekhar Mangar
Hi Jake, What is the type of the foobar_facet field in your schema.xml ? Did you add foobar_facet as the default search field? On Fri, Aug 15, 2008 at 3:13 AM, Jake Conk <[EMAIL PROTECTED]> wrote: > Hello, > > I inserted the following documents into Solr: > > > --

Simple Searching Question

2008-08-14 Thread Jake Conk
Hello, I inserted the following documents into Solr: --- 124 Jake Conk 125 Jake Conk --- id is the only requ

Re: QueryResultsCache and DocSet filter

2008-08-14 Thread Kevin Osborn
The DocSet isn't part of the cache key. The key is usually just a simple string (e.g. companyId). They just return a DocSet. I think the user caches are fine. This DocSet is then used as a filter for the actual query. I believe it is this step that is slow. However, I am guessing that the solut

Re: Highlighting returns incorrect text on some results?

2008-08-14 Thread Otis Gospodnetic
Paul, we had many highlighter-related changes since 1.2, so I suggest you try the nightly. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message > From: pdovyda2 <[EMAIL PROTECTED]> > To: solr-user@lucene.apache.org > Sent: Thursday, August 14, 2008 2:56

Re: More files in index directory than expected

2008-08-14 Thread Michael McCandless
Chris Harris <[EMAIL PROTECTED]> wrote: > It's my understanding that if my mergeFactor is 10, then there > shouldn't be more than 11 segments in my index directory (10 segments, > plus an additional segment if a merge is in progress). Actually, mergeFactor 10 means each *level* will have <= 10 seg
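
A rough way to see why a per-level limit allows more than 11 segments in total is the simplified model below. It is only a sketch under stated assumptions: every flush creates one level-0 segment, and as soon as a level holds `merge_factor` segments they are merged into one segment at the next level. The function `simulate_flushes` is hypothetical and is not Lucene's actual merge policy.

```python
def simulate_flushes(num_flushes, merge_factor=10):
    """Return a list where entry i is the number of segments at level i.

    Simplified model (an assumption, not Lucene's real merge policy):
    each flush adds one level-0 segment, and merge_factor segments at
    one level are merged into a single segment at the next level.
    """
    levels = []
    for _ in range(num_flushes):
        if not levels:
            levels.append(0)
        levels[0] += 1
        level = 0
        # Cascade merges upward while a level is full.
        while levels[level] >= merge_factor:
            levels[level] -= merge_factor
            if level + 1 == len(levels):
                levels.append(0)
            levels[level + 1] += 1
            level += 1
    return levels


levels = simulate_flushes(99)
print(levels, sum(levels))  # → [9, 9] 18
```

In this model no level ever holds more than 10 segments, yet after 99 flushes the index holds 18 segments in total.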

Re: QueryResultsCache and DocSet filter

2008-08-14 Thread Yonik Seeley
On Thu, Aug 14, 2008 at 3:15 PM, Kevin Osborn <[EMAIL PROTECTED]> wrote: > The problem here is that the calls in SolrIndexSearcher don't appear to use > the QueryResultsCache if the filter is a DocSet rather than a List. Right... using a DocSet as part of the cache key would be pretty slow (key co

Re: IndexOutOfBoundsException

2008-08-14 Thread Yonik Seeley
Yikes... not good. This shouldn't be due to anything you did wrong Ian... it looks like a lucene bug. Some questions: - what platform are you running on, and what JVM? - are you using multicore? (I fixed some index locking bugs recently) - are there any exceptions in the log before this? - how re

NOTICE - solrj MultiCore{Params/Request/Response} have been renamed CoreAdmin{Params/Request/Response}

2008-08-14 Thread Ryan McKinley
In the effort to clean up confusion around MultiCore usage, we have renamed the classes that handle runtime core administration from "MultiCoreX" to CoreAdminX. Additionally, the path that the default MultiCoreRequest expects to hit is: /admin/cores rather than /admin/multicore -- if you hav

QueryResultsCache and DocSet filter

2008-08-14 Thread Kevin Osborn
We have a bunch of user caches that return DocSet objects. So, we intersect them and send a DocSet filter and the actual query to getDocListAndSet or getDocList. The problem here is that the calls in SolrIndexSearcher don't appear to use the QueryResultsCache if the filter is a DocSet rather than

Re: term list

2008-08-14 Thread Grant Ingersoll
Assuming you mean significant in the traditional IR sense, I would start with the MoreLikeThis. See http://wiki.apache.org/solr/MoreLikeThisHandler In particular the mlt.interestingTerms option. As for phrases, that is a bit harder. You could try playing around with token-based n-grams (

Re: spellcheck collation

2008-08-14 Thread Doug Steigerwald
Right before I sent the message. Did a 'svn up src/; ant clean; ant dist' and it failed. Seems to work fine now. On Aug 14, 2008, at 2:38 PM, Ryan McKinley wrote: have you updated recently? isEnabled() was removed last night... On Aug 14, 2008, at 2:30 PM, Doug Steigerwald wrote: I'd try

Highlighting returns incorrect text on some results?

2008-08-14 Thread pdovyda2
This is kind of a strange issue, but when I submit a query and ask for highlighting back, sometimes the highlighted text includes a question mark at the beginning, although a question mark character does not appear in the field that the highlighted text is taken from. I've put some sample XML out

IndexOutOfBoundsException

2008-08-14 Thread Ian Connor
Hi, I have rebuilt my index a few times (it should get up to about 4 Million but around 1 Million it starts to fall apart). Exception in thread "Lucene Merge Thread #0" org.apache.lucene.index.MergePolicy$MergeException: java.lang.IndexOutOfBoundsException: Index: 105, Size: 33 at org.ap

Re: term list

2008-08-14 Thread Jack Tuhman
Humm, I am new to the world of search. I am looking for something that will give me a list of significant words or phrases extracted from a document stored in Solr. Jack On Fri, Aug 8, 2008 at 9:33 AM, Grant Ingersoll <[EMAIL PROTECTED]> wrote: > See https://issues.apache.org/jira/browse/SOLR-65

Re: spellcheck collation

2008-08-14 Thread Ryan McKinley
have you updated recently? isEnabled() was removed last night... On Aug 14, 2008, at 2:30 PM, Doug Steigerwald wrote: I'd try, but the build is failing from (guessing) Ryan's last commit: compile: [mkdir] Created dir: /Users/dsteiger/Desktop/java/solr/build/core [javac] Compiling 337 s

More files in index directory than expected

2008-08-14 Thread Chris Harris
It's my understanding that if my mergeFactor is 10, then there shouldn't be more than 11 segments in my index directory (10 segments, plus an additional segment if a merge is in progress). It would seem to follow that there shouldn't be more than 11 fdt files, 11 tis files, etc.. However, I'm looki

Re: spellcheck collation

2008-08-14 Thread Doug Steigerwald
I'd try, but the build is failing from (guessing) Ryan's last commit: compile: [mkdir] Created dir: /Users/dsteiger/Desktop/java/solr/build/core [javac] Compiling 337 source files to /Users/dsteiger/Desktop/java/solr/build/core [javac] /Users/dsteiger/Desktop/java/solr/client/java/s

Re: spellcheck collation

2008-08-14 Thread Grant Ingersoll
I believe I just fixed this on SOLR-606 (thanks to Stefan's patch). Give it a try and let us know. -Grant On Aug 13, 2008, at 2:25 PM, Doug Steigerwald wrote: I've noticed a few things with the new spellcheck component that seem a little strange. Here's my document: 5 wii blackberry

Duplicate Data Across Fields

2008-08-14 Thread wojtekpia
I have 2 fields which will sometimes contain the same data. When they do contain the same data, am I paying the same performance cost as when they contain unique data? I think the real question here is: does Lucene index values per field, or per document? -- View this message in context: http://

Re: Synonyms help in 1.3-HEAD?

2008-08-14 Thread Matthew Runo
Thank you for your suggestion, I really don't see anything 'wrong' with the longer lists.. I entered https://issues.apache.org/jira/browse/SOLR-702 for this issue, and attached relevant files. If you need anything more, don't hesitate to contact me! Thanks for your time! Matthew Runo Softw

Re: Synonyms help in 1.3-HEAD?

2008-08-14 Thread Yonik Seeley
There should be no limit, so you may have uncovered a bug. Could you open a JIRA issue? If it's a real bug, it should get fixed before 1.3. -Yonik On Thu, Aug 14, 2008 at 12:35 PM, Matthew Runo <[EMAIL PROTECTED]> wrote: > Hello folks! > > Having a heck of a time trying to get a synonyms file t

Synonyms help in 1.3-HEAD?

2008-08-14 Thread Matthew Runo
Hello folks! Having a heck of a time trying to get a synonyms file to work properly. It seems that something's wrong with the way it's been set up, but, honestly, I can't see anything wrong with it. Some samples... This works... zutanoapparel => zutano But this does not... aadias, aadidas,

Re: Index size vs. number of documents

2008-08-14 Thread Phillip Farber
Erick Erickson wrote: I'm surprised, as you are, by the non-linearity. Out of curiosity, what is your MaxFieldLength? By default only the first 10,000 tokens are added to a field per document. If you haven't set this higher, that could account for it. We set it to a very large number so we in

RE: List of available facet fields returned with the query results

2008-08-14 Thread Barry Harding
Hi Shalin, As there is certainly the potential for several thousand different attribute types across all of our categories I guess I will have to manage them myself (was hoping for a short-cut or that I was missing a trick) but no problem. Solr still seems to outperform the commercial package we a

RE: Best way to index without diacritics

2008-08-14 Thread Steven A Rowe
Hi Norberto, On 08/14/2008 at 8:10 AM, Norberto Meijome wrote: > > On 8/13/08 9:16 AM, "Steven A Rowe" <[EMAIL PROTECTED]> wrote: > > > > > Hi Norberto, > > > > > > https://issues.apache.org/jira/browse/LUCENE-1343 > > hi Steve, > thanks for the pointer. this is a Lucene entry... I thought the

Re: List of available facet fields returned with the query results

2008-08-14 Thread Shalin Shekhar Mangar
Hi Barry, If each category has an exclusive set of fields on which you want to facet on, then you can simply facet on all facet-able fields (across all categories). The ones which are not present for the selected category will show up with zero facets which your front-end can suppress. However if
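
The front-end suppression described above could look roughly like this. This is only a sketch: the function `visible_facets`, the nested-dict shape of `facet_counts`, and the field names are assumptions for illustration, not the format Solr actually returns.

```python
def visible_facets(facet_counts):
    """Drop facet values with a zero count, and fields left with no values.

    Assumption: facet_counts maps field name -> {value: count}.
    """
    visible = {}
    for field, counts in facet_counts.items():
        non_zero = {value: n for value, n in counts.items() if n > 0}
        if non_zero:
            visible[field] = non_zero
    return visible


# Hypothetical counts for a query restricted to one category.
counts = {
    "screen_size_facet": {"15in": 4, "17in": 0},
    "paper_size_facet": {"A4": 0, "A3": 0},
}
print(visible_facets(counts))  # → {'screen_size_facet': {'15in': 4}}
```

Fields that do not apply to the selected category end up with only zero counts and disappear from the display.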

RE: Exception during Solr startup

2008-08-14 Thread Kashyap, Raghu
Hi Yonik & Erik, Thanks to both of you. It seems like our container had some issues and was causing this problem. Thanks, Raghu -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Yonik Seeley Sent: Wednesday, August 13, 2008 10:57 AM To: solr-user@lucene

Re: Administrative questions

2008-08-14 Thread Jason Rennie
On Wed, Aug 13, 2008 at 1:52 PM, Jon Drukman <[EMAIL PROTECTED]> wrote: > Duh. I should have thought of that. I'm a big fan of djbdns so I'm quite > familiar with daemontools. > > Thanks! > :) My pleasure. Was nice to hear recently that DJB is moving toward more flexible licensing terms. For

List of available facet fields returned with the query results

2008-08-14 Thread Barry Harding
Hi, I have Solr set up to index technical data for a number of different types of products, and this means that different products have different facet fields available. For example here would be a small example of the sort of data we are indexing, in reality there are between 10 and 20 facet fields

Re: Best way to index without diacritics

2008-08-14 Thread Norberto Meijome
( 2 in 1 reply) On Wed, 13 Aug 2008 09:59:21 -0700 Walter Underwood <[EMAIL PROTECTED]> wrote: > Stripping accents doesn't quite work. The correct translation > is language-dependent. In German, o-dieresis should turn into > "oe", but in English, it should be "o" (as in "coöperate" or > "Mötle
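
The language dependence Walter describes can be sketched in a few lines. This is an illustration only, not an existing Solr or Lucene filter: the function `fold`, the `language` argument, and the small German table are assumptions, and real language-aware normalization would need far more rules.

```python
import unicodedata

# Assumption: a small, illustrative German transliteration table.
GERMAN_MAP = {
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
    "ß": "ss",
}


def fold(text, language="en"):
    """Remove diacritics, using German transliteration when language is 'de'."""
    # Compose first so that 'o' + combining dieresis matches the table keys.
    text = unicodedata.normalize("NFC", text)
    if language == "de":
        text = "".join(GERMAN_MAP.get(ch, ch) for ch in text)
    # Decompose, then drop the combining marks that remain.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(fold("coöperate", "en"))  # → cooperate
print(fold("schön", "de"))      # → schoen
```

The same input character gives different output depending on the language, which is why a single accent-stripping pass cannot be correct for every language.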

Re: Spellcheker and Dismax both

2008-08-14 Thread Norberto Meijome
On Thu, 14 Aug 2008 12:21:13 +0530 "Shalin Shekhar Mangar" <[EMAIL PROTECTED]> wrote: > The SpellCheckerRequestHandler is now deprecated with Solr 1.3 and it has > been replaced by SpellCheckComponent. > > http://wiki.apache.org/solr/SpellCheckComponent which works quite well with dismax. B __