Re: Spellchecker and Dismax both
On Thu, 14 Aug 2008 12:21:13 +0530 "Shalin Shekhar Mangar" <[EMAIL PROTECTED]> wrote:

> The SpellCheckerRequestHandler is now deprecated with Solr 1.3 and it has
> been replaced by SpellCheckComponent.
>
> http://wiki.apache.org/solr/SpellCheckComponent

...which works quite well with dismax.

B
_
{Beto|Norberto|Numard} Meijome

Never attribute to malice what can adequately be explained by incompetence.

I speak for myself, not my employer. Contents may be hot. Slippery when wet. Reading disclaimers makes you go blind. Writing them is worse. You have been Warned.
Re: Best way to index without diacritics
(2 in 1 reply)

On Wed, 13 Aug 2008 09:59:21 -0700 Walter Underwood <[EMAIL PROTECTED]> wrote:

> Stripping accents doesn't quite work. The correct translation
> is language-dependent. In German, o-dieresis should turn into
> "oe", but in English, it should be "o" (as in "coöperate" or
> "Mötley Crüe"). In Swedish, it should not be converted at all.

Hi Walter,
understood. This goes back to the question of language-specific field definitions / parsers... more on this below.

> There are other character-to-string conversions: ae-ligature
> to "ae", "ß" to "ss", and so on. Luckily, those are independent
> of language.
>
> wunder
>
> On 8/13/08 9:16 AM, "Steven A Rowe" <[EMAIL PROTECTED]> wrote:
>
> > Hi Norberto,
> >
> > https://issues.apache.org/jira/browse/LUCENE-1343

hi Steve,
thanks for the pointer. This is a Lucene entry... I thought the Latin filter was a Solr feature? I, for one, definitely meant a Solr filter.

Given what Walter rightly pointed out about differences in language, I suspect it would be a Solr-level thing - fieldType name="textDE" language="DE" would apply the filter of unicode chars to {ascii?} with the appropriate mapping for German, etc.

Or is this something that Lucene would / should take care of?

B
_
{Beto|Norberto|Numard} Meijome

"I've dirtied my hands writing poetry, for the sake of seduction; that is, for the sake of a useful cause." Dostoevsky

I speak for myself, not my employer. Contents may be hot. Slippery when wet. Reading disclaimers makes you go blind. Writing them is worse. You have been Warned.
List of available facet fields returned with the query results
Hi,
I have Solr set up to index technical data for a number of different types of products, and this means that different products have different facet fields available.

Here is a small example of the sort of data we are indexing; in reality there are between 10 and 20 facet fields per product, depending on its category, but a user could perform a search across more than one category.

A typical search would be:

Stage 1
1) User types a keyword
2) The matched categories from that search are displayed
3) User chooses a category
4) All results that match that category and keyword are displayed, and it's at this point I would like to display the available facets and values.

Name: Microsoft Optical Mouse for PC
DPI: 1200
Type: Laser
Category: PC
Price: 45.00

Name: Eee PC
Storage Type: Flash
Screen Size: 17
Category: Netbook
Price: 200.00

And so on.

So if I search for "PC" and then category "Netbook", I would like Solr to be able to tell me what facet fields are available without me resorting to a database to store which facet fields are available for which products. Is there any way to get Solr to return, as part of the results, a list of the facets available for the current search results? Or even better, could I get it to automatically do a facet query for each of those fields to allow drill-down queries?

The current commercial tool we use, which we are hoping Solr can replace, is called "FactFinder" and does exactly this, though I do have to have drilled down a number of times before this occurs, to stop the search attempting to return facets for every item in the index.

I suspect I am missing a trick here or making this more complicated than needed; any help or ideas much appreciated.

Thanks

Barry H

Misco is a division of Systemax Europe Ltd. Registered in Scotland Number 114143. Registered Office: Caledonian Exchange, 19a Canning Street, Edinburgh EH3 8EG. Telephone +44 (0)1933 686000.
Re: Administrative questions
On Wed, Aug 13, 2008 at 1:52 PM, Jon Drukman <[EMAIL PROTECTED]> wrote: > Duh. I should have thought of that. I'm a big fan of djbdns so I'm quite > familiar with daemontools. > > Thanks! > :) My pleasure. Was nice to hear recently that DJB is moving toward more flexible licensing terms. For anyone unfamiliar w/ daemontools, here's DJB's explanation of why they rock compared to inittab, ttys, init.d, and rc.local: http://cr.yp.to/daemontools/faq/create.html#why Jason
RE: Exception during Solr startup
Hi Yonik & Erik, Thanks to both of you. It seems like our container had some issues and was causing this problem. Thanks, Raghu -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Yonik Seeley Sent: Wednesday, August 13, 2008 10:57 AM To: solr-user@lucene.apache.org Subject: Re: Exception during Solr startup On Wed, Aug 13, 2008 at 10:55 AM, Kashyap, Raghu <[EMAIL PROTECTED]> wrote: > SEVERE: java.lang.UnsupportedClassVersionError: Bad version number in > .class file This is normally a mismatch between java compiler and runtime (like using Java6 to compile and Java5 to try and run). -Yonik
Re: List of available facet fields returned with the query results
Hi Barry,

If each category has an exclusive set of fields on which you want to facet, then you can simply facet on all facet-able fields (across all categories). The ones which are not present for the selected category will show up with zero facets, which your front-end can suppress.

However, if the total number of such fields is very large, you may be better off managing the mappings yourself for performance reasons. But, as always, first measure, then optimize :)

On Thu, Aug 14, 2008 at 7:12 PM, Barry Harding <[EMAIL PROTECTED]> wrote:

> Hi,
> I have Solr set up to index technical data for a number of different
> types of products, and this means that different products have different
> facet fields available.
>
> Here is a small example of the sort of data we are indexing; in reality
> there are between 10 and 20 facet fields per product, depending on its
> category, but a user could perform a search across more than one category.
>
> A typical search would be:
>
> Stage 1
> 1) User types a keyword
> 2) The matched categories from that search are displayed
> 3) User chooses a category
> 4) All results that match that category and keyword are displayed, and
> it's at this point I would like to display the available facets and
> values.
>
> Name: Microsoft Optical Mouse for PC
> DPI: 1200
> Type: Laser
> Category: PC
> Price: 45.00
>
> Name: Eee PC
> Storage Type: Flash
> Screen Size: 17
> Category: Netbook
> Price: 200.00
>
> And so on.
>
> So if I search for "PC" and then category "Netbook", I would like Solr
> to be able to tell me what facet fields are available without me
> resorting to a database to store which facet fields are available for
> which products. Is there any way to get Solr to return, as part of the
> results, a list of the facets available for the current search results?
> Or even better, could I get it to automatically do a facet query for
> each of those fields to allow drill-down queries?
>
> The current commercial tool we use, which we are hoping Solr can
> replace, is called "FactFinder" and does exactly this, though I do have
> to have drilled down a number of times before this occurs, to stop the
> search attempting to return facets for every item in the index.
>
> I suspect I am missing a trick here or making this more complicated
> than needed; any help or ideas much appreciated.
>
> Thanks
>
> Barry H

--
Regards,
Shalin Shekhar Mangar.
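For illustration, the facet-on-everything approach can be a single request along these lines (the field names are borrowed from Barry's example data and are therefore assumptions - real schemas would need URL-safe names):

select?q=PC&fq=Category:Netbook&facet=true&facet.field=DPI&facet.field=Type&facet.field=Storage_Type&facet.field=Screen_Size&facet.mincount=1

Here facet.mincount=1 drops the zero-count values on the Solr side; fields that come back with an empty value list are the ones that don't apply to the selected category, if you'd rather not do that suppression in the front-end.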
RE: Best way to index without diacritics
Hi Norberto,

On 08/14/2008 at 8:10 AM, Norberto Meijome wrote:
> > On 8/13/08 9:16 AM, "Steven A Rowe" <[EMAIL PROTECTED]> wrote:
> > > Hi Norberto,
> > >
> > > https://issues.apache.org/jira/browse/LUCENE-1343
>
> hi Steve,
> thanks for the pointer. this is a Lucene entry... I thought the
> Latin-filter was a SOLR feature? I, for one, definitely meant a SOLR filter.

A fair portion of Solr is a set of wrappers over Lucene functionality. ISOLatin1AccentFilterFactory, for example, wraps Lucene's ISOLatin1AccentFilter. Here is the entirety of the Solr code:

public class ISOLatin1AccentFilterFactory extends BaseTokenFilterFactory {
  public ISOLatin1AccentFilter create(TokenStream input) {
    return new ISOLatin1AccentFilter(input);
  }
}

Of course, BaseTokenFilterFactory brings more to the party, but my point is that adding Lucene filters to Solr is generally a trivial exercise - a Solr ...FilterFactory around LUCENE-1343 would not be much longer than the four lines listed above, since the configuration aspects are already handled by BaseTokenFilterFactory.

> Given what Walter rightly pointed out about differences in language, I suspect
> it would be a SOLR-level thing - fieldType name="textDE" language="DE" would
> apply the filter of unicode chars to {ascii?} with the appropriate mapping
> for German, etc.
>
> Or is this that Lucene would / should take care of ?

The kind of filter Walter is talking about - a generalized language-aware character normalization Solr/Lucene filter - does not yet exist. My guess is that if/when it does materialize, both the Solr and the Lucene projects will want to have it. Historically, most functionality shared by Solr and Lucene eventually ends up hosted by Lucene, since Solr has a Lucene dependency, but not vice-versa.

So, yes, Solr would be responsible for hosting configuration for such a filter, but doing something with that configuration would be Lucene's responsibility, assuming that Lucene would (eventually) host the filter and Solr would host a factory over the filter.

Steve
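For illustration, a factory over the LUCENE-1343 filter might look like the following - a sketch only, since the Lucene-side class name (UnicodeNormalizationFilter here) is an assumption; the issue's filter has no released name yet:

public class UnicodeNormalizationFilterFactory extends BaseTokenFilterFactory {
  public TokenStream create(TokenStream input) {
    // wrap the incoming stream with the (hypothetical) normalization filter
    return new UnicodeNormalizationFilter(input);
  }
}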
RE: List of available facet fields returned with the query results
Hi Shalin,

As there is certainly the potential for several thousand different attribute types across all of our categories, I guess I will have to manage them myself (I was hoping for a short-cut, or that I was missing a trick), but no problem. Solr still seems to outperform the commercial package we are using.

Thanks

barry

-----Original Message-----
From: Shalin Shekhar Mangar [mailto:[EMAIL PROTECTED]
Sent: 14 August 2008 16:06
To: solr-user@lucene.apache.org
Subject: Re: List of available facet fields returned with the query results

Hi Barry,

If each category has an exclusive set of fields on which you want to facet, then you can simply facet on all facet-able fields (across all categories). The ones which are not present for the selected category will show up with zero facets, which your front-end can suppress.

However, if the total number of such fields is very large, you may be better off managing the mappings yourself for performance reasons. But, as always, first measure, then optimize :)

On Thu, Aug 14, 2008 at 7:12 PM, Barry Harding <[EMAIL PROTECTED]> wrote:

> Hi,
> I have Solr set up to index technical data for a number of different
> types of products, and this means that different products have different
> facet fields available.
>
> Here is a small example of the sort of data we are indexing; in reality
> there are between 10 and 20 facet fields per product, depending on its
> category, but a user could perform a search across more than one category.
>
> A typical search would be:
>
> Stage 1
> 1) User types a keyword
> 2) The matched categories from that search are displayed
> 3) User chooses a category
> 4) All results that match that category and keyword are displayed, and
> it's at this point I would like to display the available facets and
> values.
>
> Name: Microsoft Optical Mouse for PC
> DPI: 1200
> Type: Laser
> Category: PC
> Price: 45.00
>
> Name: Eee PC
> Storage Type: Flash
> Screen Size: 17
> Category: Netbook
> Price: 200.00
>
> And so on.
>
> So if I search for "PC" and then category "Netbook", I would like Solr
> to be able to tell me what facet fields are available without me
> resorting to a database to store which facet fields are available for
> which products. Is there any way to get Solr to return, as part of the
> results, a list of the facets available for the current search results?
> Or even better, could I get it to automatically do a facet query for
> each of those fields to allow drill-down queries?
>
> The current commercial tool we use, which we are hoping Solr can
> replace, is called "FactFinder" and does exactly this, though I do have
> to have drilled down a number of times before this occurs, to stop the
> search attempting to return facets for every item in the index.
>
> I suspect I am missing a trick here or making this more complicated
> than needed; any help or ideas much appreciated.
>
> Thanks
>
> Barry H

--
Regards,
Shalin Shekhar Mangar.
Re: Index size vs. number of documents
Erick Erickson wrote:
> I'm surprised, as you are, by the non-linearity. Out of curiosity, what
> is your MaxFieldLength? By default only the first 10,000 tokens are
> added to a field per document. If you haven't set this higher, that
> could account for it.

We set it to a very large number so we index the entire document.

> As far as I know, optimization shouldn't really affect the index size
> if you are not deleting documents, but I'm no expert in that area.
>
> I've indexed OCR data and it's no fun for the reasons you cite. We had
> better results searching if we cleaned the data at index time. By
> "cleaning" I mean we took out all of the characters that *couldn't* be
> indexed. What *can't* be indexed depends upon your requirements, but in
> our case we could just use the low ASCII characters by folding all the
> accented characters into their low-ASCII counterparts, because we had
> no need for native-language support. And we also replaced most
> non-printing characters with spaces. A legitimate question is whether
> indexing single characters makes sense (in our case, genealogy, it
> actually does. Sggghhh)

Fortunately, non-printing characters are not a problem, but we need native-language query support, so limiting to US-ASCII will not work for us. One possibility is to identify the dominant language in the document and use dictionaries to remove junk; however, proper names are a big problem with that approach. Another might be to use heuristics like removing "words" with numbers in the middle of them. Whatever we do will have to be fast.

> In a mixed-language environment, this provided surprisingly good
> results given how crude the transformations were. Of course it's
> totally unacceptable to mangle non-English text this crudely if you
> must support native-language searching.

Yes.

> I'd be interested in how this changes your index size if you do decide
> to try it. There's nothing like having somebody else do research for me.
>
> Best
> Erick
>
> On Wed, Aug 13, 2008 at 1:45 PM, Phillip Farber <[EMAIL PROTECTED]> wrote:
>
>> We're indexing the OCR for a large number of books. Our experimental
>> schema is simple: an id field and an OCR text field (not stored).
>>
>> Currently we just have two data points:
>>
>> 3005 documents = 723 MB index
>> 174237 documents = 51460 MB index
>>
>> These indexes are not optimized. If the index size were a linear
>> function of the number of documents, then based on just these two data
>> points you'd expect the index for 174237 docs to be approximately
>> 57.98 times larger than 723 MB, or about 41921 MB. Actually it's
>> 51460, or about 22% bigger.
>>
>> I suspect the non-linear increase is due to dirty OCR that continually
>> increases the number of unique words that need to be indexed. Another
>> possibility is that the larger index has a higher proportion of
>> documents containing characters from non-Latin alphabets, thereby
>> increasing the number of unique words. I can't verify that at this
>> point.
>>
>> Are these reasonable assumptions or am I missing other factors that
>> could contribute to the non-linear growth in index size?
>>
>> Regards,
>> Phil
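For illustration, the crude index-time cleaning described above might look something like this (a sketch under stated assumptions - the exact character ranges kept, and the name OcrCleaner, are not from the original posts):

public class OcrCleaner {
    /**
     * Crude OCR cleaning: keep printable low-ASCII characters, replace
     * non-printing/control characters with spaces, and drop everything
     * else. Folding accented characters to their low-ASCII counterparts
     * is assumed to have happened before this step.
     */
    public static String clean(String ocr) {
        StringBuilder sb = new StringBuilder(ocr.length());
        for (int i = 0; i < ocr.length(); i++) {
            char c = ocr.charAt(i);
            if (c >= 0x20 && c < 0x7F) {
                sb.append(c);   // printable low ASCII: keep as-is
            } else if (Character.isWhitespace(c) || Character.isISOControl(c)) {
                sb.append(' '); // non-printing: replace with a space
            }
            // anything else is dropped outright
        }
        return sb.toString();
    }
}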
Synonyms help in 1.3-HEAD?
Hello folks!

Having a heck of a time trying to get a synonyms file to work properly. It seems that something's wrong with the way it's been set up, but, honestly, I can't see anything wrong with it. Some samples...

This works...
zutanoapparel => zutano

But this does not...
aadias, aadidas, aaidas, adadas, adaddas, adaddis, adadias, adadis, adaidas, adaies, addedas, addedis, addidaas, addidads, addidais, addidas, addidascom, addiddas, addides, addidis, adeadas, adedas, adeddas, adedias, adiada, adiadas, adiadis, adiads, adida, adidaas, adidas1, adidass, adidaz, adidda, adiddas, adiddias, adidias, adidis, adiidas, aditas, adudas, afidas, aididas, wwwadidascom => adidas

This works...
liumiani, loomiani, lumaini, lumanai, lumani, lumiami, lumian, lumiana, lumianai, lumiari, luminani, lumini, luminiani => lumiani

But this does not...
clegerie, cleregie, clergerie, clergie, robertclaregie, robert claregie, robertclargeries, robert clargeries, robertclegerie, robert clegerie, robertcleregie, robert cleregie, robertclergeic, robert clergeic, robertclergerie, robertclergi, robert clergi, robertclergie, robert clergie, robertclergoe, robert clergoe, robertclerige, robert clerige, robertclerterie, robert clerterie => Robert Clergerie

This is how they're set up in my schema..

<filter class="solr.SynonymFilterFactory" synonyms="..." ignoreCase="true" expand="true"/>

Is there a limit to the number of terms in the list of options? It seems that the ones with shorter lists work, while the longer lists don't. I'm at a loss as to why though..

Thanks for your time!

Matthew Runo
Software Engineer, Zappos.com
[EMAIL PROTECTED] - 702-943-7833
Re: Synonyms help in 1.3-HEAD?
There should be no limit, so you may have uncovered a bug. Could you open a JIRA issue? If it's a real bug, it should get fixed before 1.3. -Yonik On Thu, Aug 14, 2008 at 12:35 PM, Matthew Runo <[EMAIL PROTECTED]> wrote: > Hello folks! > > Having a heck of a time trying to get a synonyms file to work properly. It > seems that something's wrong with the way it's been set up, but, honestly, I > can't see anything wrong with it. Some samples... > > This works... > zutanoapparel => zutano > > But this does not... > aadias, aadidas, aaidas, adadas, adaddas, adaddis, adadias, adadis, adaidas, > adaies, addedas, addedis, addidaas, addidads, addidais, addidas, addidascom, > addiddas, addides, addidis, adeadas, adedas, adeddas, adedias, adiada, > adiadas, adiadis, adiads, adida, adidaas, adidas1, adidass, adidaz, adidda, > adiddas, adiddias, adidias, adidis, adiidas, aditas, adudas, afidas, > aididas, wwwadidascom => adidas > > This works... > liumiani, loomiani, lumaini, lumanai, lumani, lumiami, lumian, lumiana, > lumianai, lumiari, luminani, lumini, luminiani => lumiani > > But this does not... > clegerie, cleregie, clergerie, clergie, robertclaregie, robert claregie, > robertclargeries, robert clargeries, robertclegerie, robert clegerie, > robertcleregie, robert cleregie, robertclergeic, robert clergeic, > robertclergerie, robertclergi, robert clergi, robertclergie, robert clergie, > robertclergoe, robert clergoe, robertclerige, robert clerige, > robertclerterie, robert clerterie => Robert Clergerie > > This is how they're set up in my schema.. > ignoreCase="true" expand="true"/> > > Is there a limit to the number of terms in the list of options? It seems > that the ones that are shorter work, while the longer lists don't. I'm at a > loss as to why though.. > > Thanks for your time! > > Matthew Runo > Software Engineer, Zappos.com > [EMAIL PROTECTED] - 702-943-7833 > >
Re: Synonyms help in 1.3-HEAD?
Thank you for your suggestion, I really don't see anything 'wrong' with the longer lists.. I entered https://issues.apache.org/jira/browse/SOLR-702 for this issue, and attached relevant files. If you need anything more, don't hesitate to contact me! Thanks for your time! Matthew Runo Software Engineer, Zappos.com [EMAIL PROTECTED] - 702-943-7833 On Aug 14, 2008, at 10:16 AM, Yonik Seeley wrote: There should be no limit, so you may have uncovered a bug. Could you open a JIRA issue? If it's a real bug, it should get fixed before 1.3. -Yonik On Thu, Aug 14, 2008 at 12:35 PM, Matthew Runo <[EMAIL PROTECTED]> wrote: Hello folks! Having a heck of a time trying to get a synonyms file to work properly. It seems that something's wrong with the way it's been set up, but, honestly, I can't see anything wrong with it. Some samples... This works... zutanoapparel => zutano But this does not... aadias, aadidas, aaidas, adadas, adaddas, adaddis, adadias, adadis, adaidas, adaies, addedas, addedis, addidaas, addidads, addidais, addidas, addidascom, addiddas, addides, addidis, adeadas, adedas, adeddas, adedias, adiada, adiadas, adiadis, adiads, adida, adidaas, adidas1, adidass, adidaz, adidda, adiddas, adiddias, adidias, adidis, adiidas, aditas, adudas, afidas, aididas, wwwadidascom => adidas This works... liumiani, loomiani, lumaini, lumanai, lumani, lumiami, lumian, lumiana, lumianai, lumiari, luminani, lumini, luminiani => lumiani But this does not... clegerie, cleregie, clergerie, clergie, robertclaregie, robert claregie, robertclargeries, robert clargeries, robertclegerie, robert clegerie, robertcleregie, robert cleregie, robertclergeic, robert clergeic, robertclergerie, robertclergi, robert clergi, robertclergie, robert clergie, robertclergoe, robert clergoe, robertclerige, robert clerige, robertclerterie, robert clerterie => Robert Clergerie This is how they're set up in my schema.. Is there a limit to the number of terms in the list of options? It seems that the ones that are shorter work, while the longer lists don't. I'm at a loss as to why though.. Thanks for your time! Matthew Runo Software Engineer, Zappos.com [EMAIL PROTECTED] - 702-943-7833
Duplicate Data Across Fields
I have 2 fields which will sometimes contain the same data. When they do contain the same data, am I paying the same performance cost as when they contain unique data? I think the real question here is: does Lucene index values per field, or per document? -- View this message in context: http://www.nabble.com/Duplicate-Data-Across-Fields-tp18986515p18986515.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: spellcheck collation
I believe I just fixed this on SOLR-606 (thanks to Stefan's patch). Give it a try and let us know. -Grant On Aug 13, 2008, at 2:25 PM, Doug Steigerwald wrote: I've noticed a few things with the new spellcheck component that seem a little strange. Here's my document: 5 wii blackberry blackjack creative labs zen ipod video nano Some sample queries: http://localhost:8983/solr/core1/spellCheckCompRH?q=blackberri+wi&spellcheck=true&spellcheck.collate=true http://localhost:8983/solr/core1/spellCheckCompRH?q=blackberr+wi&spellcheck=true&spellcheck.collate=true http://localhost:8983/solr/core1/spellCheckCompRH?q=blackber+wi&spellcheck=true&spellcheck.collate=true When spellchecking 'blackberri wi', the collation returned is 'blackberry wii'. When spellchecking 'blackberr wi', the collation returned is 'blackberrywii'. 'blackber wi' returns 'blackberrwiiwi'. Doug
Re: spellcheck collation
I'd try, but the build is failing from (guessing) Ryan's last commit:

compile:
    [mkdir] Created dir: /Users/dsteiger/Desktop/java/solr/build/core
    [javac] Compiling 337 source files to /Users/dsteiger/Desktop/java/solr/build/core
    [javac] /Users/dsteiger/Desktop/java/solr/client/java/solrj/src/org/apache/solr/client/solrj/embedded/EmbeddedSolrServer.java:129: cannot find symbol
    [javac] symbol  : method isEnabled()
    [javac] location: class org.apache.solr.core.CoreContainer
    [javac]     multicore.isEnabled() ) {

Doug

On Aug 14, 2008, at 2:24 PM, Grant Ingersoll wrote:

> I believe I just fixed this on SOLR-606 (thanks to Stefan's patch).
> Give it a try and let us know.
>
> -Grant
>
> On Aug 13, 2008, at 2:25 PM, Doug Steigerwald wrote:
>
>> I've noticed a few things with the new spellcheck component that seem
>> a little strange.
>>
>> Here's my document:
>> 5 wii blackberry blackjack creative labs zen ipod video nano
>>
>> Some sample queries:
>> http://localhost:8983/solr/core1/spellCheckCompRH?q=blackberri+wi&spellcheck=true&spellcheck.collate=true
>> http://localhost:8983/solr/core1/spellCheckCompRH?q=blackberr+wi&spellcheck=true&spellcheck.collate=true
>> http://localhost:8983/solr/core1/spellCheckCompRH?q=blackber+wi&spellcheck=true&spellcheck.collate=true
>>
>> When spellchecking 'blackberri wi', the collation returned is
>> 'blackberry wii'. When spellchecking 'blackberr wi', the collation
>> returned is 'blackberrywii'. 'blackber wi' returns 'blackberrwiiwi'.
>>
>> Doug
More files in index directory than expected
It's my understanding that if my mergeFactor is 10, then there shouldn't be more than 11 segments in my index directory (10 segments, plus an additional segment if a merge is in progress). It would seem to follow that there shouldn't be more than 11 fdt files, 11 tis files, etc.. However, I'm looking at one of my indexes now, and this doesn't seem to be the case. Here are the tis files for this index, for instance: 07/22/2008 07:49 PM77,925,180 _1je.tis 07/23/2008 02:57 AM65,988,651 _256.tis 07/23/2008 04:18 AM13,159,578 _29t.tis 07/23/2008 05:08 AM10,146,941 _2cw.tis 07/23/2008 05:39 AM 6,749,665 _2el.tis 07/23/2008 06:24 AM12,274,012 _2he.tis 07/23/2008 07:01 AM14,069,531 _2kh.tis 07/23/2008 07:53 AM13,795,213 _2nu.tis 07/23/2008 08:20 AM 6,284,902 _2p0.tis 07/23/2008 08:27 AM 1,980,945 _2p9.tis 07/23/2008 08:36 AM 1,674,640 _2pk.tis 07/23/2008 08:37 AM 311,483 _2pl.tis 07/23/2008 08:38 AM 285,881 _2pm.tis 07/23/2008 08:39 AM 245,138 _2pn.tis 07/23/2008 08:40 AM 116,881 _2po.tis 07/17/2008 11:22 PM69,635,905 _rp.tis 07/18/2008 12:59 AM15,883,866 _xu.tis There are 17 of these files. (File sizes are in bytes.) When I open up the index in Luke, it says all of them are "In Use" and it doesn't list any of them as "Deletable". This seems to rule out the possibility that Solr/Lucene somehow "forget" to clean up files that were no longer in use. I'm noticing that _2pk, _2pl, _2pm, _2pn, _2po are sequential file names, alphabetically speaking, and their last modified times are very close to one another. Does this mean they're actually part of the same segment, even though they are in separate files? If those files are indeed part of a single segment, then the number of segments represented by these files would really be 17-4=13. But that's still more than the expected 11 segments. I just discovered that one of my other indexes has over 11,000 tis files. That's disturbing. I'm not sure if it would have the same underlying cause. Any ideas?
Re: spellcheck collation
have you updated recently? isEnabled() was removed last night... On Aug 14, 2008, at 2:30 PM, Doug Steigerwald wrote: I'd try, but the build is failing from (guessing) Ryan's last commit: compile: [mkdir] Created dir: /Users/dsteiger/Desktop/java/solr/build/core [javac] Compiling 337 source files to /Users/dsteiger/Desktop/ java/solr/build/core [javac] /Users/dsteiger/Desktop/java/solr/client/java/solrj/src/ org/apache/solr/client/solrj/embedded/EmbeddedSolrServer.java:129: cannot find symbol [javac] symbol : method isEnabled() [javac] location: class org.apache.solr.core.CoreContainer [javac] multicore.isEnabled() ) { Doug On Aug 14, 2008, at 2:24 PM, Grant Ingersoll wrote: I believe I just fixed this on SOLR-606 (thanks to Stefan's patch). Give it a try and let us know. -Grant On Aug 13, 2008, at 2:25 PM, Doug Steigerwald wrote: I've noticed a few things with the new spellcheck component that seem a little strange. Here's my document: 5 wii blackberry blackjack creative labs zen ipod video nano Some sample queries: http://localhost:8983/solr/core1/spellCheckCompRH?q=blackberri+wi&spellcheck=true&spellcheck.collate=true http://localhost:8983/solr/core1/spellCheckCompRH?q=blackberr+wi&spellcheck=true&spellcheck.collate=true http://localhost:8983/solr/core1/spellCheckCompRH?q=blackber+wi&spellcheck=true&spellcheck.collate=true When spellchecking 'blackberri wi', the collation returned is 'blackberry wii'. When spellchecking 'blackberr wi', the collation returned is 'blackberrywii'. 'blackber wi' returns 'blackberrwiiwi'. Doug
Re: term list
Humm, I am new to the world of search. I am looking for something that will give me a list of significant words or phrases extracted from a document stored in Solr.

Jack

On Fri, Aug 8, 2008 at 9:33 AM, Grant Ingersoll <[EMAIL PROTECTED]> wrote:

> See https://issues.apache.org/jira/browse/SOLR-651. I've got some of
> this coded up and hope to have a patch soon.
>
> Or, do you mean, is there a way to get the terms the MLT uses to
> generate the new query?
>
> On Aug 5, 2008, at 8:41 PM, Jack Tuhman wrote:
>
>> Hi all,
>>
>> is there a way to get key terms from an item? If each item has an id,
>> can I pass that ID to a search and get back the key terms like you can
>> with the mlt filter.
>>
>> Does this make sense?
>>
>> Jack
IndexOutOfBoundsException
Hi, I have rebuilt my index a few times (it should get up to about 4 Million but around 1 Million it starts to fall apart). Exception in thread "Lucene Merge Thread #0" org.apache.lucene.index.MergePolicy$MergeException: java.lang.IndexOutOfBoundsException: Index: 105, Size: 33 at org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:323) at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:300) Caused by: java.lang.IndexOutOfBoundsException: Index: 105, Size: 33 at java.util.ArrayList.rangeCheck(ArrayList.java:572) at java.util.ArrayList.get(ArrayList.java:350) at org.apache.lucene.index.FieldInfos.fieldInfo(FieldInfos.java:260) at org.apache.lucene.index.FieldsReader.doc(FieldsReader.java:188) at org.apache.lucene.index.SegmentReader.document(SegmentReader.java:670) at org.apache.lucene.index.SegmentMerger.mergeFields(SegmentMerger.java:349) at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:134) at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:3998) at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3650) at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:214) at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:269) When this happens, the disk usage goes right up and the indexing really starts to slow down. I am using a Solr build from about a week ago - so my Lucene is at 2.4 according to the war files. Has anyone seen this error before? Is it possible to tell which Array is too large? Would it be an Array I am sending in or another internal one? Regards, Ian Connor
Highlighting returns incorrect text on some results?
This is kind of a strange issue, but when I submit a query and ask for highlighting back, sometimes the highlighted text includes a question mark at the beginning, although a question mark character does not appear in the field that the highlighted text is taken from. I've put some sample XML output on the web at http://ucair.cs.uiuc.edu/pdovyda2/problem.xml If you look at the first and third highlights, you'll see what I'm talking about. Besides looking a bit odd, it is causing my application to break because the highlighted field is multivalued, and I was doing text matching to determine which of the values was chosen for highlighting. Is this actually a bug, or have I just misconfigured something? By the way, I am using the 1.2 release, I have not yet tried out a nightly build to see if this is an old problem. Thanks, Paul -- View this message in context: http://www.nabble.com/Highlighting-returns-incorrect-text-on-some-results--tp18987598p18987598.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: spellcheck collation
Right before I sent the message. Did a 'svn up src/;and clean;ant dist' and it failed. Seems to work fine now. On Aug 14, 2008, at 2:38 PM, Ryan McKinley wrote: have you updated recently? isEnabled() was removed last night... On Aug 14, 2008, at 2:30 PM, Doug Steigerwald wrote: I'd try, but the build is failing from (guessing) Ryan's last commit: compile: [mkdir] Created dir: /Users/dsteiger/Desktop/java/solr/build/core [javac] Compiling 337 source files to /Users/dsteiger/Desktop/ java/solr/build/core [javac] /Users/dsteiger/Desktop/java/solr/client/java/solrj/src/ org/apache/solr/client/solrj/embedded/EmbeddedSolrServer.java:129: cannot find symbol [javac] symbol : method isEnabled() [javac] location: class org.apache.solr.core.CoreContainer [javac] multicore.isEnabled() ) { Doug On Aug 14, 2008, at 2:24 PM, Grant Ingersoll wrote: I believe I just fixed this on SOLR-606 (thanks to Stefan's patch). Give it a try and let us know. -Grant On Aug 13, 2008, at 2:25 PM, Doug Steigerwald wrote: I've noticed a few things with the new spellcheck component that seem a little strange. Here's my document: 5 wii blackberry blackjack creative labs zen ipod video nano Some sample queries: http://localhost:8983/solr/core1/spellCheckCompRH?q=blackberri+wi&spellcheck=true&spellcheck.collate=true http://localhost:8983/solr/core1/spellCheckCompRH?q=blackberr+wi&spellcheck=true&spellcheck.collate=true http://localhost:8983/solr/core1/spellCheckCompRH?q=blackber+wi&spellcheck=true&spellcheck.collate=true When spellchecking 'blackberri wi', the collation returned is 'blackberry wii'. When spellchecking 'blackberr wi', the collation returned is 'blackberrywii'. 'blackber wi' returns 'blackberrwiiwi'. Doug
Re: term list
Assuming you mean significant in the traditional IR sense, I would start with MoreLikeThis. See http://wiki.apache.org/solr/MoreLikeThisHandler - in particular the mlt.interestingTerms option.

As for phrases, that is a bit harder. You could try playing around with token-based n-grams (called Shingles) and MoreLikeThis together, for starters, I think.

If you have some other notion of "significant" in relation to language in general, then you've got quite a bit more work to do, most of which is way beyond the scope of Solr (although it could plug in to Solr nicely).

HTH,
Grant

On Aug 14, 2008, at 2:43 PM, Jack Tuhman wrote:

> Humm, I am new to the world of search. I am looking for something that
> will give me a list of significant words or phrases extracted from a
> document stored in Solr.
>
> Jack
>
> On Fri, Aug 8, 2008 at 9:33 AM, Grant Ingersoll <[EMAIL PROTECTED]> wrote:
>
>> See https://issues.apache.org/jira/browse/SOLR-651. I've got some of
>> this coded up and hope to have a patch soon.
>>
>> Or, do you mean, is there a way to get the terms the MLT uses to
>> generate the new query?
>>
>> On Aug 5, 2008, at 8:41 PM, Jack Tuhman wrote:
>>
>>> Hi all,
>>>
>>> is there a way to get key terms from an item? If each item has an id,
>>> can I pass that ID to a search and get back the key terms like you
>>> can with the mlt filter.
>>>
>>> Does this make sense?
>>>
>>> Jack
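For example, something along these lines (the handler path and field name are assumptions - both depend on how the MoreLikeThis handler is registered in your solrconfig.xml and on your schema):

http://localhost:8983/solr/mlt?q=id:12345&mlt.fl=text&mlt.mintf=1&mlt.mindf=1&mlt.interestingTerms=details

With mlt.interestingTerms=details you get the "interesting" terms back along with their relative boosts; mlt.interestingTerms=list returns just the terms.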
QueryResultsCache and DocSet filter
We have a bunch of user caches that return DocSet objects. So, we intersect them and send a DocSet filter and the actual query to getDocListAndSet or getDocList.

The problem here is that the calls in SolrIndexSearcher don't appear to use the QueryResultsCache if the filter is a DocSet rather than a List. So, the end result is that our new version of Solr is slower than our old version.

Our old version cached Query objects instead of DocSets. However, it had quite a few other problems. I thought it would be an improvement to use DocSets.

Is it recommended to re-work the caches to return Query objects and use that as my filterList?

Our index size is about a million documents. Our typical result set size is about 15, but can occasionally be in the range of 5,000-15,000.
NOTICE - solrj MultiCore{Params/Request/Response} have been renamed CoreAdmin{Params/Request/Response}
In the effort to clean up confusion around MultiCore usage, we have renamed the classes that handle runtime core administration from "MultiCoreX" to "CoreAdminX".

Additionally, the path that the default CoreAdminRequest expects to hit is /admin/cores rather than /admin/multicore -- if you have an existing solr.xml file, you may want to update the "adminPath" attribute.

For a detailed change log, see:
http://svn.apache.org/viewvc?view=rev&revision=685989

ryan
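For example, a core status request now looks like this (host/port assume the example Jetty setup, with adminPath left at its default):

http://localhost:8983/solr/admin/cores?action=STATUS

Other actions (CREATE, RELOAD, SWAP, ...) follow the same pattern.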
Re: IndexOutOfBoundsException
Yikes... not good. This shouldn't be due to anything you did wrong Ian... it looks like a lucene bug. Some questions: - what platform are you running on, and what JVM? - are you using multicore? (I fixed some index locking bugs recently) - are there any exceptions in the log before this? - how reproducible is this? -Yonik On Thu, Aug 14, 2008 at 2:47 PM, Ian Connor <[EMAIL PROTECTED]> wrote: > Hi, > > I have rebuilt my index a few times (it should get up to about 4 > Million but around 1 Million it starts to fall apart). > > Exception in thread "Lucene Merge Thread #0" > org.apache.lucene.index.MergePolicy$MergeException: > java.lang.IndexOutOfBoundsException: Index: 105, Size: 33 >at > org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:323) >at > org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:300) > Caused by: java.lang.IndexOutOfBoundsException: Index: 105, Size: 33 >at java.util.ArrayList.rangeCheck(ArrayList.java:572) >at java.util.ArrayList.get(ArrayList.java:350) >at org.apache.lucene.index.FieldInfos.fieldInfo(FieldInfos.java:260) >at org.apache.lucene.index.FieldsReader.doc(FieldsReader.java:188) >at > org.apache.lucene.index.SegmentReader.document(SegmentReader.java:670) >at > org.apache.lucene.index.SegmentMerger.mergeFields(SegmentMerger.java:349) >at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:134) >at > org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:3998) >at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3650) >at > org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:214) >at > org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:269) > > > When this happens, the disk usage goes right up and the indexing > really starts to slow down. I am using a Solr build from about a week > ago - so my Lucene is at 2.4 according to the war files. > > Has anyone seen this error before? Is it possible to tell which Array > is too large? Would it be an Array I am sending in or another internal > one? > > Regards, > Ian Connor >
Re: QueryResultsCache and DocSet filter
On Thu, Aug 14, 2008 at 3:15 PM, Kevin Osborn <[EMAIL PROTECTED]> wrote: > The problem here is that the calls in SolrIndexSearcher don't appear to use > the QueryResultsCache if the filer is a DocSet rather than a List. Right... using a DocSet as part of the cache key would be pretty slow (key comparisons) and more memory intensive. > Is it recomended to re-work the caches to return Query objects and use that > as my filterList? Yes, that should work. -Yonik
Re: More files in index directory than expected
Chris Harris <[EMAIL PROTECTED]> wrote: > It's my understanding that if my mergeFactor is 10, then there > shouldn't be more than 11 segments in my index directory (10 segments, > plus an additional segment if a merge is in progress). Actually, mergeFactor 10 means each *level* will have <= 10 segments, where a level is roughly 10X the size of the previous level. EG after 10 segments (level 0) are flushed, they get merged into a single level 1 segment. Another 10 produces another level 1 segment. Etc. Until you have 10 level 1 segments, which then get merged into a single level 2 segment. The number of levels you have is logarithmic in the size of your index. > I'm noticing that _2pk, _2pl, _2pm, _2pn, _2po are sequential file > names, alphabetically speaking, and their last modified times are very > close to one another. Does this mean they're actually part of the same > segment, even though they are in separate files? No, these are different segments, just flushed shortly after one another in time. > I just discovered that one of my other indexes has over 11,000 tis > files. That's disturbing. I'm not sure if it would have the same > underlying cause. That does NOT sound right. Can you provide more details how this index is created/maintained? Mike
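As a toy model of that arithmetic (an illustration only - it ignores flush-size variation, deletes, and merges in progress), after n flushes each base-mergeFactor digit of n is roughly the number of live segments at that level:

public class MergeLevels {
    /** Rough model: segment count after 'flushes' flushes is the
     *  base-mergeFactor digit sum of the flush count. */
    public static int segmentCount(int flushes, int mergeFactor) {
        int segments = 0;
        while (flushes > 0) {
            segments += flushes % mergeFactor; // segments left at this level
            flushes /= mergeFactor;            // merged up into the next level
        }
        return segments;
    }

    public static void main(String[] args) {
        System.out.println(segmentCount(99, 10));  // 18 segments across two levels
        System.out.println(segmentCount(170, 10)); // 8 segments: digits 1+7+0
    }
}

So a directory with 17 .tis files is well within what mergeFactor 10 allows once the index spans a few levels.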
Re: Highlighting returns incorrect text on some results?
Paul, we had many highlighter-related changes since 1.2, so I suggest you try the nightly. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message > From: pdovyda2 <[EMAIL PROTECTED]> > To: solr-user@lucene.apache.org > Sent: Thursday, August 14, 2008 2:56:42 PM > Subject: Highlighting returns incorrect text on some results? > > > This is kind of a strange issue, but when I submit a query and ask for > highlighting back, sometimes the highlighted text includes a question mark > at the beginning, although a question mark character does not appear in the > field that the highlighted text is taken from. > > I've put some sample XML output on the web at > http://ucair.cs.uiuc.edu/pdovyda2/problem.xml > If you look at the first and third highlights, you'll see what I'm talking > about. > > Besides looking a bit odd, it is causing my application to break because the > highlighted field is multivalued, and I was doing text matching to determine > which of the values was chosen for highlighting. > > Is this actually a bug, or have I just misconfigured something? By the way, > I am using the 1.2 release, I have not yet tried out a nightly build to see > if this is an old problem. > > Thanks, > Paul > -- > View this message in context: > http://www.nabble.com/Highlighting-returns-incorrect-text-on-some-results--tp18987598p18987598.html > Sent from the Solr - User mailing list archive at Nabble.com.
Re: QueryResultsCache and DocSet filter
The DocSet isn't part of the cache key. The key is usually just a simple string (e.g. companyId). They just return a DocSet. I think the user caches are fine. This DocSet is then used as a filter for the actual query. I believe it is this step that is slow. However, I am guessing that the solution is still to have the user caches return a Query object so that I can supply a List to SolrIndexSearcher, causing it to use the QueryResultsCache. Correct? - Original Message From: Yonik Seeley <[EMAIL PROTECTED]> To: solr-user@lucene.apache.org Sent: Thursday, August 14, 2008 1:41:50 PM Subject: Re: QueryResultsCache and DocSet filter On Thu, Aug 14, 2008 at 3:15 PM, Kevin Osborn <[EMAIL PROTECTED]> wrote: > The problem here is that the calls in SolrIndexSearcher don't appear to use > the QueryResultsCache if the filer is a DocSet rather than a List. Right... using a DocSet as part of the cache key would be pretty slow (key comparisons) and more memory intensive. > Is it recomended to re-work the caches to return Query objects and use that > as my filterList? Yes, that should work. -Yonik
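For what it's worth, a sketch of the Query-based approach under discussion (it assumes the SolrIndexSearcher.getDocList overload that takes a filter list; the class and the companyId field name are illustrative):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.index.Term;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.TermQuery;
import org.apache.solr.search.DocList;
import org.apache.solr.search.SolrIndexSearcher;

public class CompanyFilteredSearch {
    /** Cache/pass the company restriction as a Query so that
     *  SolrIndexSearcher can consult its queryResultCache. */
    public DocList search(SolrIndexSearcher searcher, Query userQuery,
                          String companyId) throws IOException {
        Query companyFilter = new TermQuery(new Term("companyId", companyId));
        List<Query> filters = new ArrayList<Query>();
        filters.add(companyFilter);
        // the List<Query> filter path is cached; the DocSet filter path is not
        return searcher.getDocList(userQuery, filters, Sort.RELEVANCE, 0, 15, 0);
    }
}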
Simple Searching Question
Hello,

I inserted the following documents into Solr:

---
<add>
  <doc>
    <field name="id">124</field>
    <field name="foobar_facet">Jake Conk</field>
  </doc>
  <doc>
    <field name="id">125</field>
    <field name="foobar_facet">Jake Conk</field>
  </doc>
</add>
---

id is the only required integer field.
foobar_facet is a dynamic string field.

When I try to search for anything with the word Jake in it the following ways, I get no results.

select?q=Jake
select?q=Jake*

I thought one of those two should work, but the only way I got it to work was by specifying which field "Jake" is in, along with a wildcard:

select?q=foobar_facet:Jake*

1) Does this mean for each field I would like to search, if Jake exists I would have to add each field like I did above to the query?

2) How would I search if I want to find the name Jake anywhere in the string? The documentation (http://lucene.apache.org/java/docs/queryparsersyntax.html) states that I cannot use a wildcard as the first character, such as *Jake*.

Thanks,
- Jake
Re: Simple Searching Question
Hi Jake, What is the type of the foobar_facet field in your schema.xml ? Did you add foobar_facet as the default search field? On Fri, Aug 15, 2008 at 3:13 AM, Jake Conk <[EMAIL PROTECTED]> wrote: > Hello, > > I inserted the following documents into Solr: > > > --- > > > > 124 > Jake Conk > > > 125 > Jake Conk > > > > > --- > > id is the only required integer field. > foobar_facet is a dynamic string field. > > When I try to search for anything with the word Jake in it the > following ways I get no results. > > > select?q=Jake > select?q=Jake* > > > I thought one of those two should work but the only way I got it to > work was by specifying which field "Jake" is in along with a wild > card. > > > select?q=foobar_facet:Jake* > > > 1) Does this mean for each field I would like to search if Jake exists > I would have to add each field like I did above to the query? > > 2) How would I search if I want to find the name Jake anywhere in the > string? The documentation > (http://lucene.apache.org/java/docs/queryparsersyntax.html) states > that I cannot use a wildcard as the first character such as *Jake* > > Thanks, > - Jake > -- Regards, Shalin Shekhar Mangar.
Re: Highlighting returns incorrect text on some results?
A question mark huh? You sure there are no character encoding issues going on? Otis Gospodnetic wrote: Paul, we had many highlighter-related changes since 1.2, so I suggest you try the nightly. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: pdovyda2 <[EMAIL PROTECTED]> To: solr-user@lucene.apache.org Sent: Thursday, August 14, 2008 2:56:42 PM Subject: Highlighting returns incorrect text on some results? This is kind of a strange issue, but when I submit a query and ask for highlighting back, sometimes the highlighted text includes a question mark at the beginning, although a question mark character does not appear in the field that the highlighted text is taken from. I've put some sample XML output on the web at http://ucair.cs.uiuc.edu/pdovyda2/problem.xml If you look at the first and third highlights, you'll see what I'm talking about. Besides looking a bit odd, it is causing my application to break because the highlighted field is multivalued, and I was doing text matching to determine which of the values was chosen for highlighting. Is this actually a bug, or have I just misconfigured something? By the way, I am using the 1.2 release, I have not yet tried out a nightly build to see if this is an old problem. Thanks, Paul -- View this message in context: http://www.nabble.com/Highlighting-returns-incorrect-text-on-some-results--tp18987598p18987598.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: More files in index directory than expected
On Thu, Aug 14, 2008 at 2:01 PM, Michael McCandless <[EMAIL PROTECTED]> wrote:
> Chris Harris <[EMAIL PROTECTED]> wrote:
>> It's my understanding that if my mergeFactor is 10, then there
>> shouldn't be more than 11 segments in my index directory (10 segments,
>> plus an additional segment if a merge is in progress).
>
> Actually, mergeFactor 10 means each *level* will have <= 10 segments,
> where a level is roughly 10X the size of the previous level.
>
> EG after 10 segments (level 0) are flushed, they get merged into a
> single level 1 segment. Another 10 produces another level 1 segment.
> Etc. Until you have 10 level 1 segments, which then get merged into a
> single level 2 segment.
>
> The number of levels you have is logarithmic in the size of your index.

Thanks, that undoes a lot of my confusion.

As for segment creation, is it accurate to say the following: Solr will write a new level 0 segment to disk each time an additional ramBufferSizeMB (default=32MB) worth of data has been added to the index. Furthermore, once that 32MB worth of data has been written to disk, that segment's files will never be modified. (The only time a segment will be modified is if you delete documents from it, and that will only alter the segment's .del file, leaving .tis and friends alone.)

>> I just discovered that one of my other indexes has over 11,000 tis
>> files. That's disturbing. I'm not sure if it would have the same
>> underlying cause.
>
> That does NOT sound right. Can you provide more details how this
> index is created/maintained?

I don't know exactly what happened, but I restarted Solr once or twice, and then when I started adding documents again, Solr started deleting segment files, and brought things down from like 500GB to like 18GB. I feel like I read somewhere that Solr sometimes has trouble deleting segment files when running on Windows. (I'm on Windows right now.) I wonder if that's related.

The main thing that bugs me about this index now is that the latest version of Luke (0.8.1) won't open it. ("Unknown format version: -6") The Solr Luke handler works fine with it, though.
Re: More files in index directory than expected
The main thing that bugs me about this index now is that the latest version of Luke (0.8.1) won't open it. ("Unknown format version: -6") The Solr Luke handler works fine with it, though. Luke comes with a released version of Lucene probably, while solr is using a later version. You have to start luke with the solr Lucene jar on the classpath. There are directions to do this type of thing on the Luke webpage. - Mark
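Something along these lines should do it (the jar names are assumptions - point at whatever lucene-core jar your Solr build actually ships, and use ';' instead of ':' as the classpath separator on Windows):

java -classpath luke.jar:lucene-core-2.4-dev.jar org.getopt.luke.Luke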
Re: Simple Searching Question
Hi Shalin,

"foobar_facet" is a dynamic field. It's defined in my schema like this:

<dynamicField name="*_facet" type="string" indexed="true" stored="true"/>

I have the default search field set to text. Can I use more than one default search field?

<defaultSearchField>text</defaultSearchField>

Thanks,
- Jake

On Thu, Aug 14, 2008 at 2:48 PM, Shalin Shekhar Mangar <[EMAIL PROTECTED]> wrote:
> Hi Jake,
>
> What is the type of the foobar_facet field in your schema.xml ?
> Did you add foobar_facet as the default search field?
>
> On Fri, Aug 15, 2008 at 3:13 AM, Jake Conk <[EMAIL PROTECTED]> wrote:
>
>> Hello,
>>
>> I inserted the following documents into Solr:
>>
>> ---
>> <add>
>>   <doc>
>>     <field name="id">124</field>
>>     <field name="foobar_facet">Jake Conk</field>
>>   </doc>
>>   <doc>
>>     <field name="id">125</field>
>>     <field name="foobar_facet">Jake Conk</field>
>>   </doc>
>> </add>
>> ---
>>
>> id is the only required integer field.
>> foobar_facet is a dynamic string field.
>>
>> When I try to search for anything with the word Jake in it the
>> following ways, I get no results.
>>
>> select?q=Jake
>> select?q=Jake*
>>
>> I thought one of those two should work, but the only way I got it to
>> work was by specifying which field "Jake" is in, along with a wildcard:
>>
>> select?q=foobar_facet:Jake*
>>
>> 1) Does this mean for each field I would like to search, if Jake
>> exists I would have to add each field like I did above to the query?
>>
>> 2) How would I search if I want to find the name Jake anywhere in the
>> string? The documentation
>> (http://lucene.apache.org/java/docs/queryparsersyntax.html) states
>> that I cannot use a wildcard as the first character, such as *Jake*.
>>
>> Thanks,
>> - Jake
>
> --
> Regards,
> Shalin Shekhar Mangar.
Re: More files in index directory than expected
On Thu, Aug 14, 2008 at 6:31 PM, Chris Harris <[EMAIL PROTECTED]> wrote:
> (The only time a segment will be modified is if you delete documents
> from it, and that will only alter the segment's .del file, leaving .tis
> and friends alone.)

Actually, these days .del files are even versioned.

> I don't know exactly what happened, but I restarted Solr once or twice,
> and then when I started adding documents again, Solr started deleting
> segment files, and brought things down from like 500GB to like 18GB. I
> feel like I read somewhere that Solr sometimes has trouble deleting
> segment files when running on Windows. (I'm on Windows right now.) I
> wonder if that's related.

Right... file deletion just tends to be delayed a bit longer on Windows. For example, if you do an optimize, all the segments will be merged into a single segment, but because a reader is still holding open the old index, you will see both sets of files. The old set of files will be deleted when a new IndexWriter is opened after the old reader is closed.

-Yonik
Re: Best way to index without diacritics
On Thu, 14 Aug 2008 11:34:47 -0400 "Steven A Rowe" <[EMAIL PROTECTED]> wrote:

[...]
> The kind of filter Walter is talking about - a generalized language-aware
> character normalization Solr/Lucene filter - does not yet exist. My guess is
> that if/when it does materialize, both the Solr and the Lucene projects will
> want to have it. Historically, most functionality shared by Solr and Lucene
> is eventually hosted by Lucene, since Solr has a Lucene dependency, but not
> vice-versa.
>
> So, yes, Solr would be responsible for hosting configuration for such a
> filter, but the responsibility for doing something with the configuration
> would be Lucene's, assuming that Lucene would (eventually) host the filter
> and Solr would host a factory over the filter.
>
> Steve

Thanks for the thorough explanation, Steve.

B
_
{Beto|Norberto|Numard} Meijome

"Throughout the centuries there were [people] who took first steps down new paths armed only with their own vision." Ayn Rand

I speak for myself, not my employer. Contents may be hot. Slippery when wet. Reading disclaimers makes you go blind. Writing them is worse. You have been Warned.
Re: Simple Searching Question
you're likely not copyField-ing *_facet to text, and we'd need to see what type of field it is to see how it will be analyzed at both search/index time.

the default schema.xml file is pretty well documented, so you might want to spend some time looking thru it, and reading the comments... lots of good info in there.

cheers,
rob

On Thu, Aug 14, 2008 at 7:17 PM, Jake Conk <[EMAIL PROTECTED]> wrote:
> Hi Shalin,
>
> "foobar_facet" is a dynamic field. It's defined in my schema like this:
>
> <dynamicField name="*_facet" type="string" indexed="true" stored="true"/>
>
> I have the default search field set to text. Can I use more than one
> default search field?
>
> <defaultSearchField>text</defaultSearchField>
>
> Thanks,
> - Jake
>
> On Thu, Aug 14, 2008 at 2:48 PM, Shalin Shekhar Mangar
> <[EMAIL PROTECTED]> wrote:
>> Hi Jake,
>>
>> What is the type of the foobar_facet field in your schema.xml ?
>> Did you add foobar_facet as the default search field?
>>
>> On Fri, Aug 15, 2008 at 3:13 AM, Jake Conk <[EMAIL PROTECTED]> wrote:
>>
>>> Hello,
>>>
>>> I inserted the following documents into Solr:
>>>
>>> ---
>>> <add>
>>>   <doc>
>>>     <field name="id">124</field>
>>>     <field name="foobar_facet">Jake Conk</field>
>>>   </doc>
>>>   <doc>
>>>     <field name="id">125</field>
>>>     <field name="foobar_facet">Jake Conk</field>
>>>   </doc>
>>> </add>
>>> ---
>>>
>>> id is the only required integer field.
>>> foobar_facet is a dynamic string field.
>>>
>>> When I try to search for anything with the word Jake in it the
>>> following ways, I get no results.
>>>
>>> select?q=Jake
>>> select?q=Jake*
>>>
>>> I thought one of those two should work, but the only way I got it to
>>> work was by specifying which field "Jake" is in, along with a wildcard:
>>>
>>> select?q=foobar_facet:Jake*
>>>
>>> 1) Does this mean for each field I would like to search, if Jake
>>> exists I would have to add each field like I did above to the query?
>>>
>>> 2) How would I search if I want to find the name Jake anywhere in the
>>> string? The documentation
>>> (http://lucene.apache.org/java/docs/queryparsersyntax.html) states
>>> that I cannot use a wildcard as the first character, such as *Jake*.
>>>
>>> Thanks,
>>> - Jake
>>
>> --
>> Regards,
>> Shalin Shekhar Mangar.
Re: Simple Searching Question
Rob,

Actually I am copying *_facet to text. I have the following for copyField in my schema:

<copyField source="*_facet" dest="text"/>

This is my field configuration in my schema:

<dynamicField name="*_facet" type="string" indexed="true" stored="true"/>

Thanks,
- Jake

On Thu, Aug 14, 2008 at 5:49 PM, Rob Casson <[EMAIL PROTECTED]> wrote:
> you're likely not copyField-ing *_facet to text, and we'd need to see
> what type of field it is to see how it will be analyzed at both
> search/index time.
>
> the default schema.xml file is pretty well documented, so you might
> want to spend some time looking thru it, and reading the comments...
> lots of good info in there.
>
> cheers,
> rob
>
> On Thu, Aug 14, 2008 at 7:17 PM, Jake Conk <[EMAIL PROTECTED]> wrote:
>> Hi Shalin,
>>
>> "foobar_facet" is a dynamic field. It's defined in my schema like this:
>>
>> <dynamicField name="*_facet" type="string" indexed="true" stored="true"/>
>>
>> I have the default search field set to text. Can I use more than one
>> default search field?
>>
>> <defaultSearchField>text</defaultSearchField>
>>
>> Thanks,
>> - Jake
>>
>> On Thu, Aug 14, 2008 at 2:48 PM, Shalin Shekhar Mangar
>> <[EMAIL PROTECTED]> wrote:
>>> Hi Jake,
>>>
>>> What is the type of the foobar_facet field in your schema.xml ?
>>> Did you add foobar_facet as the default search field?
>>>
>>> On Fri, Aug 15, 2008 at 3:13 AM, Jake Conk <[EMAIL PROTECTED]> wrote:
>>>
>>>> Hello,
>>>>
>>>> I inserted the following documents into Solr:
>>>>
>>>> ---
>>>> <add>
>>>>   <doc>
>>>>     <field name="id">124</field>
>>>>     <field name="foobar_facet">Jake Conk</field>
>>>>   </doc>
>>>>   <doc>
>>>>     <field name="id">125</field>
>>>>     <field name="foobar_facet">Jake Conk</field>
>>>>   </doc>
>>>> </add>
>>>> ---
>>>>
>>>> id is the only required integer field.
>>>> foobar_facet is a dynamic string field.
>>>>
>>>> When I try to search for anything with the word Jake in it the
>>>> following ways, I get no results.
>>>>
>>>> select?q=Jake
>>>> select?q=Jake*
>>>>
>>>> I thought one of those two should work, but the only way I got it to
>>>> work was by specifying which field "Jake" is in, along with a wildcard:
>>>>
>>>> select?q=foobar_facet:Jake*
>>>>
>>>> 1) Does this mean for each field I would like to search, if Jake
>>>> exists I would have to add each field like I did above to the query?
>>>>
>>>> 2) How would I search if I want to find the name Jake anywhere in the
>>>> string? The documentation
>>>> (http://lucene.apache.org/java/docs/queryparsersyntax.html) states
>>>> that I cannot use a wildcard as the first character, such as *Jake*.
>>>>
>>>> Thanks,
>>>> - Jake
>>>
>>> --
>>> Regards,
>>> Shalin Shekhar Mangar.
Re: IndexOutOfBoundsException
I seem to be able to reproduce this very easily and the data is
medline (so I am sure I can share it if needed with a quick email to
check).

- I am using fedora:
  %uname -a
  Linux ghetto5.projectlounge.com 2.6.23.1-42.fc8 #1 SMP Tue Oct 30
  13:18:33 EDT 2007 x86_64 x86_64 x86_64 GNU/Linux
  %java -version
  java version "1.7.0"
  IcedTea Runtime Environment (build 1.7.0-b21)
  IcedTea 64-Bit Server VM (build 1.7.0-b21, mixed mode)
- single core (will use shards, but each machine just has one HDD, so I
  didn't see how cores would help -- but I am new at this)
- next run I will keep the output to check for earlier errors
- very reproducible, and I can share code + data if that will help

On Thu, Aug 14, 2008 at 4:23 PM, Yonik Seeley <[EMAIL PROTECTED]> wrote:
> Yikes... not good. This shouldn't be due to anything you did wrong,
> Ian... it looks like a lucene bug.
>
> Some questions:
> - what platform are you running on, and what JVM?
> - are you using multicore? (I fixed some index locking bugs recently)
> - are there any exceptions in the log before this?
> - how reproducible is this?
>
> -Yonik
>
> On Thu, Aug 14, 2008 at 2:47 PM, Ian Connor <[EMAIL PROTECTED]> wrote:
>> Hi,
>>
>> I have rebuilt my index a few times (it should get up to about 4
>> million, but around 1 million it starts to fall apart).
>>
>> Exception in thread "Lucene Merge Thread #0"
>> org.apache.lucene.index.MergePolicy$MergeException:
>> java.lang.IndexOutOfBoundsException: Index: 105, Size: 33
>>    at org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:323)
>>    at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:300)
>> Caused by: java.lang.IndexOutOfBoundsException: Index: 105, Size: 33
>>    at java.util.ArrayList.rangeCheck(ArrayList.java:572)
>>    at java.util.ArrayList.get(ArrayList.java:350)
>>    at org.apache.lucene.index.FieldInfos.fieldInfo(FieldInfos.java:260)
>>    at org.apache.lucene.index.FieldsReader.doc(FieldsReader.java:188)
>>    at org.apache.lucene.index.SegmentReader.document(SegmentReader.java:670)
>>    at org.apache.lucene.index.SegmentMerger.mergeFields(SegmentMerger.java:349)
>>    at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:134)
>>    at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:3998)
>>    at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3650)
>>    at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:214)
>>    at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:269)
>>
>> When this happens, the disk usage goes right up and the indexing
>> really starts to slow down. I am using a Solr build from about a week
>> ago - so my Lucene is at 2.4 according to the war files.
>>
>> Has anyone seen this error before? Is it possible to tell which array
>> is too large? Would it be an array I am sending in or another internal
>> one?
>>
>> Regards,
>> Ian Connor

--
Regards,

Ian Connor
Re: IndexOutOfBoundsException
Since this looks like more of a lucene issue, I've replied in
[EMAIL PROTECTED]

-Yonik

On Thu, Aug 14, 2008 at 10:18 PM, Ian Connor <[EMAIL PROTECTED]> wrote:
> I seem to be able to reproduce this very easily and the data is
> medline (so I am sure I can share it if needed with a quick email to
> check).
>
> - I am using fedora:
>   %uname -a
>   Linux ghetto5.projectlounge.com 2.6.23.1-42.fc8 #1 SMP Tue Oct 30
>   13:18:33 EDT 2007 x86_64 x86_64 x86_64 GNU/Linux
>   %java -version
>   java version "1.7.0"
>   IcedTea Runtime Environment (build 1.7.0-b21)
>   IcedTea 64-Bit Server VM (build 1.7.0-b21, mixed mode)
> - single core (will use shards, but each machine just has one HDD, so I
>   didn't see how cores would help -- but I am new at this)
> - next run I will keep the output to check for earlier errors
> - very reproducible, and I can share code + data if that will help
>
> On Thu, Aug 14, 2008 at 4:23 PM, Yonik Seeley <[EMAIL PROTECTED]> wrote:
>> Yikes... not good. This shouldn't be due to anything you did wrong,
>> Ian... it looks like a lucene bug.
>>
>> Some questions:
>> - what platform are you running on, and what JVM?
>> - are you using multicore? (I fixed some index locking bugs recently)
>> - are there any exceptions in the log before this?
>> - how reproducible is this?
>>
>> -Yonik
>>
>> On Thu, Aug 14, 2008 at 2:47 PM, Ian Connor <[EMAIL PROTECTED]> wrote:
>>> Hi,
>>>
>>> I have rebuilt my index a few times (it should get up to about 4
>>> million, but around 1 million it starts to fall apart).
>>>
>>> Exception in thread "Lucene Merge Thread #0"
>>> org.apache.lucene.index.MergePolicy$MergeException:
>>> java.lang.IndexOutOfBoundsException: Index: 105, Size: 33
>>>    at org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:323)
>>>    at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:300)
>>> Caused by: java.lang.IndexOutOfBoundsException: Index: 105, Size: 33
>>>    at java.util.ArrayList.rangeCheck(ArrayList.java:572)
>>>    at java.util.ArrayList.get(ArrayList.java:350)
>>>    at org.apache.lucene.index.FieldInfos.fieldInfo(FieldInfos.java:260)
>>>    at org.apache.lucene.index.FieldsReader.doc(FieldsReader.java:188)
>>>    at org.apache.lucene.index.SegmentReader.document(SegmentReader.java:670)
>>>    at org.apache.lucene.index.SegmentMerger.mergeFields(SegmentMerger.java:349)
>>>    at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:134)
>>>    at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:3998)
>>>    at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3650)
>>>    at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:214)
>>>    at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:269)
>>>
>>> When this happens, the disk usage goes right up and the indexing
>>> really starts to slow down. I am using a Solr build from about a week
>>> ago - so my Lucene is at 2.4 according to the war files.
>>>
>>> Has anyone seen this error before? Is it possible to tell which array
>>> is too large? Would it be an array I am sending in or another internal
>>> one?
>>>
>>> Regards,
>>> Ian Connor
>
> --
> Regards,
>
> Ian Connor
Re: Index size vs. number of documents
: > I'm surprised, as you are, by the non-linearity. Out of curiosity, what is

Unless the data in "stored" fields is significantly greater than the
"indexed" fields, the index size almost never grows linearly with the
number of documents -- it's the number of unique terms that tends to
primarily influence the size of the index.

At some point someone on the java-user list who really understood the file
formats wrote a really great formula for estimating the size of the index
assuming some ratios of unique terms per doc, but i can't find it now.

-Hoss
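As a rough back-of-the-envelope sketch of why the growth is sub-linear
(this uses Heaps' law as an assumption for illustration -- it is not the
java-user formula mentioned above): the number of unique terms V in a
collection of n total tokens is commonly modeled as

    V(n) ~ k * n^b      (typically k in 10-100 and b in 0.4-0.6 for English)

With k = 50 and b = 0.5, 1 million docs at ~100 tokens each give 1e8 tokens
and V ~ 50 * (1e8)^0.5 = 500,000 unique terms, while 4 million docs give
4e8 tokens and V ~ 50 * (4e8)^0.5 = 1,000,000 -- four times the documents
but only twice the unique terms, which is why the term dictionary, and with
it the index, grows much more slowly than the document count.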