several tokenizers in one field type
hi all, ( I'm using 1.3 nightly build from 15th June 08.) Is there some documentation about how analysers + tokenizers are applied in fields ? In particular, my question : - If I define 2 tokenizers in a fieldtype, only the first one is applied, the other is ignored. Is that because the 2nd tokenizer would have to work recursively on the tokens generated from the previous one? Would I have to create my custom tokenizer to perform the job of 2 existing tokenizers in one ? I'll send some other questions in a separate email... thx B _ {Beto|Norberto|Numard} Meijome "Build a system that even a fool can use, and only a fool will want to use it." George Bernard Shaw I speak for myself, not my employer. Contents may be hot. Slippery when wet. Reading disclaimers makes you go blind. Writing them is worse. You have been Warned.
Re: several tokenizers in one field type
On Jun 24, 2008, at 12:07 AM, Norberto Meijome wrote: hi all, ( I'm using 1.3 nightly build from 15th June 08.) Is there some documentation about how analysers + tokenizers are applied in fields ? In particular, my question : best docs are here: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters - If I define 2 tokenizers in a fieldtype, only the first one is applied, the other is ignored. Is that because the 2nd tokenizer would have to work recursively on the tokens generated from the previous one? Would I have to create my custom tokenizer to perform the job of 2 existing tokenizers in one ? if you define two tokenizers, solr should throw an error; the second one can't do anything anyway. The tokenizer breaks the input stream into a stream of tokens, then token filters can modify these tokens. ryan
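For illustration, a minimal fieldType along these lines (the factory names are stock Solr ones chosen for the example, not taken from the thread) shows the shape Ryan describes: exactly one tokenizer, followed by filters applied in order.

<fieldType name="text_example" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- the single tokenizer turns the raw input into a token stream -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- each filter then transforms that stream, top to bottom -->
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>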
(Edge)NGram tokenizer interaction with other filters
hi everyone, if I define a field as [the fieldType definition was stripped by the mail archive; a partial copy survives quoted in Otis's reply below] I would expect that, when pushing data into it, this is what would happen: - Stop words removed by StopFilterFactory - content broken into several 'words' as per WordDelimiterFilterFactory. - the result of all this passed to EdgeNGram (or nGram) tokenizer so, when indexing 'The Veronicas', only 'Veronicas' would reach the NGram tokenizer What I find is that the n-gram tokenizers kick in first, and the filters after, making it a rather moot exercise. I've confirmed the steps in analysis.jsp : Index Analyzer org.apache.solr.analysis.NGramTokenizerFactory {maxGramSize=15, minGramSize=2} [..] org.apache.solr.analysis.StopFilterFactory {words=stopwords.txt, ignoreCase=true, enablePositionIncrements=true} [..] org.apache.solr.analysis.LowerCaseFilterFactory {} [...] org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory {} [...] What am I doing / understanding wrong? thanks!! B _ {Beto|Norberto|Numard} Meijome Windows caters to everyone as though they are idiots. UNIX makes no such assumption. It assumes you know what you are doing, and presents the challenge of figuring it out for yourself if you don't. I speak for myself, not my employer. Contents may be hot. Slippery when wet. Reading disclaimers makes you go blind. Writing them is worse. You have been Warned.
Re: several tokenizers in one field type
On Tue, 24 Jun 2008 00:14:57 -0700 Ryan McKinley <[EMAIL PROTECTED]> wrote: > best docs are here: > http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters yes, I've been reading that already , thanks :) > > > - If I define 2 tokenizers in a fieldtype, only the first one is > > applied, the > > other is ignored. Is that because the 2nd tokenizer would have to work > > recursively on the tokens generated from the previous one? Would I > > have to > > create my custom tokenizer to perform the job of 2 existing > > tokenizers in one ? > > if you define two tokenizers, solr should throw an error the > second one can't do anything. no error that I can see - i'm using the default log settings from the solr test app bundled with nightly build. > The tokenizer breaks the input stream into a stream of tokens, then > token filters can modify these tokens. ok, that makes sense.That *should* explain what I described in my other email.( Subject: (Edge)NGram tokenizer interaction with other filters ) thanks a lot Ryan :) B _ {Beto|Norberto|Numard} Meijome "Tell a person you're the Metatron and they stare at you blankly. Mention something out of a Charleton Heston movie and suddenly everyone's a Theology scholar!" Dogma I speak for myself, not my employer. Contents may be hot. Slippery when wet. Reading disclaimers makes you go blind. Writing them is worse. You have been Warned.
Parser of Response XML
Hi, is any class available in the SOLR API to parse the response XML? Regards, Ranjeet
Re: Parser of Response XML
org.apache.solr.client.solrj.impl.XMLResponseParser On Tue, Jun 24, 2008 at 3:06 PM, Ranjeet <[EMAIL PROTECTED]> wrote: > Hi, > > is any class is available in SOLR API to parse the response XML? > > Regards, > Ranjeet -- --Noble Paul
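For anyone reaching Solr through SolrJ rather than raw HTTP, a sketch along these lines shows where XMLResponseParser fits in; the class and method names are as remembered for the 1.3-era client, so treat the exact signatures as an assumption.

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.impl.XMLResponseParser;
import org.apache.solr.client.solrj.response.QueryResponse;

public class XmlResponseExample {
    public static void main(String[] args) throws Exception {
        // point at a running Solr instance (the URL is illustrative)
        CommonsHttpSolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
        // request the XML response format and let SolrJ parse it
        server.setParser(new XMLResponseParser());
        QueryResponse rsp = server.query(new SolrQuery("*:*"));
        System.out.println("hits: " + rsp.getResults().getNumFound());
    }
}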
Re: (Edge)NGram tokenizer interaction with other filters
One tokenizer is followed by filters. I think this all might be a bit clearer if you read the chapter about Analyzers in Lucene in Action if you have a copy. I think if you try to break down that "the result of all this passed to " into something more concrete and real you will see how things (should) work. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message > From: Norberto Meijome <[EMAIL PROTECTED]> > To: SOLR-Usr-ML > Sent: Tuesday, June 24, 2008 3:19:09 AM > Subject: (Edge)NGram tokenizer interaction with other filters > > hi everyone, > > if I define a field as > > [the fieldType definition was stripped of its markup by the mail archive; the surviving attributes show, for both the index and query analyzers, a StopFilterFactory (words="stopwords.txt"), a WordDelimiterFilterFactory (generateWordParts="1", generateNumberParts="1", catenate* options) and an org.apache.solr.analysis.EdgeNGramTokenizerFactory (minGramSize="2", maxGramSize="15")] > > I would expect that, when pushing data into it, this is what would happen: > - Stop words removed by StopFilterFactory > - content broken into several 'words' as per WordDelimiterFilterFactory. > - the result of all this passed to EdgeNGram (or nGram) tokenizer > > so, when indexing 'The Veronicas', only 'Veronicas' would reach the NGram tokenizer > > What I find is that the n-gram tokenizers kick in first, and the filters after, making it a rather moot exercise. I've confirmed the steps in analysis.jsp : > > Index Analyzer > org.apache.solr.analysis.NGramTokenizerFactory {maxGramSize=15, minGramSize=2} [..] > org.apache.solr.analysis.StopFilterFactory {words=stopwords.txt, ignoreCase=true, enablePositionIncrements=true} [..] > org.apache.solr.analysis.LowerCaseFilterFactory {} [...] > org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory {} [...] > > What am I doing / understanding wrong? > > thanks!! > B > _ > {Beto|Norberto|Numard} Meijome > > Windows caters to everyone as though they are idiots. UNIX makes no such assumption. It assumes you know what you are doing, and presents the challenge of figuring it out for yourself if you don't. > > I speak for myself, not my employer. Contents may be hot. Slippery when wet. Reading disclaimers makes you go blind. Writing them is worse. You have been Warned.
SOLR-139 (Support updateable/modifiable documents)
Hi, Does anyone know if SOLR-139 (Support updateable/modifiable documents) will make it back into the 1.3 release? I'm looking for a way to append data to a multivalued field in a document over a period of time (in which the document represents a forum thread and the multivalued field represents the messages attached to this thread). Thanks, Dave Dave Searle Lead Developer MPS Magicalia Ltd. Thank you for your interest in Magicalia Media. www.magicalia.com Special interest communities are Magicalia's mission in life. Magicalia publishes specialist websites and magazine titles for people who have a passion for their hobby, sport or area of interest. For further information, please call 01689 899200 or fax 01689 899266. Magicalia Ltd, Caxton House, 2 Farringdon Road, London, EC1M 3HN Registered: England & Wales, Registered Number: 3828584, VAT Number: 744 4983 00 Magicalia Publishing Ltd, Berwick House, 8-10 Knoll Rise, Orpington, BR6 0EL Registered: England & Wales, Registered Number: 5649018, VAT Number: 872 8179 83 Magicalia Media Ltd, Caxton House, 2 Farringdon Road, London, EC1M 3HN Registered: England & Wales, Registered Number: 5780320, VAT Number: 888 0357 82 This email and any files transmitted with it are confidential and intended solely for the use of the individual or entity to which they are addressed. If you have received this email in error please reply to this email and then delete it. Please note that any views or opinions presented in this email are solely those of the author and do not necessarily represent those of Magicalia. The recipient should check this email and any attachments for the presence of viruses. Magicalia accepts no liability for any damage caused by any virus transmitted by this email. Magicalia may regularly and randomly monitor outgoing and incoming emails and other telecommunications on its email and telecommunications systems. By replying to this email you give your consent to such monitoring. Copyright in this e-mail and any attachments created by Magicalia Media belongs to Magicalia Media.
Re: (Edge)NGram tokenizer interaction with other filters
On Tue, 24 Jun 2008 04:54:46 -0700 (PDT) Otis Gospodnetic <[EMAIL PROTECTED]> wrote: > One tokenizer is followed by filters. I think this all might be a bit > clearer if you read the chapter about Analyzers in Lucene in Action if you > have a copy. I think if you try to break down that "the result of all this > passed to " into something more concrete and real you will see how things > (should) work. thanks Otis, from this and Ryan's previous reply I understand I was mistaken on how I was seeing the process - i was expecting the filters / tokenizers to work as processes with the output of one going to the input of the next , in the order shown in fieldType definition. .. now that I write this i remember reading some posts on this list about doing something like this ... open-pipe ? anyway, it makes sense...not what I was hoping for, but it's what I have to work with. Now, if only I can get n-gram to work with search terms > minGramSize :P Thanks for your time, help and recommendation of Lucene in Action. B _ {Beto|Norberto|Numard} Meijome "The greatest dangers to liberty lurk in insidious encroachment by men of zeal, well-meaning but without understanding." Justice Louis D. Brandeis I speak for myself, not my employer. Contents may be hot. Slippery when wet. Reading disclaimers makes you go blind. Writing them is worse. You have been Warned.
SOLR-469 - bad patch?
It seems the new patch @ https://issues.apache.org/jira/browse/SOLR-469 is x2 the size but turns out the patch itself might be bad? Ie, it dumps build.xml twice, is it just me? Thanks. - Jon
Re: SOLR-139 (Support updateable/modifiable documents)
I don't know if SOLR-139 will make it into 1.3, but from your brief description, I'd say you might want to consider a different schema for your data. Stuffing thread messages in the same doc that represents a thread may not be the best choice. Of course, you may have good reasons for doing that, I just don't know them. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message > From: Dave Searle <[EMAIL PROTECTED]> > To: "solr-user@lucene.apache.org" > Sent: Tuesday, June 24, 2008 8:34:47 AM > Subject: SOLR-139 (Support updateable/modifiable documents) > > Hi, > > Does anyone know if SOLR-139 (Support updateable/modifiable documents) will make it back into the 1.3 release? I'm looking for a way to append data to a multivalued field in a document over a period of time (in which the document represents a forum thread and the multivalued field represents the messages attached to this thread). > > Thanks, > > Dave > [quoted signature and corporate disclaimer trimmed]
Re: Accented search
Here is how I did it (the code is from memory so it might not be correct 100%):

private boolean hasAccents;
private Token filteredToken;

public final Token next() throws IOException {
  // emit the buffered accent-free clone, if one was made on the previous call
  if (hasAccents) {
    hasAccents = false;
    return filteredToken;
  }
  Token t = input.next();
  if (t == null) {
    return null;
  }
  String filteredText = removeAccents(t.termText());
  if (filteredText.equals(t.termText())) {
    // no accents, pass the token through unchanged
    return t;
  } else {
    // buffer a clone with the accents stripped, at the same position
    filteredToken = (Token) t.clone();
    filteredToken.setTermText(filteredText);
    filteredToken.setPositionIncrement(0);
    hasAccents = true;
  }
  return t;
}

On Sat, Jun 21, 2008 at 2:37 AM, Phillip Farber <[EMAIL PROTECTED]> wrote: > Regarding indexing words with accented and unaccented characters with > positionIncrement zero: > > Chris Hostetter wrote: > >> >> you don't really need a custom tokenizer -- just a buffered TokenFilter >> that clones the original token if it contains accent chars, mutates the >> clone, and then emits it next with a positionIncrement of 0. >> >> > Could someone expand on how to implement this technique of buffering and > cloning? > > Thanks, > > Phil > -- Regards, Cuong Hoang
Otis : Re: n-Gram, only works with queries of 2 letters
On Tue, 24 Jun 2008 09:10:58 +1000 Norberto Meijome <[EMAIL PROTECTED]> wrote: > On Mon, 23 Jun 2008 05:33:49 -0700 (PDT) > Otis Gospodnetic <[EMAIL PROTECTED]> wrote: > > > Hi, > > > > When you add &debugQuery=true to the request, what does your query look > > like after parsing? hi Otis, can you provide some insight as to what is going on here? am I only supposed to use search terms of length = minGramSize against fields tokenized with nGramTokenizer ? Any pointers will be greatly appreciated. TIA for your time, Beto > > Hi Otis, > sorry, i should have sent this before too. > With minGramSize = 3 , same data, clean server start, index rebuilt. 2 cases shown below, one not working, one working. The 4 letter case (not working) seems to be parsed properly, and as expected one of the tokens generated is same as my 3 letter query that does work. > > DOESN'T WORK AS EXPECTED CASE > > [debug response trimmed of its stripped-out XML: q=eche on artist_ngram parses to PhraseQuery(artist_ngram:"ech che eche") but returns no documents] > > WORKS AS EXPECTED CASE > > http://localhost:8983/solr/_test_/select?q=ech&df=artist_ngram&debugQuery=true > > [debug response trimmed of its stripped-out XML: q=ech parses to artist_ngram:ech and matches "Depeche Mode"; the explain section shows 0.90429556 = fieldWeight(artist_ngram:ech in 43), product of tf=1.0, idf=5.787492 (docFreq=1, numDocs=240) and fieldNorm=0.15625] > > Thanks, > B _ {Beto|Norberto|Numard} Meijome Software QA is like cleaning my cat's litter box: Sift out the big chunks. Stir in the rest. Hope it doesn't stink. I speak for myself, not my employer. Contents may be hot. Slippery when wet. Reading disclaimers makes you go blind. Writing them is worse. You have been Warned.
RE: SOLR-139 (Support updateable/modifiable documents)
Thanks Otis, At the moment I have an index of forum messages (each message being a separate doc). Results are displayed on a per message basis, however, I would like to group the results via their thread. Apart from using a facet on the thread title (which would lose relevancy), I cannot see a way of doing this. So my idea was to build a new index with the thread being the main document entity and a multivalued field for the message data. Using the work done in SOLR-139 I could then update this field as new messages are posted (and any other thread fields such as message count, date of the last post and so on) Without SOLR 139, I would currently have to re-index the whole thread; some threads having thousands of messages which could obviously take some time! :) Am I looking at this from the wrong angle? Have you come across similar scenarios? Thanks for your time, Dave -Original Message- From: Otis Gospodnetic [mailto:[EMAIL PROTECTED] Sent: 24 June 2008 15:33 To: solr-user@lucene.apache.org Subject: Re: SOLR-139 (Support updateable/modifiable documents) I don't know if SOLR-139 will make it into 1.3, but from your brief description, I'd say you might want to consider a different schema for your data. Stuffing thread messages in the same doc that represents a thread may not be the best choice. Of course, you may have good reasons for doing that, I just don't know them. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch [rest of quoted thread, virus-scanner footers and corporate disclaimers trimmed]
Re: SOLR-139 (Support updateable/modifiable documents)
On Tue, 24 Jun 2008 16:04:24 +0100 Dave Searle <[EMAIL PROTECTED]> wrote: > At the moment I have an index of forum messages (each message being a > separate doc). Results are displayed on a per message basis, however, I would > like to group the results via their thread. Apart from using a facet on the > thread title (which would lose relevancy), I cannot see a way of doing this. what about storing the thread id (+other information needed to regenerate the messages in order) instead of the subject as a facet ? or just use the thread_id as a filter... B _ {Beto|Norberto|Numard} Meijome Hildebrant's Principle: If you don't know where you are going, any road will get you there. I speak for myself, not my employer. Contents may be hot. Slippery when wet. Reading disclaimers makes you go blind. Writing them is worse. You have been Warned.
RE: SOLR-139 (Support updateable/modifiable documents)
I am currently storing the thread id within the message index, however, although this would allow me to sort, it doesn't help with the grouping of threads based on relevancy. See, the idea is to index message data in the thread documents and then boost the message multivalued field over the thread title and thread description (which in my opinion, would give better results). The user, when presented with the thread results, could then drill down into a particular thread's messages using the same search terms on the message index (but filtered by the referring thread id) -Original Message- From: Norberto Meijome [mailto:[EMAIL PROTECTED] Sent: 24 June 2008 16:16 To: solr-user@lucene.apache.org Subject: Re: SOLR-139 (Support updateable/modifiable documents) On Tue, 24 Jun 2008 16:04:24 +0100 Dave Searle <[EMAIL PROTECTED]> wrote: > At the moment I have an index of forum messages (each message being a separate doc). Results are displayed on a per message basis, however, I would like to group the results via their thread. Apart from using a facet on the thread title (which would lose relevancy), I cannot see a way of doing this. what about storing the thread id (+other information needed to regenerate the messages in order) instead of the subject as a facet ? or just use the thread_id as a filter... B _ {Beto|Norberto|Numard} Meijome [quoted signature, virus-scanner footers and corporate disclaimers trimmed]
Re: Accented search
climbingrose wrote: Here is how I did it (the code is from memory so it might not be correct 100%): [buffered accent-stripping TokenFilter code, quoted in full in the message above] On Sat, Jun 21, 2008 at 2:37 AM, Phillip Farber <[EMAIL PROTECTED]> wrote: Regarding indexing words with accented and unaccented characters with positionIncrement zero: Chris Hostetter wrote: you don't really need a custom tokenizer -- just a buffered TokenFilter that clones the original token if it contains accent chars, mutates the clone, and then emits it next with a positionIncrement of 0. Could someone expand on how to implement this technique of buffering and cloning? Thanks, Phil I was just facing the same issue and came up with the following as a solution. I changed the schema.xml file so that for the text field the analyzers and filters are as follows: [the schema snippet was stripped of its markup by the mail archive; the surviving attributes (words="stopwords.txt", generateWordParts/catenate* options, protected="protwords.txt", synonyms="synonyms.txt" with ignoreCase="true" expand="true") suggest the stock example 'text' analyzer chain, with two new filter lines added at the front of each analyzer] These two lines are the new ones: the first line invokes a custom filter that I borrowed and modified that turns decomposed unicode ( like Pe'rez ) to the composed form ( Pérez ); the second line replaces accented characters with their unaccented equivalents ( Perez ). For the custom filter to work, you must create a lib directory as a sibling to the conf directory and place the jar files containing the custom filter there. The Jars can be downloaded from the blacklight subversion repository at: http://blacklight.rubyforge.org/svn/trunk/solr/lib/ The SolrPlugin.jar contains the classes UnicodeNormalizationFilter and UnicodeNormalizationFilterFactory which merely invoke the Normalizer.normalize function in the normalizer jar (which is taken from the marc4j distribution and which is a subset of the icu4j library) -Robert Haschart
Re: SOLR-139 (Support updateable/modifiable documents)
On Tue, 24 Jun 2008 16:34:44 +0100 Dave Searle <[EMAIL PROTECTED]> wrote: > I am currently storing the thread id within the message index, however, > although this would allow me to sort, it doesn't help with the grouping of > threads based on relevancy. See the idea is to index message data in the > thread documents and then boost the message mutlivalued field over the thread > title and thread description (which in my opinion, would give better results). > > The user, when presented with the thread results, could then drill down into > a particular thread's messages using the same search terms on the message > index (but filtered by the referring thread id) It is very very likely that I am just quite late and I'm tired ...but I think the approach of having one document per forum message would allow you to implement what you want... and, otoh, not too sure the multivalued field would work as well. ie, store the link to the start of the thread and thread subject in all docs, as well as store link to post and text of post (and thread id, etc, as needed). boost the content of the posting over the subject. (but I have a feeling this may not be what you have in mind when you say "grouping of threads based on relevancy" .. is it? ) B _ {Beto|Norberto|Numard} Meijome "I didn't attend the funeral, but I sent a nice letter saying I approved of it." Mark Twain I speak for myself, not my employer. Contents may be hot. Slippery when wet. Reading disclaimers makes you go blind. Writing them is worse. You have been Warned.
Re: SOLR-469 - bad patch?
I've just uploaded a new patch which applies cleanly on the trunk. Thanks! On Tue, Jun 24, 2008 at 7:35 PM, Jon Baer <[EMAIL PROTECTED]> wrote: > It seems the new patch @ https://issues.apache.org/jira/browse/SOLR-469 is > x2 the size but turns out the patch itself might be bad? > > Ie, it dumps build.xml twice, is it just me? > > Thanks. > > - Jon > -- Regards, Shalin Shekhar Mangar.
RE: never desallocate RAM...during search
Hi, I'm having problems with the patch. With this schema.xml: [field definition stripped by the mail archive] If I send documents with a content smaller than 3 I have an exception during the indexing. If I change the maxLength to, for example, 30 the documents that before gave the exception are now indexed correctly. The exception is:

GRAVE: java.lang.StringIndexOutOfBoundsException: String index out of range: 3
at java.lang.String.substring(Unknown Source)
at org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:262)
at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:66)
at org.apache.solr.handler.XmlUpdateRequestHandler.processUpdate(XmlUpdateRequestHandler.java:196)
at org.apache.solr.handler.XmlUpdateRequestHandler.handleRequestBody(XmlUpdateRequestHandler.java:123)
at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:125)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:965)
at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:272)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:175)
at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128)
at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:263)
at org.apache.coyote.http11.Http11AprProcessor.process(Http11AprProcessor.java:852)
at org.apache.coyote.http11.Http11AprProtocol$Http11ConnectionHandler.process(Http11AprProtocol.java:584)
at org.apache.tomcat.util.net.AprEndpoint$Worker.run(AprEndpoint.java:1508)
at java.lang.Thread.run(Unknown Source)

I hope this helps. Thanks. Rober.

-Original Message- From: Erik Hatcher [mailto:[EMAIL PROTECTED] Sent: Monday, 23 June 2008 20:49 To: solr-user@lucene.apache.org Subject: Re: never desallocate RAM...during search On Jun 23, 2008, at 8:16 AM, <[EMAIL PROTECTED]> wrote: > I was doing something similar to your solution to have better searching times. > I downloaded your patch but I have a problem in one class. I'm not sure if I'm doing something wrong, but if I want to compile the project I must change in IndexSchema: > > //private Similarity similarity; > > AND PUT: > > private SimilarityFactory similarityFactory; > > Am I doing something incorrectly or is it a little bug? It's because the patch is out of sync with trunk. The SimilarityFactory was added recently. Erik
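The trace points at the substring call the patch adds in DocumentBuilder.toDocument; without the patch source at hand, the usual guard for this kind of failure looks something like the sketch below (the method and variable names are illustrative assumptions, not the patch's actual identifiers).

// Hypothetical guard around the truncation; the names are illustrative,
// not the identifiers used in the actual patch.
static String truncate(String fieldValue, int maxLength) {
    return fieldValue.length() > maxLength
        ? fieldValue.substring(0, maxLength)  // only cut values that are long enough
        : fieldValue;                         // shorter values pass through unchanged
}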
solr-14 help
hi all :) last week I reworked an older patch for SOLR-14 https://issues.apache.org/jira/browse/SOLR-14 this functionality is actually fairly important for our ongoing migration to solr, so I'd really love to get SOLR-14 into 1.3. but open-source being what it is, my super-important feature is most people's not-so-important feature :) anyway, I'm not a java programmer at all, so my work is probably sub-par. but it seems like low-hanging fruit for reasonably skilled java folks, so if there is anyone out there willing to lend a hand or just waiting for a (simple) opportunity to get involved, I'd be much appreciative. otherwise I guess it goes into my bin of locally applied patches. thanks --Geoff
Re: Attempting dataimport using FileListEntityProcessor
I do want to import all documents. My understanding of the way things work, correct me if I'm wrong, is that there can be a certain number of documents included in a single atomic update. Instead of having all my 16 Million documents be part of a single update (that could more easily fail being so big), I was thinking that it would be better to be able to stipulate how many docs are part of an update and my 16 Million doc import would consist of 16M/100 updates. Shalin Shekhar Mangar wrote: > > Hi Mike, > > Just curious to know the use-case here. Why do you want to limit updates > to > 100 instead of importing all documents? > > On Tue, Jun 24, 2008 at 10:23 AM, mike segv <[EMAIL PROTECTED]> wrote: > >> >> That fixed it. >> >> If I'm inserting millions of documents, how do I control docs/update? >> E.g. >> if there are 50K docs per file, I'm thinking that I should probably code >> up >> my own DataSource that allows me to stipulate docs/update. Like say, 100 >> instead of 50K. Does this make sense? >> >> Mike >> >> >> Noble Paul നോബിള് नोब्ळ् wrote: >> > >> > hi , >> > You have not registered any datasources . the second entity needs a >> > datasource. >> > Remove the dataSource="null" and add a name for the second entity >> > (good practice). No need for baseDir attribute for second entity . >> > See the modified xml added below >> > --Noble >> > >> > >> > >> > >> > > > newerThan="'NOW-10DAYS'" recursive="true" rootEntity="false" >> > dataSource="null" baseDir="/san/tomcat-services/solr-medline"> >> > > > forEach="/MedlineCitation" >> > url="${f.fileAbsolutePath}" > >> > >> > >> > >> > >> > >> > >> > On Tue, Jun 24, 2008 at 6:39 AM, mike segv <[EMAIL PROTECTED]> wrote: >> >> >> >> I'm trying to use the fileListEntityProcessor to add some xml >> documents >> >> to a >> >> solr index. I'm running a nightly version of solr-1.3 with SOLR-469 >> and >> >> SOLR-563. I've been able to successfuly run the slashdot >> httpDataSource >> >> example. My data-config.xml file loads without errors. When I >> attempt >> >> the >> >> full-import command I get the exception below. Thanks for any help. 
>> >> >> >> Mike >> >> >> >> WARNING: No lockType configured for >> >> /san/tomcat-services/solr-medline/solr/data/index/ assuming 'simple' >> >> Jun 23, 2008 7:59:49 PM >> org.apache.solr.handler.dataimport.DataImporter >> >> doFullImport >> >> SEVERE: Full Import failed >> >> java.lang.RuntimeException: java.lang.NullPointerException >> >>at >> >> >> org.apache.solr.handler.dataimport.XPathRecordReader.streamRecords(XPathRecordReader.java:97) >> >>at >> >> >> org.apache.solr.handler.dataimport.XPathEntityProcessor.initQuery(XPathEntityProcessor.java:212) >> >>at >> >> >> org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow(XPathEntityProcessor.java:166) >> >>at >> >> >> org.apache.solr.handler.dataimport.XPathEntityProcessor.nextRow(XPathEntityProcessor.java:149) >> >>at >> >> >> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:286) >> >>at >> >> >> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:312) >> >>at >> >> >> org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:179) >> >>at >> >> >> org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:140) >> >>at >> >> >> org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:335) >> >>at >> >> >> org.apache.solr.handler.dataimport.DataImporter.rumCmd(DataImporter.java:386) >> >>at >> >> >> org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:377) >> >> Caused by: java.lang.NullPointerException >> >>at java.io.Reader.(Reader.java:61) >> >>at java.io.BufferedReader.(BufferedReader.java:76) >> >>at >> com.bea.xml.stream.MXParser.checkForXMLDecl(MXParser.java:775) >> >>at com.bea.xml.stream.MXParser.setInput(MXParser.java:806) >> >>at >> >> >> com.bea.xml.stream.MXParserFactory.createXMLStreamReader(MXParserFactory.java:261) >> >>at >> >> >> org.apache.solr.handler.dataimport.XPathRecordReader.streamRecords(XPathRecordReader.java:93) >> >>... 10 more >> >> >> >> Here is my data-config: >> >> >> >> >> >> >> >> > >> newerThan="'NOW-10DAYS'" recursive="true" rootEntity="false" >> >> dataSource="null" baseDi >> >> r="/san/tomcat-services/solr-medline"> >> >> > >> url="${f.fileAbsolutePath}" dataSource="null"> >> >> >> >> >> >> >> >> >> >> >> >> >> >> And a snippet from an xml file: >> >> >> >> 12236137 >> >> >> >> 1980 >> >> 01 >> >> 03 >> >> >> >> >> >> >> >> -- >> >> View this message in context: >> >> >> http://www.nabble.com/Attempting-dataimport-using-FileListEntityProcessor-tp18081671p18081671.html >> >> Sent from the Solr - User mailing list archive at Nabble.com. >> >> >> >> >> > >> > >> > >> > -- >> > --Noble Paul >> > >> > >> >> -- >> View
Re: Attempting dataimport using FileListEntityProcessor
Ok, I got your point. DataImportHandler currently creates documents and adds them one-by-one to Solr. A commit/optimize is called once after all documents are finished. If a document fails to add due to any exception then the import fails. You can still achieve the functionality you want by setting maxDocs under the autoCommit section in solrconfig.xml On Tue, Jun 24, 2008 at 11:01 PM, mike segv <[EMAIL PROTECTED]> wrote: > > I do want to import all documents. My understanding of the way things > work, > correct me if I'm wrong, is that there can be a certain number of documents > included in a single atomic update. Instead of having all my 16 Million > documents be part of a single update (that could more easily fail being so > big), I was thinking that it would be better to be able to stipulate how > many docs are part of an update and my 16 Million doc import would consist > of 16M/100 updates. > > > Shalin Shekhar Mangar wrote: > > > > Hi Mike, > > > > Just curious to know the use-case here. Why do you want to limit updates > > to > > 100 instead of importing all documents? > > > > On Tue, Jun 24, 2008 at 10:23 AM, mike segv <[EMAIL PROTECTED]> wrote: > > > >> > >> That fixed it. > >> > >> If I'm inserting millions of documents, how do I control docs/update? > >> E.g. > >> if there are 50K docs per file, I'm thinking that I should probably code > >> up > >> my own DataSource that allows me to stipulate docs/update. Like say, > 100 > >> instead of 50K. Does this make sense? > >> > >> Mike > >> > >> > >> Noble Paul നോബിള് नोब्ळ् wrote: > >> > > >> > hi , > >> > You have not registered any datasources . the second entity needs a > >> > datasource. > >> > Remove the dataSource="null" and add a name for the second entity > >> > (good practice). No need for baseDir attribute for second entity . > >> > See the modified xml added below > >> > --Noble > >> > > >> > > >> > > >> > > >> > >> > newerThan="'NOW-10DAYS'" recursive="true" rootEntity="false" > >> > dataSource="null" baseDir="/san/tomcat-services/solr-medline"> > >> > >> > forEach="/MedlineCitation" > >> > url="${f.fileAbsolutePath}" > > >> > > >> > > >> > > >> > > >> > > >> > > >> > On Tue, Jun 24, 2008 at 6:39 AM, mike segv <[EMAIL PROTECTED]> wrote: > >> >> > >> >> I'm trying to use the fileListEntityProcessor to add some xml > >> documents > >> >> to a > >> >> solr index. I'm running a nightly version of solr-1.3 with SOLR-469 > >> and > >> >> SOLR-563. I've been able to successfuly run the slashdot > >> httpDataSource > >> >> example. My data-config.xml file loads without errors. When I > >> attempt > >> >> the > >> >> full-import command I get the exception below. Thanks for any help. 
> >> >> [rest of the quoted message (lockType warning, stack trace and data-config, quoted in full earlier in the thread) trimmed]
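For reference, the autoCommit setting Shalin mentions lives inside the updateHandler section of solrconfig.xml; a sketch, with the threshold of 100 purely illustrative:

<updateHandler class="solr.DirectUpdateHandler2">
  <!-- commit automatically after every 100 added documents -->
  <autoCommit>
    <maxDocs>100</maxDocs>
  </autoCommit>
</updateHandler>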
Nutch <-> Solr latest?
Hi, Im curious, is there a spot / patch for the latest on Nutch / Solr integration, Ive found a few pages (a few outdated it seems), it would be nice (?) if it worked as a DataSource type to DataImportHandler, but not sure if that fits w/ how it works. Either way a nice contrib patch the way the DIH is already setup would be nice to have. Is there currently work ongoing on this? Seems like it belongs in either / or project and not both. Thanks. - Jon
Re: SpellCheckComponent: No file-based suggestions + Location issue
Shalin: > The index directory location is being created inside the current working > directory. We should change that. I've opened SOLR-604 and attached a patch > which fixes this. I updated from nightly build to incorporate your fix and it works perfectly, now building the spell indexes in solr/data. Thanks! Grant: > What happens when you open the built index in Luke > (http://www.getopt.org/luke)? Hmm, it looks a bit spacey -- I see the n-grams (n=3,4) but the text looks interspersed with spaces. Perhaps this is an artifact of Luke or n-grams are supposed to be this way, but that would obviously seem problematic. Here are some snips: " h i s t o r y " " p i z z a " "i z" " i " > Did you see any exceptions in your log? Just a warning which I've ignored based on the discussions in SOLR-572: WARNING: No fieldType: null found for dictionary: external. Using WhitespaceAnalzyer. Oddly, even if I specify a fieldType with a legitimate field type (e.g., spell) from my schema.xml, this same warning is thrown, so I assume the parameter is functionless. WARNING: No fieldType: spell found for dictionary: external. Using WhitespaceAnalzyer. Ron
Re: Wildcard search question
Norberto Meijome wrote: ok well let's say that i can live without john/jon in the short term. what i really need today is a case insensitive wildcard search with literal matching (no fancy stemming. bobby is bobby, not bobbi.) what are my options? http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters define your own type (or modify text / string... but I find that it gets confusing to have variations of text / string ...) to perform the operations on the content as needed. There are also other tokenizer/analysers available that *may* help in the partial searches (ngram , edgengram ), but there isn't much documentation on them yet (that I could find) - I am only getting into them myself i'll see how it goes.. thanks, that got me on the right track. i came up with this: now searching for user_name:bobby* works as i wanted. my next question: is there a way that i can score matches that are at the start of the string higher than matches in the middle? for example, if i search for steve, i get kelly stevenson before steve jobs. i'd like steve jobs to come first. -jsd-
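The fieldType definition was stripped out of the message above by the mail archive; a plausible reconstruction of a case-insensitive, non-stemming type for this kind of prefix search (an assumption, not the poster's actual schema) would be:

<fieldType name="string_lc" class="solr.TextField" sortMissingLast="true">
  <analyzer>
    <!-- keep the whole value as a single token: no stemming, no word splitting -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <!-- lowercase so matching is case insensitive -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>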
Re: How to use SOLR1.2
: I am new in SOLR 1.2, configured Admin GUI. Facing problem in using : this. could you pls help me out to configure the nex. the admin GUI isn't really a place where you configure Solr. It's a way to see the status of things -- configuration is done via config files. have you gone through the tutorial? http://lucene.apache.org/solr/tutorial.html if you have some specific questions about problems you are having, please post more detailed questions. -Hoss
Re: UnicodeNormalizationFilterFactory
: I've seen mention of these filters: : : : Are you asking because you saw these in Robert Haschart's reply to your previous question? I think those are custom Filters that he has in his project ... not open source (but i may be wrong) they are certainly not something that comes out of the box w/ Solr. -Hoss
Can I add field compression without reindexing?
I have an index that I eventually want to rebuild so I can set compressed=true on a couple of fields. It's not really practical to rebuild the whole thing right now, though. If I change my schema.xml to set compressed=true and then keep adding new data to the existing index, will this corrupt the index, or will the *new* data be stored in compressed format, even while the old data is not compressed?
Re: Can I specify the default operator at query time ?
: Subject: Can I specify the default operator at query time ? : In-Reply-To: <[EMAIL PROTECTED]> http://people.apache.org/~hossman/#threadhijack Thread Hijacking on Mailing Lists When starting a new discussion on a mailing list, please do not reply to an existing message, instead start a fresh email. Even if you change the subject line of your email, other mail headers still track which thread you replied to and your question is "hidden" in that thread and gets less attention. It makes following discussions in the mailing list archives particularly difficult. See Also: http://en.wikipedia.org/wiki/Thread_hijacking -Hoss
DataImportHandler running out of memory
I'm trying to load ~10 million records into Solr using the DataImportHandler. I'm running out of memory (java.lang.OutOfMemoryError: Java heap space) as soon as I try loading more than about 5 million records. Here's my configuration: I'm connecting to a SQL Server database using the sqljdbc driver. I've given my Solr instance 1.5 GB of memory. I have set the dataSource batchSize to 1. My SQL query is "select top XXX field1, ... from table1". I have about 40 fields in my Solr schema. I thought the DataImportHandler would stream data from the DB rather than loading it all into memory at once. Is that not the case? Any thoughts on how to get around this (aside from getting a machine with more memory)? -- View this message in context: http://www.nabble.com/DataImportHandler-running-out-of-memory-tp18102644p18102644.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: DataImportHandler running out of memory
This is a bug in MySQL. Try setting the Fetch Size on the Statement on the connection to Integer.MIN_VALUE. See http://forums.mysql.com/read.php?39,137457 amongst a host of other discussions on the subject. Basically, it tries to load all the rows into memory, the only alternative is to set the fetch size to Integer.MIN_VALUE so that it gets it one row at a time. I've hit this one myself and it isn't caused by the DataImportHandler, but by the MySQL JDBC handler. -Grant On Jun 24, 2008, at 8:23 PM, wojtekpia wrote: I'm trying to load ~10 million records into Solr using the DataImportHandler. I'm running out of memory (java.lang.OutOfMemoryError: Java heap space) as soon as I try loading more than about 5 million records. Here's my configuration: I'm connecting to a SQL Server database using the sqljdbc driver. I've given my Solr instance 1.5 GB of memory. I have set the dataSource batchSize to 1. My SQL query is "select top XXX field1, ... from table1". I have about 40 fields in my Solr schema. I thought the DataImportHandler would stream data from the DB rather than loading it all into memory at once. Is that not the case? Any thoughts on how to get around this (aside from getting a machine with more memory)? -- View this message in context: http://www.nabble.com/DataImportHandler-running-out-of-memory-tp18102644p18102644.html Sent from the Solr - User mailing list archive at Nabble.com. -- Grant Ingersoll http://www.lucidimagination.com Lucene Helpful Hints: http://wiki.apache.org/lucene-java/BasicsOfPerformance http://wiki.apache.org/lucene-java/LuceneFAQ
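For completeness, the MySQL streaming recipe Grant refers to looks like this in plain JDBC; this is a sketch of the documented workaround, not code taken from DataImportHandler itself.

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class StreamingQuery {
    // MySQL Connector/J only streams rows one at a time when the statement is
    // forward-only, read-only, and the fetch size is Integer.MIN_VALUE.
    static ResultSet streamRows(Connection conn, String sql) throws SQLException {
        Statement stmt = conn.createStatement(ResultSet.TYPE_FORWARD_ONLY,
                                              ResultSet.CONCUR_READ_ONLY);
        stmt.setFetchSize(Integer.MIN_VALUE);
        return stmt.executeQuery(sql);
    }
}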
How to debug ?
hi, I'm trying to understand why a search on a field tokenized with the nGram tokenizer, with minGramSize=n and maxGramSize=m doesn't find any matches for queries of length (in characters) of n+1..m (n works fine). analysis.jsp shows that it SHOULD match, but /select doesn't bring anything back. (For details on this queries, please see my previous post over the last day or so to this list). So i figure there is some difference between what analysis.jsp does and the actual search executed , or what lucene indexes - i imagine analysis.jsp only parses the input in the page with solr's tokenizers/filters but doesn't actually do lucene's part of the job. And I'd like to look into this... What is the suggested approach for this? attach a debugger to jetty's web app ? Are there some pointers on how to debug at this level? Preferably in Eclipse, but beggars cant be choosers ;) thanks!! B _ {Beto|Norberto|Numard} Meijome "Always do right. This will gratify some and astonish the rest." Mark Twain I speak for myself, not my employer. Contents may be hot. Slippery when wet. Reading disclaimers makes you go blind. Writing them is worse. You have been Warned.
Re: How to debug ?
also, check the LukeRequestHandler if there is a document you think *should* match, you can see what tokens it has actually indexed... On Jun 24, 2008, at 7:12 PM, Norberto Meijome wrote: hi, I'm trying to understand why a search on a field tokenized with the nGram tokenizer, with minGramSize=n and maxGramSize=m doesn't find any matches for queries of length (in characters) of n+1..m (n works fine). analysis.jsp shows that it SHOULD match, but /select doesn't bring anything back. (For details on this queries, please see my previous post over the last day or so to this list). So i figure there is some difference between what analysis.jsp does and the actual search executed , or what lucene indexes - i imagine analysis.jsp only parses the input in the page with solr's tokenizers/filters but doesn't actually do lucene's part of the job. And I'd like to look into this... What is the suggested approach for this? attach a debugger to jetty's web app ? Are there some pointers on how to debug at this level? Preferably in Eclipse, but beggars cant be choosers ;) thanks!! B _ {Beto|Norberto|Numard} Meijome "Always do right. This will gratify some and astonish the rest." Mark Twain I speak for myself, not my employer. Contents may be hot. Slippery when wet. Reading disclaimers makes you go blind. Writing them is worse. You have been Warned.
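If the LukeRequestHandler is registered at its usual /admin/luke path in solrconfig.xml, a request roughly like the following dumps what was actually indexed for one document; the parameter names are from memory, so verify them against your version.

http://localhost:8983/solr/admin/luke?id=SOME_DOC_ID&fl=artist_ngram&numTerms=50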
Re: DataImportHandler running out of memory
Setting the batchSize to 1 would mean that the Jdbc driver will keep 1 rows in memory *for each entity* which uses that data source (if correctly implemented by the driver). Not sure how well the Sql Server driver implements this. Also keep in mind that Solr also needs memory to index documents. You can probably try setting the batch size to a lower value. The regular memory tuning stuff should apply here too -- try disabling autoCommit and turn-off autowarming and see if it helps. On Wed, Jun 25, 2008 at 5:53 AM, wojtekpia <[EMAIL PROTECTED]> wrote: > > I'm trying to load ~10 million records into Solr using the > DataImportHandler. > I'm running out of memory (java.lang.OutOfMemoryError: Java heap space) as > soon as I try loading more than about 5 million records. > > Here's my configuration: > I'm connecting to a SQL Server database using the sqljdbc driver. I've > given > my Solr instance 1.5 GB of memory. I have set the dataSource batchSize to > 1. My SQL query is "select top XXX field1, ... from table1". I have > about 40 fields in my Solr schema. > > I thought the DataImportHandler would stream data from the DB rather than > loading it all into memory at once. Is that not the case? Any thoughts on > how to get around this (aside from getting a machine with more memory)? > > -- > View this message in context: > http://www.nabble.com/DataImportHandler-running-out-of-memory-tp18102644p18102644.html > Sent from the Solr - User mailing list archive at Nabble.com. > > -- Regards, Shalin Shekhar Mangar.
Re: How to debug ?
On Tue, 24 Jun 2008 19:17:58 -0700 Ryan McKinley <[EMAIL PROTECTED]> wrote: > also, check the LukeRequestHandler > > if there is a document you think *should* match, you can see what > tokens it has actually indexed... right, I will look into that a bit more. I am actually using the lukeall.jar (0.8.1, linked against lucene 2.4) to look into what got indexed, but I am bit wary of how what I select in the the 'analyzer' drop down option in Luke actually affects what I see. B _ {Beto|Norberto|Numard} Meijome "Web2.0 is outsourced R&D from Web1.0 companies." The Reverend I speak for myself, not my employer. Contents may be hot. Slippery when wet. Reading disclaimers makes you go blind. Writing them is worse. You have been Warned.
RE: UnicodeNormalizationFilterFactory
ISOLatin1AccentFilterFactory works quite well for us. It solves our basic euro-text keyboard searching problem, where "protege" should find protégé. ("protege" with two accents.) -Original Message- From: Chris Hostetter [mailto:[EMAIL PROTECTED] Sent: Tuesday, June 24, 2008 4:05 PM To: solr-user@lucene.apache.org Subject: Re: UnicodeNormalizationFilterFactory : I've seen mention of these filters: : : : Are you asking because you saw these in Robert Haschart's reply to your previous question? I think those are custom Filters that he has in his project ... not open source (but i may be wrong) they are certainly not something that comes out of the box w/ Solr. -Hoss
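For reference, ISOLatin1AccentFilterFactory is a stock Solr filter and slots into an analyzer like any other; a minimal sketch (the field type name and filter ordering are chosen for the example, not taken from the poster's schema):

<fieldType name="text_noaccents" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- fold accented characters to their ASCII equivalents, so "protege" matches "protégé" -->
    <filter class="solr.ISOLatin1AccentFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>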
Re: DataImportHandler running out of memory
DIH streams rows one by one. set the fetchSize="-1" this might help. It may make the indexing a bit slower but memory consumption would be low. The memory is consumed by the jdbc driver. try tuning the -Xmx value for the VM --Noble On Wed, Jun 25, 2008 at 8:05 AM, Shalin Shekhar Mangar <[EMAIL PROTECTED]> wrote: > Setting the batchSize to 1 would mean that the Jdbc driver will keep > 1 rows in memory *for each entity* which uses that data source (if > correctly implemented by the driver). Not sure how well the Sql Server > driver implements this. Also keep in mind that Solr also needs memory to > index documents. You can probably try setting the batch size to a lower > value. > > The regular memory tuning stuff should apply here too -- try disabling > autoCommit and turn-off autowarming and see if it helps. > > On Wed, Jun 25, 2008 at 5:53 AM, wojtekpia <[EMAIL PROTECTED]> wrote: > >> >> I'm trying to load ~10 million records into Solr using the >> DataImportHandler. >> I'm running out of memory (java.lang.OutOfMemoryError: Java heap space) as >> soon as I try loading more than about 5 million records. >> >> Here's my configuration: >> I'm connecting to a SQL Server database using the sqljdbc driver. I've >> given >> my Solr instance 1.5 GB of memory. I have set the dataSource batchSize to >> 1. My SQL query is "select top XXX field1, ... from table1". I have >> about 40 fields in my Solr schema. >> >> I thought the DataImportHandler would stream data from the DB rather than >> loading it all into memory at once. Is that not the case? Any thoughts on >> how to get around this (aside from getting a machine with more memory)? >> >> -- >> View this message in context: >> http://www.nabble.com/DataImportHandler-running-out-of-memory-tp18102644p18102644.html >> Sent from the Solr - User mailing list archive at Nabble.com. >> >> > > > -- > Regards, > Shalin Shekhar Mangar. > -- --Noble Paul
Re: DataImportHandler running out of memory
it is batchSize="-1" not fetchSize. Or keep it to a very small value. --Noble On Wed, Jun 25, 2008 at 9:31 AM, Noble Paul നോബിള് नोब्ळ् <[EMAIL PROTECTED]> wrote: > DIH streams rows one by one. > set the fetchSize="-1" this might help. It may make the indexing a bit > slower but memory consumption would be low. > The memory is consumed by the jdbc driver. try tuning the -Xmx value for the > VM > --Noble > > On Wed, Jun 25, 2008 at 8:05 AM, Shalin Shekhar Mangar > <[EMAIL PROTECTED]> wrote: >> Setting the batchSize to 1 would mean that the Jdbc driver will keep >> 1 rows in memory *for each entity* which uses that data source (if >> correctly implemented by the driver). Not sure how well the Sql Server >> driver implements this. Also keep in mind that Solr also needs memory to >> index documents. You can probably try setting the batch size to a lower >> value. >> >> The regular memory tuning stuff should apply here too -- try disabling >> autoCommit and turn-off autowarming and see if it helps. >> >> On Wed, Jun 25, 2008 at 5:53 AM, wojtekpia <[EMAIL PROTECTED]> wrote: >> >>> >>> I'm trying to load ~10 million records into Solr using the >>> DataImportHandler. >>> I'm running out of memory (java.lang.OutOfMemoryError: Java heap space) as >>> soon as I try loading more than about 5 million records. >>> >>> Here's my configuration: >>> I'm connecting to a SQL Server database using the sqljdbc driver. I've >>> given >>> my Solr instance 1.5 GB of memory. I have set the dataSource batchSize to >>> 1. My SQL query is "select top XXX field1, ... from table1". I have >>> about 40 fields in my Solr schema. >>> >>> I thought the DataImportHandler would stream data from the DB rather than >>> loading it all into memory at once. Is that not the case? Any thoughts on >>> how to get around this (aside from getting a machine with more memory)? >>> >>> -- >>> View this message in context: >>> http://www.nabble.com/DataImportHandler-running-out-of-memory-tp18102644p18102644.html >>> Sent from the Solr - User mailing list archive at Nabble.com. >>> >>> >> >> >> -- >> Regards, >> Shalin Shekhar Mangar. >> > > > > -- > --Noble Paul > -- --Noble Paul
Re: How to debug ?
Hello Beto, There is a plugin for jetty: http://webtide.com/eclipse. Insert this as an update site and let eclipse install the plugin for you. You can then start the jetty server from eclipse and debug it. Brian. On Wednesday, 25.06.2008, at 12:48 +1000, Norberto Meijome wrote: > On Tue, 24 Jun 2008 19:17:58 -0700 > Ryan McKinley <[EMAIL PROTECTED]> wrote: > > > also, check the LukeRequestHandler > > > > if there is a document you think *should* match, you can see what > > tokens it has actually indexed... > > right, I will look into that a bit more. > > I am actually using the lukeall.jar (0.8.1, linked against lucene 2.4) to look > into what got indexed, but I am bit wary of how what I select in the the > 'analyzer' drop down option in Luke actually affects what I see. > > B > > _ > {Beto|Norberto|Numard} Meijome > > "Web2.0 is outsourced R&D from Web1.0 companies." > The Reverend > > I speak for myself, not my employer. Contents may be hot. Slippery when wet. > Reading disclaimers makes you go blind. Writing them is worse. You have been > Warned.
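If you would rather attach a debugger from outside the IDE, the standard JPDA flags also work with the example Jetty launcher; this assumes you start Solr with start.jar, and the port number is arbitrary.

java -Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=n,address=8000 -jar start.jar

Then point Eclipse's remote (socket attach) debug configuration at port 8000.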
Re: How to debug ?
On Wed, 25 Jun 2008 08:37:35 +0200 Brian Carmalt <[EMAIL PROTECTED]> wrote: > There is a plugin for jetty: http://webtide.com/eclipse. Insert this as > and update site and let eclipse install the plugin for you You can then > start the jetty server from eclipse and debug it. Thanks Brian, good information :) B _ {Beto|Norberto|Numard} Meijome Q. How do you make God laugh? A. Tell him your plans. I speak for myself, not my employer. Contents may be hot. Slippery when wet. Reading disclaimers makes you go blind. Writing them is worse. You have been Warned.