several tokenizers in one field type

2008-06-24 Thread Norberto Meijome
hi all,
( I'm using 1.3 nightly build from 15th June 08.) 

Is there some documentation about how analysers + tokenizers are applied in
fields ?  In particular, my question :

- If I define 2 tokenizers in a fieldtype, only the first one is applied, the
other is ignored. Is that because the 2nd tokenizer would have to work
recursively on the tokens generated from the previous one? Would I have to
create my custom tokenizer to perform the job of 2 existing tokenizers in one ? 

I'll send some other questions in a separate email...
thx
B
_
{Beto|Norberto|Numard} Meijome

"Build a system that even a fool can use, and only a fool will want to use it."
   George Bernard Shaw

I speak for myself, not my employer. Contents may be hot. Slippery when wet.
Reading disclaimers makes you go blind. Writing them is worse. You have been
Warned.


Re: several tokenizers in one field type

2008-06-24 Thread Ryan McKinley


On Jun 24, 2008, at 12:07 AM, Norberto Meijome wrote:

hi all,
( I'm using 1.3 nightly build from 15th June 08.)

Is there some documentation about how analysers + tokenizers are  
applied in

fields ?  In particular, my question :



best docs are here:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters


- If I define 2 tokenizers in a fieldtype, only the first one is  
applied, the

other is ignored. Is that because the 2nd tokenizer would have to work
recursively on the tokens generated from the previous one? Would I  
have to
create my custom tokenizer to perform the job of 2 existing  
tokenizers in one ?


if you define two tokenizers, solr should throw an error... the second one can't do anything.


The tokenizer breaks the input stream into a stream of tokens, then  
token filters can modify these tokens.
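
For instance, a minimal analyzer sketch (names and files here are just
illustrative) has exactly one tokenizer, followed by any number of filters
that transform the tokens it emits:

  <fieldType name="text_example" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>  <!-- exactly one tokenizer -->
      <filter class="solr.LowerCaseFilterFactory"/>         <!-- filters then reshape its tokens -->
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    </analyzer>
  </fieldType>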


ryan




(Edge)NGram tokenizer interaction with other filters

2008-06-24 Thread Norberto Meijome
hi everyone,


if I define a field as 

<fieldType name="..." class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="1"/>
    <tokenizer class="org.apache.solr.analysis.EdgeNGramTokenizerFactory"
            minGramSize="2" maxGramSize="15"/>
  </analyzer>
  <analyzer type="query">
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0"/>
    <tokenizer class="org.apache.solr.analysis.EdgeNGramTokenizerFactory"
            minGramSize="2" maxGramSize="15"/>
  </analyzer>
</fieldType>

I would expect that, when pushing data into it, this is what would happen:
 - Stop words removed by StopFilterFactory
 - content broken into several 'words' as per WordDelimiterFilterFactory.
 - the result of all this passed to EdgeNGram (or nGram) tokenizer

so, when indexing 'The Veronicas', only 'Veronicas' would reach the NGram 
tokenizer

What I find is that the n-gram tokenizers kick in first, and the filters after, 
making it a rather moot exercise. I've confirmed the steps in analysis.jsp :

Index Analyzer
org.apache.solr.analysis.NGramTokenizerFactory {maxGramSize=15, minGramSize=2}
[..]
org.apache.solr.analysis.StopFilterFactory {words=stopwords.txt, 
ignoreCase=true, enablePositionIncrements=true}
[..]
org.apache.solr.analysis.LowerCaseFilterFactory {}
[...]
org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory {}
[...]

What am I doing / understanding wrong? 

thanks!!
B
_
{Beto|Norberto|Numard} Meijome

Windows caters to everyone as though they are idiots. UNIX makes no such 
assumption. It assumes you know what you are doing, and presents the challenge 
of figuring  it out for yourself if you don't.

I speak for myself, not my employer. Contents may be hot. Slippery when wet. 
Reading disclaimers makes you go blind. Writing them is worse. You have been 
Warned.


Re: several tokenizers in one field type

2008-06-24 Thread Norberto Meijome
On Tue, 24 Jun 2008 00:14:57 -0700
Ryan McKinley <[EMAIL PROTECTED]> wrote:

> best docs are here:
> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters

yes, I've been reading that already , thanks :) 
> 
> > - If I define 2 tokenizers in a fieldtype, only the first one is  
> > applied, the
> > other is ignored. Is that because the 2nd tokenizer would have to work
> > recursively on the tokens generated from the previous one? Would I  
> > have to
> > create my custom tokenizer to perform the job of 2 existing  
> > tokenizers in one ?  
> 
> if you define two tokenizers, solr should throw an error  the  
> second one can't do anything.

no error that I can see - i'm using the default log settings from the solr test 
app bundled with nightly build.
 
> The tokenizer breaks the input stream into a stream of tokens, then  
> token filters can modify these tokens.

ok, that makes sense. That *should* explain what I described in my other email 
(Subject: (Edge)NGram tokenizer interaction with other filters).

thanks a lot Ryan :)
B
_
{Beto|Norberto|Numard} Meijome

"Tell a person you're the Metatron and they stare at you blankly. Mention 
something out of a Charleton Heston movie and suddenly everyone's a Theology 
scholar!"
   Dogma

I speak for myself, not my employer. Contents may be hot. Slippery when wet. 
Reading disclaimers makes you go blind. Writing them is worse. You have been 
Warned.


Parser of Response XML

2008-06-24 Thread Ranjeet
Hi,

Is there any class available in the Solr API to parse the response XML?

Regards,
Ranjeet

Re: Parser of Response XML

2008-06-24 Thread Noble Paul നോബിള്‍ नोब्ळ्
org.apache.solr.client.solrj.impl.XMLResponseParser
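
For example, a small SolrJ sketch (assuming the 1.3 SolrJ client and a local
example server; the URL and query string are placeholders):

  import org.apache.solr.client.solrj.SolrQuery;
  import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
  import org.apache.solr.client.solrj.impl.XMLResponseParser;
  import org.apache.solr.client.solrj.response.QueryResponse;

  // point SolrJ at the server and have it parse the XML response format
  CommonsHttpSolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
  server.setParser(new XMLResponseParser());
  QueryResponse rsp = server.query(new SolrQuery("solr"));
  System.out.println("hits: " + rsp.getResults().getNumFound());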

On Tue, Jun 24, 2008 at 3:06 PM, Ranjeet <[EMAIL PROTECTED]> wrote:
> Hi,
>
> is any class is available in SOLR API to parse the response XML?
>
> Regards,
> Ranjeet



-- 
--Noble Paul


Re: (Edge)NGram tokenizer interaction with other filters

2008-06-24 Thread Otis Gospodnetic
One tokenizer is followed by filters.  I think this all might be a bit clearer 
if you read the chapter about Analyzers in Lucene in Action if you have a copy. 
 I think if you try to break down that "the result of all this passed to " into 
something more concrete and real you will see how things (should) work.


Otis --
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


- Original Message 
> From: Norberto Meijome <[EMAIL PROTECTED]>
> To: SOLR-Usr-ML 
> Sent: Tuesday, June 24, 2008 3:19:09 AM
> Subject: (Edge)NGram tokenizer interaction with other filters
> 
> hi everyone,
> 
> 
> if I define a field as 
> 
>    <fieldType name="..." class="solr.TextField" positionIncrementGap="100">
>      <analyzer type="index">
>        <filter class="solr.StopFilterFactory" ignoreCase="true"
>                words="stopwords.txt" enablePositionIncrements="true"/>
>        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
>                generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="1"/>
>        <tokenizer class="org.apache.solr.analysis.EdgeNGramTokenizerFactory"
>                minGramSize="2" maxGramSize="15"/>
>      </analyzer>
>      <analyzer type="query">
>        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
>        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
>                generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0"/>
>        <tokenizer class="org.apache.solr.analysis.EdgeNGramTokenizerFactory"
>                minGramSize="2" maxGramSize="15"/>
>      </analyzer>
>    </fieldType>
> 
> I would expect that, when pushing data into it, this is what would happen:
> - Stop words removed by StopFilterFactory
> - content broken into several 'words' as per WordDelimiterFilterFactory.
> - the result of all this passed to EdgeNGram (or nGram) tokenizer
> 
> so, when indexing 'The Veronicas', only 'Veronicas' would reach the NGram 
> tokenizer
> 
> What I find is that the n-gram tokenizers kick in first, and the filters 
> after, 
> making it a rather moot exercise. I've confirmed the steps in analysis.jsp :
> 
> Index Analyzer
> org.apache.solr.analysis.NGramTokenizerFactory {maxGramSize=15, minGramSize=2}
> [..]
> org.apache.solr.analysis.StopFilterFactory {words=stopwords.txt, 
> ignoreCase=true, enablePositionIncrements=true}
> [..]
> org.apache.solr.analysis.LowerCaseFilterFactory {}
> [...]
> org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory {}
> [...]
> 
> What am I doing / understanding wrong? 
> 
> thanks!!
> B
> _
> {Beto|Norberto|Numard} Meijome
> 
> Windows caters to everyone as though they are idiots. UNIX makes no such 
> assumption. It assumes you know what you are doing, and presents the 
> challenge 
> of figuring  it out for yourself if you don't.
> 
> I speak for myself, not my employer. Contents may be hot. Slippery when wet. 
> Reading disclaimers makes you go blind. Writing them is worse. You have been 
> Warned.



SOLR-139 (Support updateable/modifiable documents)

2008-06-24 Thread Dave Searle
Hi,



Does anyone know if SOLR-139 (Support updateable/modifiable documents) will 
make it back into the 1.3 release? I'm looking for a way to append data to a 
multivalued field in a document over a period of time (in which the document 
represents a forum thread and the multivalued field represents the messages 
attached to this thread).



Thanks,

Dave



Dave Searle
Lead Developer MPS
Magicalia Ltd.





Thank you for your interest in Magicalia Media.

www.magicalia.com

Special interest communities are Magicalia's mission in life.  Magicalia 
publishes specialist websites 
and magazine titles for people who have a passion for their hobby, sport or 
area of interest.

For further information, please call 01689 899200 or fax 01689 899266.

Magicalia Ltd, Caxton House, 2 Farringdon Road, London, EC1M 3HN 
Registered: England & Wales, Registered Number: 3828584, VAT Number: 744 4983 00

Magicalia Publishing Ltd, Berwick House, 8-10 Knoll Rise, Orpington, BR6 0EL
Registered: England & Wales, Registered Number: 5649018, VAT Number: 872 8179 83

Magicalia Media Ltd, Caxton House, 2 Farringdon Road, London, EC1M 3HN 
Registered: England & Wales, Registered Number: 5780320, VAT Number: 888 0357 82

This email and any files transmitted with it are confidential and intended 
solely for the use of the individual 
or entity to which they are addressed.  If you have received this email in 
error please reply to this email and 
then delete it.  Please note that any views or opinions presented in this email 
are solely those of the author 
and do not necessarily represent those of Magicalia. The recipient should check 
this email and any 
attachments for the presence of viruses.  Magicalia accepts no liability for 
any damage caused by any virus 
transmitted by this email. Magicalia may regularly and randomly monitor 
outgoing and incoming emails and 
other telecommunications on its email and telecommunications systems. By 
replying to this email you give 
your consent to such monitoring. Copyright in this e-mail and any attachments 
created by Magicalia Media 
belongs to Magicalia Media.


Re: (Edge)NGram tokenizer interaction with other filters

2008-06-24 Thread Norberto Meijome
On Tue, 24 Jun 2008 04:54:46 -0700 (PDT)
Otis Gospodnetic <[EMAIL PROTECTED]> wrote:

> One tokenizer is followed by filters.  I think this all might be a bit 
> clearer if you read the chapter about Analyzers in Lucene in Action if you 
> have a copy.  I think if you try to break down that "the result of all this 
> passed to " into something more concrete and real you will see how things 
> (should) work.


thanks Otis, from this and Ryan's previous reply I understand I was mistaken on
how I was seeing the process - I was expecting the filters / tokenizers to work
as processes with the output of one going to the input of the next, in the
order shown in the fieldType definition... now that I write this I remember
reading some posts on this list about doing something like this... open-pipe?

anyway, it makes sense... not what I was hoping for, but it's what I have to
work with. 

Now, if only I can get n-gram to work with search terms > minGramSize :P

Thanks for your time, help and recommendation of Lucene in Action.

B
_
{Beto|Norberto|Numard} Meijome

"The greatest dangers to liberty lurk in insidious encroachment by men of zeal,
well-meaning but without understanding." Justice Louis D. Brandeis

I speak for myself, not my employer. Contents may be hot. Slippery when wet.
Reading disclaimers makes you go blind. Writing them is worse. You have been
Warned.


SOLR-469 - bad patch?

2008-06-24 Thread Jon Baer
It seems the new patch @ https://issues.apache.org/jira/browse/SOLR-469 is 2x
the size, but it turns out the patch itself might be bad?


I.e., it dumps build.xml twice -- is it just me?

Thanks.

- Jon


Re: SOLR-139 (Support updateable/modifiable documents)

2008-06-24 Thread Otis Gospodnetic
I don't know if SOLR-139 will make it into 1.3, but from your brief 
description, I'd say you might want to consider a different schema for your 
data.  Stuffing thread messages in the same doc that represents a thread may 
not be the best choice.  Of course, you may have good reasons for doing that, I 
just don't know them.


Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


- Original Message 
> From: Dave Searle <[EMAIL PROTECTED]>
> To: "solr-user@lucene.apache.org" 
> Sent: Tuesday, June 24, 2008 8:34:47 AM
> Subject: SOLR-139 (Support updateable/modifiable documents)
> 
> Hi,
> 
> 
> 
> Does anyone know if SOLR-139 (Support updateable/modifiable documents) will 
> make 
> it back into the 1.3 release? I'm looking for a way to append data to a 
> multivalued field in a document over a period of time (in which the document 
> represents a forum thread and the multivalued field represents the messages 
> attached to this thread).
> 
> 
> 
> Thanks,
> 
> Dave
> 
> 
> 
> Dave Searle
> Lead Developer MPS
> Magicalia Ltd.
> 
> 
> 
> 
> 
> Thank you for your interest in Magicalia Media.
> 
> www.magicalia.com
> 
> Special interest communities are Magicalia's mission in life.  Magicalia 
> publishes specialist websites 
> and magazine titles for people who have a passion for their hobby, sport or 
> area 
> of interest.
> 
> For further information, please call 01689 899200 or fax 01689 899266.
> 
> Magicalia Ltd, Caxton House, 2 Farringdon Road, London, EC1M 3HN 
> Registered: England & Wales, Registered Number: 3828584, VAT Number: 744 4983 
> 00
> 
> Magicalia Publishing Ltd, Berwick House, 8-10 Knoll Rise, Orpington, BR6 0EL
> Registered: England & Wales, Registered Number: 5649018, VAT Number: 872 8179 
> 83
> 
> Magicalia Media Ltd, Caxton House, 2 Farringdon Road, London, EC1M 3HN 
> Registered: England & Wales, Registered Number: 5780320, VAT Number: 888 0357 
> 82
> 
> This email and any files transmitted with it are confidential and intended 
> solely for the use of the individual 
> or entity to which they are addressed.  If you have received this email in 
> error 
> please reply to this email and 
> then delete it.  Please note that any views or opinions presented in this 
> email 
> are solely those of the author 
> and do not necessarily represent those of Magicalia. The recipient should 
> check 
> this email and any 
> attachments for the presence of viruses.  Magicalia accepts no liability for 
> any 
> damage caused by any virus 
> transmitted by this email. Magicalia may regularly and randomly monitor 
> outgoing 
> and incoming emails and 
> other telecommunications on its email and telecommunications systems. By 
> replying to this email you give 
> your consent to such monitoring. Copyright in this e-mail and any attachments 
> created by Magicalia Media 
> belongs to Magicalia Media.



Re: Accented search

2008-06-24 Thread climbingrose
Here is how I did it (the code is from memory so it might not be correct
100%):
private boolean hasAccents;
private Token filteredToken;

public final Token next() throws IOException {
  if (hasAccents) {
    // emit the unaccented clone queued up on the previous call
    hasAccents = false;
    return filteredToken;
  }
  Token t = input.next();
  if (t == null) {
    return null; // end of the token stream
  }
  String filteredText = removeAccents(t.termText());
  if (filteredText.equals(t.termText())) { // no accents
    return t;
  } else {
    // queue a clone with the accents stripped; positionIncrement 0 keeps it
    // at the same position as the original token
    filteredToken = (Token) t.clone();
    filteredToken.setTermText(filteredText);
    filteredToken.setPositionIncrement(0);
    hasAccents = true;
  }
  return t;
}

On Sat, Jun 21, 2008 at 2:37 AM, Phillip Farber <[EMAIL PROTECTED]> wrote:

> Regarding indexing words with accented and unaccented characters with
> positionIncrement zero:
>
> Chris Hostetter wrote:
>
>>
>> you don't really need a custom tokenizer -- just a buffered TokenFilter
>> that clones the original token if it contains accent chars, mutates the
>> clone, and then emits it next with a positionIncrement of 0.
>>
>>
> Could someone expand on how to implement this technique of buffering and
> cloning?
>
> Thanks,
>
> Phil
>



-- 
Regards,

Cuong Hoang


Otis : Re: n-Gram, only works with queries of 2 letters

2008-06-24 Thread Norberto Meijome
On Tue, 24 Jun 2008 09:10:58 +1000
Norberto Meijome <[EMAIL PROTECTED]> wrote:

> On Mon, 23 Jun 2008 05:33:49 -0700 (PDT)
> Otis Gospodnetic <[EMAIL PROTECTED]> wrote:
> 
> > Hi,
> > 
> > 
> > When you add &debugQuery=true to the request, what does your query look 
> > like after parsing?

hi Otis, can you provide some insight as to what is going on here? Am I only 
supposed to use search terms of length = minGramSize against fields tokenized 
with the nGramTokenizer?

Any pointers will be greatly appreciated.

TIA for your time,
Beto

> 
> Hi Otis,
> sorry, i should have sent this before too.
> With minGramSize = 3 , same data, clean server start, index rebuilt. 2 cases 
> shown below, one not working, one working. The 4 letter case (not working) 
> seems to be parsed properly, and as expected one of the tokens generated is 
> same as my 3 letter query that does work.
> 
> DOESN'T WORK AS EXPECTED CASE
> 
> q=eche, df=artist_ngram, debugQuery=true
> 
> status: 0, QTime: 53
> rawquerystring / querystring: eche
> parsedquery: PhraseQuery(artist_ngram:"ech che eche")
> parsedquery_toString: artist_ngram:"ech che eche"
> (no result documents)
> 
> ---
> 
> WORKS AS EXPECTED CASE
> 
> http://localhost:8983/solr/_test_/select?q=ech&df=artist_ngram&debugQuery=true
> 
> status: 0, QTime: 57
> 
> result doc: Depeche Mode / Depeche Mode / Depeche Mode / 2008-06-23T06:28:36.758Z
> 
> rawquerystring / querystring: ech
> parsedquery / parsedquery_toString: artist_ngram:ech
> 
> explain:
> 0.90429556 = (MATCH) fieldWeight(artist_ngram:ech in 43), product of:
>   1.0 = tf(termFreq(artist_ngram:ech)=1)
>   5.787492 = idf(docFreq=1, numDocs=240)
>   0.15625 = fieldNorm(field=artist_ngram, doc=43)
> 
> Thanks,
> B


_
{Beto|Norberto|Numard} Meijome

Software QA is like cleaning my cat's litter box: Sift out the big chunks. Stir 
in the rest. Hope it doesn't stink.

I speak for myself, not my employer. Contents may be hot. Slippery when wet. 
Reading disclaimers makes you go blind. Writing them is worse. You have been 
Warned.


RE: SOLR-139 (Support updateable/modifiable documents)

2008-06-24 Thread Dave Searle
Thanks Otis,

At the moment I have an index of forum messages (each message being a separate 
doc). Results are displayed on a per message basis, however, I would like to 
group the results via their thread. Apart from using a facet on the thread 
title (which would lose relevancy), I cannot see a way of doing this.

So my idea was to build a new index with the thread being the main document 
entity and a multivalued field for the message data. Using the work done in 
SOLR-139 I could then update this field as new messages are posted (and any 
other thread fields such as message count, date of the last post and so on)

Without SOLR 139, I would currently have to re-index the whole thread; some 
threads having thousands of messages which could obviously take some time! :)

Am I looking at this from the wrong angle? Have you come across similar 
scenarios?

Thanks for your time,
Dave



-Original Message-
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]
Sent: 24 June 2008 15:33
To: solr-user@lucene.apache.org
Subject: Re: SOLR-139 (Support updateable/modifiable documents)

I don't know if SOLR-139 will make it into 1.3, but from your brief 
description, I'd say you might want to consider a different schema for your 
data.  Stuffing thread messages in the same doc that represents a thread may 
not be the best choice.  Of course, you may have good reasons for doing that, I 
just don't know them.


Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


- Original Message 
> From: Dave Searle <[EMAIL PROTECTED]>
> To: "solr-user@lucene.apache.org" 
> Sent: Tuesday, June 24, 2008 8:34:47 AM
> Subject: SOLR-139 (Support updateable/modifiable documents)
>
> Hi,
>
>
>
> Does anyone know if SOLR-139 (Support updateable/modifiable documents) will 
> make
> it back into the 1.3 release? I'm looking for a way to append data to a
> multivalued field in a document over a period of time (in which the document
> represents a forum thread and the multivalued field represents the messages
> attached to this thread).
>
>
>
> Thanks,
>
> Dave
>
>
>
> Dave Searle
> Lead Developer MPS
> Magicalia Ltd.
>
>
>
>
>
> Thank you for your interest in Magicalia Media.
>
> www.magicalia.com
>
> Special interest communities are Magicalia's mission in life.  Magicalia
> publishes specialist websites
> and magazine titles for people who have a passion for their hobby, sport or 
> area
> of interest.
>
> For further information, please call 01689 899200 or fax 01689 899266.
>
> Magicalia Ltd, Caxton House, 2 Farringdon Road, London, EC1M 3HN
> Registered: England & Wales, Registered Number: 3828584, VAT Number: 744 4983 
> 00
>
> Magicalia Publishing Ltd, Berwick House, 8-10 Knoll Rise, Orpington, BR6 0EL
> Registered: England & Wales, Registered Number: 5649018, VAT Number: 872 8179 
> 83
>
> Magicalia Media Ltd, Caxton House, 2 Farringdon Road, London, EC1M 3HN
> Registered: England & Wales, Registered Number: 5780320, VAT Number: 888 0357 
> 82
>
> This email and any files transmitted with it are confidential and intended
> solely for the use of the individual
> or entity to which they are addressed.  If you have received this email in 
> error
> please reply to this email and
> then delete it.  Please note that any views or opinions presented in this 
> email
> are solely those of the author
> and do not necessarily represent those of Magicalia. The recipient should 
> check
> this email and any
> attachments for the presence of viruses.  Magicalia accepts no liability for 
> any
> damage caused by any virus
> transmitted by this email. Magicalia may regularly and randomly monitor 
> outgoing
> and incoming emails and
> other telecommunications on its email and telecommunications systems. By
> replying to this email you give
> your consent to such monitoring. Copyright in this e-mail and any attachments
> created by Magicalia Media
> belongs to Magicalia Media.



__ Information from ESET Smart Security, version of virus signature 
database 3213 (20080624) __

The message was checked by ESET Smart Security.

http://www.eset.com



__ Information from ESET Smart Security, version of virus signature 
database 3213 (20080624) __

The message was checked by ESET Smart Security.

http://www.eset.com



Re: SOLR-139 (Support updateable/modifiable documents)

2008-06-24 Thread Norberto Meijome
On Tue, 24 Jun 2008 16:04:24 +0100
Dave Searle <[EMAIL PROTECTED]> wrote:

> At the moment I have an index of forum messages (each message being a 
> separate doc). Results are displayed on a per message basis, however, I would 
> like to group the results via their thread. Apart from using a facet on the 
> thread title (which would lose relevancy), I cannot see a way of doing this.

what about storing the thread id (+other information needed to regenerate the 
messages in order) instead of the subject as a facet ? or just use the 
thread_id as a filter...
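
e.g. something along these lines (field names are illustrative):

  q=your+search+terms&fq=thread_id:12345&sort=post_date+asc

fq restricts the result set without affecting the relevancy scoring of q, so 
the messages of one thread can be pulled back in order fairly cheaply.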

B

_
{Beto|Norberto|Numard} Meijome

Hildebrant's Principle:
If you don't know where you are going,
any road will get you there.

I speak for myself, not my employer. Contents may be hot. Slippery when wet. 
Reading disclaimers makes you go blind. Writing them is worse. You have been 
Warned.


RE: SOLR-139 (Support updateable/modifiable documents)

2008-06-24 Thread Dave Searle
I am currently storing the thread id within the message index; however, although 
this would allow me to sort, it doesn't help with grouping threads based on 
relevancy. The idea is to index message data in the thread documents and then 
boost the message multivalued field over the thread title and thread description 
(which, in my opinion, would give better results).

The user, when presented with the thread results, could then drill down into a 
particular thread's messages using the same search terms on the message index 
(but filtered by the referring thread id)

-Original Message-
From: Norberto Meijome [mailto:[EMAIL PROTECTED]
Sent: 24 June 2008 16:16
To: solr-user@lucene.apache.org
Subject: Re: SOLR-139 (Support updateable/modifiable documents)

On Tue, 24 Jun 2008 16:04:24 +0100
Dave Searle <[EMAIL PROTECTED]> wrote:

> At the moment I have an index of forum messages (each message being a 
> separate doc). Results are displayed on a per message basis, however, I would 
> like to group the results via their thread. Apart from using a facet on the 
> thread title (which would lose relevancy), I cannot see a way of doing this.

what about storing the thread id (+other information needed to regenerate the 
messages in order) instead of the subject as a facet ? or just use the 
thread_id as a filter...

B

_
{Beto|Norberto|Numard} Meijome

Hildebrant's Principle:
If you don't know where you are going,
any road will get you there.

I speak for myself, not my employer. Contents may be hot. Slippery when wet. 
Reading disclaimers makes you go blind. Writing them is worse. You have been 
Warned.





__ Information from ESET Smart Security, version of virus signature 
database 3213 (20080624) __

The message was checked by ESET Smart Security.

http://www.eset.com



__ Information from ESET Smart Security, version of virus signature 
database 3213 (20080624) __

The message was checked by ESET Smart Security.

http://www.eset.com



Thank you for your interest in Magicalia Media.

www.magicalia.com

Special interest communities are Magicalia's mission in life.  Magicalia 
publishes specialist websites 
and magazine titles for people who have a passion for their hobby, sport or 
area of interest.

For further information, please call 01689 899200 or fax 01689 899266.

Magicalia Ltd, Caxton House, 2 Farringdon Road, London, EC1M 3HN 
Registered: England & Wales, Registered Number: 3828584, VAT Number: 744 4983 00

Magicalia Publishing Ltd, Berwick House, 8-10 Knoll Rise, Orpington, BR6 0EL
Registered: England & Wales, Registered Number: 5649018, VAT Number: 872 8179 83

Magicalia Media Ltd, Caxton House, 2 Farringdon Road, London, EC1M 3HN 
Registered: England & Wales, Registered Number: 5780320, VAT Number: 888 0357 82

This email and any files transmitted with it are confidential and intended 
solely for the use of the individual 
or entity to which they are addressed.  If you have received this email in 
error please reply to this email and 
then delete it.  Please note that any views or opinions presented in this email 
are solely those of the author 
and do not necessarily represent those of Magicalia. The recipient should check 
this email and any 
attachments for the presence of viruses.  Magicalia accepts no liability for 
any damage caused by any virus 
transmitted by this email. Magicalia may regularly and randomly monitor 
outgoing and incoming emails and 
other telecommunications on its email and telecommunications systems. By 
replying to this email you give 
your consent to such monitoring. Copyright in this e-mail and any attachments 
created by Magicalia Media 
belongs to Magicalia Media.


Re: Accented search

2008-06-24 Thread Robert Haschart


climbingrose wrote:


Here is how I did it (the code is from memory so it might not be correct
100%):
private boolean hasAccents;
private Token filteredToken;

public final Token next() throws IOException {
 if (hasAccents) {
   hasAccents = false;
   return filteredToken;
 }
 Token t = input.next();
 String filteredText = removeAccents(t.termText());
 if (filteredText.equals(t.termText())) { // no accents
   return t;
 } else {
   filteredToken = (Token) t.clone();
   filteredToken.setTermText(filteredText);
   filteredToken.setPositionIncrement(0);
   hasAccents = true;
 }
 return t;
}

On Sat, Jun 21, 2008 at 2:37 AM, Phillip Farber <[EMAIL PROTECTED]> wrote:

 


Regarding indexing words with accented and unaccented characters with
positionIncrement zero:

Chris Hostetter wrote:

   


you don't really need a custom tokenizer -- just a buffered TokenFilter
that clones the original token if it contains accent chars, mutates the
clone, and then emits it next with a positionIncrement of 0.


 


Could someone expand on how to implement this technique of buffering and
cloning?

Thanks,

Phil

   



 

I just was facing the same issue and came up with the following as a 
solution.


I changed the Schema.xml file so that for the text field the analyzers 
and filters are as follows:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="UnicodeNormalizationFilterFactory"/>            <!-- new -->
    <filter class="..."/>                                          <!-- new: accent removal -->
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="UnicodeNormalizationFilterFactory"/>            <!-- new -->
    <filter class="..."/>                                          <!-- new: accent removal -->
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

The two filter lines marked "new" are the additions: the first invokes a custom 
filter that I borrowed and modified that turns decomposed unicode (like Pe'rez) 
into the composed form (Pérez); the second replaces accented characters with 
their unaccented equivalents (Perez).


For the custom filter to work, you must create a lib directory as a 
sibling to the conf directory and place the jar files containing the 
custom filter there.


The Jars can be downloaded from the blacklight subversion repository at:

http://blacklight.rubyforge.org/svn/trunk/solr/lib/

The SolrPlugin.jar contains the classes UnicodeNormalizationFilter and 
UnicodeNormalizationFilterFactory which merely invokes the 
Normalizer.normalize function in the normalizer jar (which is taken from 
the marc4j distribution and which is a subset og the icu4j library)   


-Robert Haschart


Re: SOLR-139 (Support updateable/modifiable documents)

2008-06-24 Thread Norberto Meijome
On Tue, 24 Jun 2008 16:34:44 +0100
Dave Searle <[EMAIL PROTECTED]> wrote:

> I am currently storing the thread id within the message index, however, 
> although this would allow me to sort, it doesn't help with the grouping of 
> threads based on relevancy. See the idea is to index message data in the 
> thread documents and then boost the message mutlivalued field over the thread 
> title and thread description (which in my opinion, would give better results).
> 
> The user, when presented with the thread results, could then drill down into 
> a particular thread's messages using the same search terms on the message 
> index (but filtered by the referring thread id)

It is very very likely that I am just quite late and I'm tired ...but I think 
the approach of having one document per forum message would allow you to 
implement what you want... and, otoh, not too sure the multivalued field would 
work as well.

ie, store the link to the start of the thread and thread subject in all docs, 
as well as  store link to post and text of post (and thread id, etc, as 
needed). boost the content of the posting over the subject.

(but I have a feeling this may not be what you have in mind when you say 
"grouping of threads based on relevancy" .. is it? )

B

_
{Beto|Norberto|Numard} Meijome

"I didn't attend the funeral, but I sent a nice letter saying  I approved of 
it."
  Mark Twain

I speak for myself, not my employer. Contents may be hot. Slippery when wet. 
Reading disclaimers makes you go blind. Writing them is worse. You have been 
Warned.


Re: SOLR-469 - bad patch?

2008-06-24 Thread Shalin Shekhar Mangar
I've just uploaded a new patch which applies cleanly on the trunk. Thanks!

On Tue, Jun 24, 2008 at 7:35 PM, Jon Baer <[EMAIL PROTECTED]> wrote:

> It seems the new patch @ https://issues.apache.org/jira/browse/SOLR-469 is
> x2 the size but turns out the patch itself might be bad?
>
> Ie, it dumps build.xml twice, is it just me?
>
> Thanks.
>
> - Jon
>



-- 
Regards,
Shalin Shekhar Mangar.


RE: never desallocate RAM...during search

2008-06-24 Thread r.nieto
Hi,

I'm having problems with the patch. 
With this schema.xml:
  > 
 
If I send documents with content shorter than 3 characters, I get an exception
during indexing. If I change the maxLength to, for example, 30, the
documents that previously threw the exception are now indexed correctly.

The exception is:

GRAVE: java.lang.StringIndexOutOfBoundsException: String index out of range:
3
at java.lang.String.substring(Unknown Source)
at
org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:262)
at
org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProc
essorFactory.java:66)
at
org.apache.solr.handler.XmlUpdateRequestHandler.processUpdate(XmlUpdateReque
stHandler.java:196)
at
org.apache.solr.handler.XmlUpdateRequestHandler.handleRequestBody(XmlUpdateR
equestHandler.java:123)
at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.
java:125)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:965)
at
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:3
38)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:
272)
at
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(Application
FilterChain.java:235)
at
org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterCh
ain.java:206)
at
org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.ja
va:233)
at
org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.ja
va:175)
at
org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128
)
at
org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102
)
at
org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java
:109)
at
org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:263)
at
org.apache.coyote.http11.Http11AprProcessor.process(Http11AprProcessor.java:
852)
at
org.apache.coyote.http11.Http11AprProtocol$Http11ConnectionHandler.process(H
ttp11AprProtocol.java:584)
at
org.apache.tomcat.util.net.AprEndpoint$Worker.run(AprEndpoint.java:1508)
at java.lang.Thread.run(Unknown Source)



I hope this help.

Thanks.

Rober.
-Mensaje original-
De: Erik Hatcher [mailto:[EMAIL PROTECTED] 
Enviado el: lunes, 23 de junio de 2008 20:49
Para: solr-user@lucene.apache.org
Asunto: Re: never desallocate RAM...during search


On Jun 23, 2008, at 8:16 AM, <[EMAIL PROTECTED]> wrote:
> I was doing something similar to your solution to have better  
> searching
> times.
> I download you patch but I have a problem in one class. I'm not sure  
> if I'm
> doing something wrong but if I what to compile the proyect I must  
> change in
> IndexSchema:
>
>   //private Similarity similarity;
>
>   AND PUT:
>
>   private SimilarityFactory similarityFactory;
>
> I'm doing something incorrectly or is a little bug?

It's because the patch is out of sync with trunk.   The  
SimilarityFactory was added recently.

Erik



solr-14 help

2008-06-24 Thread Geoffrey Young

hi all :)

last week I reworked an older patch for SOLR-14

  https://issues.apache.org/jira/browse/SOLR-14

this functionality is actually fairly important for our ongoing 
migration to solr, so I'd really love to get SOLR-14 into 1.3.  but 
open-source being what it is, my super-important feature is most 
people's not-so-important feature :)


anyway, I'm not a java programmer at all, so my work is probably 
sub-par.  but it seems like low-hanging fruit for reasonably skilled 
java folks, so if there is anyone out there willing to lend a hand or 
just waiting for a (simple) opportunity to get involved, I'd be much 
appreciative.  otherwise I guess it goes into my bin of locally applied 
patches.


thanks

--Geoff


Re: Attempting dataimport using FileListEntityProcessor

2008-06-24 Thread mike segv

I do want to import all documents.  My understanding of the way things work,
correct me if I'm wrong, is that there can be a certain number of documents
included in a single atomic update.  Instead of having all my 16 Million
documents be part of a single update (that could more easily fail being so
big), I was thinking that it would be better to be able to stipulate how
many docs are part of an update and my 16 Million doc import would consist
of 16M/100 updates.


Shalin Shekhar Mangar wrote:
> 
> Hi Mike,
> 
> Just curious to know the use-case here. Why do you want to limit updates
> to
> 100 instead of importing all documents?
> 
> On Tue, Jun 24, 2008 at 10:23 AM, mike segv <[EMAIL PROTECTED]> wrote:
> 
>>
>> That fixed it.
>>
>> If I'm inserting millions of documents, how do I control docs/update? 
>> E.g.
>> if there are 50K docs per file, I'm thinking that I should probably code
>> up
>> my own DataSource that allows me to stipulate docs/update.  Like say, 100
>> instead of 50K.  Does this make sense?
>>
>> Mike
>>
>>
>> Noble Paul നോബിള്‍ नोब्ळ् wrote:
>> >
>> > hi ,
>> > You have not registered any datasources . the second entity needs a
>> > datasource.
>> > Remove the dataSource="null"  and add a name for the second entity
>> > (good practice). No need for baseDir attribute for second entity .
>> > See the modified xml added below
>> > --Noble
>> >
>> > 
>> > 
>> > 
>> > > > newerThan="'NOW-10DAYS'" recursive="true" rootEntity="false"
>> > dataSource="null"  baseDir="/san/tomcat-services/solr-medline">
>> >  > > forEach="/MedlineCitation"
>> > url="${f.fileAbsolutePath}" >
>> > 
>> >  
>> > 
>> > 
>> > 
>> >
>> > On Tue, Jun 24, 2008 at 6:39 AM, mike segv <[EMAIL PROTECTED]> wrote:
>> >>
>> >> I'm trying to use the fileListEntityProcessor to add some xml
>> documents
>> >> to a
>> >> solr index.  I'm running a nightly version of solr-1.3 with SOLR-469
>> and
>> >> SOLR-563.  I've been able to successfuly run the slashdot
>> httpDataSource
>> >> example.  My data-config.xml file loads without errors.  When I
>> attempt
>> >> the
>> >> full-import command I get the exception below.  Thanks for any help.
>> >>
>> >> Mike
>> >>
>> >> WARNING: No lockType configured for
>> >> /san/tomcat-services/solr-medline/solr/data/index/ assuming 'simple'
>> >> Jun 23, 2008 7:59:49 PM
>> org.apache.solr.handler.dataimport.DataImporter
>> >> doFullImport
>> >> SEVERE: Full Import failed
>> >> java.lang.RuntimeException: java.lang.NullPointerException
>> >>at
>> >>
>> org.apache.solr.handler.dataimport.XPathRecordReader.streamRecords(XPathRecordReader.java:97)
>> >>at
>> >>
>> org.apache.solr.handler.dataimport.XPathEntityProcessor.initQuery(XPathEntityProcessor.java:212)
>> >>at
>> >>
>> org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow(XPathEntityProcessor.java:166)
>> >>at
>> >>
>> org.apache.solr.handler.dataimport.XPathEntityProcessor.nextRow(XPathEntityProcessor.java:149)
>> >>at
>> >>
>> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:286)
>> >>at
>> >>
>> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:312)
>> >>at
>> >>
>> org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:179)
>> >>at
>> >>
>> org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:140)
>> >>at
>> >>
>> org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:335)
>> >>at
>> >>
>> org.apache.solr.handler.dataimport.DataImporter.rumCmd(DataImporter.java:386)
>> >>at
>> >>
>> org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:377)
>> >> Caused by: java.lang.NullPointerException
>> >>at java.io.Reader.(Reader.java:61)
>> >>at java.io.BufferedReader.(BufferedReader.java:76)
>> >>at
>> com.bea.xml.stream.MXParser.checkForXMLDecl(MXParser.java:775)
>> >>at com.bea.xml.stream.MXParser.setInput(MXParser.java:806)
>> >>at
>> >>
>> com.bea.xml.stream.MXParserFactory.createXMLStreamReader(MXParserFactory.java:261)
>> >>at
>> >>
>> org.apache.solr.handler.dataimport.XPathRecordReader.streamRecords(XPathRecordReader.java:93)
>> >>... 10 more
>> >>
>> >> Here is my data-config:
>> >>
>> >> 
>> >> 
>> >> > >> newerThan="'NOW-10DAYS'" recursive="true" rootEntity="false"
>> >> dataSource="null" baseDi
>> >> r="/san/tomcat-services/solr-medline">
>> >>  > >> url="${f.fileAbsolutePath}" dataSource="null">
>> >> 
>> >>  
>> >> 
>> >> 
>> >> 
>> >>
>> >> And a snippet from an xml file:
>> >> 
>> >> 12236137
>> >> 
>> >> 1980
>> >> 01
>> >> 03
>> >> 
>> >>
>> >>
>> >> --
>> >> View this message in context:
>> >>
>> http://www.nabble.com/Attempting-dataimport-using-FileListEntityProcessor-tp18081671p18081671.html
>> >> Sent from the Solr - User mailing list archive at Nabble.com.
>> >>
>> >>
>> >
>> >
>> >
>> > --
>> > --Noble Paul
>> >
>> >
>>
>> --
>> View

Re: Attempting dataimport using FileListEntityProcessor

2008-06-24 Thread Shalin Shekhar Mangar
Ok, I got your point.

DataImportHandler currently creates documents and adds them one-by-one to
Solr. A commit/optimize is called once after all documents are finished. If
a document fails to add due to any exception then the import fails.

You can still achieve the functionality you want by setting maxDocs under
the autoCommit section in solrconfig.xml
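
e.g. in solrconfig.xml (100 here is just the figure from this thread):

  <updateHandler class="solr.DirectUpdateHandler2">
    <autoCommit>
      <maxDocs>100</maxDocs>
      <!-- <maxTime>60000</maxTime> commit at most every 60s instead/as well -->
    </autoCommit>
  </updateHandler>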

On Tue, Jun 24, 2008 at 11:01 PM, mike segv <[EMAIL PROTECTED]> wrote:

>
> I do want to import all documents.  My understanding of the way things
> work,
> correct me if I'm wrong, is that there can be a certain number of documents
> included in a single atomic update.  Instead of having all my 16 Million
> documents be part of a single update (that could more easily fail being so
> big), I was thinking that it would be better to be able to stipulate how
> many docs are part of an update and my 16 Million doc import would consist
> of 16M/100 updates.
>
>
> Shalin Shekhar Mangar wrote:
> >
> > Hi Mike,
> >
> > Just curious to know the use-case here. Why do you want to limit updates
> > to
> > 100 instead of importing all documents?
> >
> > On Tue, Jun 24, 2008 at 10:23 AM, mike segv <[EMAIL PROTECTED]> wrote:
> >
> >>
> >> That fixed it.
> >>
> >> If I'm inserting millions of documents, how do I control docs/update?
> >> E.g.
> >> if there are 50K docs per file, I'm thinking that I should probably code
> >> up
> >> my own DataSource that allows me to stipulate docs/update.  Like say,
> 100
> >> instead of 50K.  Does this make sense?
> >>
> >> Mike
> >>
> >>
> >> Noble Paul നോബിള്‍ नोब्ळ् wrote:
> >> >
> >> > hi ,
> >> > You have not registered any datasources . the second entity needs a
> >> > datasource.
> >> > Remove the dataSource="null"  and add a name for the second entity
> >> > (good practice). No need for baseDir attribute for second entity .
> >> > See the modified xml added below
> >> > --Noble
> >> >
> >> > 
> >> > 
> >> > 
> >> >  >> > newerThan="'NOW-10DAYS'" recursive="true" rootEntity="false"
> >> > dataSource="null"  baseDir="/san/tomcat-services/solr-medline">
> >> >   >> > forEach="/MedlineCitation"
> >> > url="${f.fileAbsolutePath}" >
> >> > 
> >> >  
> >> > 
> >> > 
> >> > 
> >> >
> >> > On Tue, Jun 24, 2008 at 6:39 AM, mike segv <[EMAIL PROTECTED]> wrote:
> >> >>
> >> >> I'm trying to use the fileListEntityProcessor to add some xml
> >> documents
> >> >> to a
> >> >> solr index.  I'm running a nightly version of solr-1.3 with SOLR-469
> >> and
> >> >> SOLR-563.  I've been able to successfuly run the slashdot
> >> httpDataSource
> >> >> example.  My data-config.xml file loads without errors.  When I
> >> attempt
> >> >> the
> >> >> full-import command I get the exception below.  Thanks for any help.
> >> >>
> >> >> Mike
> >> >>
> >> >> WARNING: No lockType configured for
> >> >> /san/tomcat-services/solr-medline/solr/data/index/ assuming 'simple'
> >> >> Jun 23, 2008 7:59:49 PM
> >> org.apache.solr.handler.dataimport.DataImporter
> >> >> doFullImport
> >> >> SEVERE: Full Import failed
> >> >> java.lang.RuntimeException: java.lang.NullPointerException
> >> >>at
> >> >>
> >>
> org.apache.solr.handler.dataimport.XPathRecordReader.streamRecords(XPathRecordReader.java:97)
> >> >>at
> >> >>
> >>
> org.apache.solr.handler.dataimport.XPathEntityProcessor.initQuery(XPathEntityProcessor.java:212)
> >> >>at
> >> >>
> >>
> org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow(XPathEntityProcessor.java:166)
> >> >>at
> >> >>
> >>
> org.apache.solr.handler.dataimport.XPathEntityProcessor.nextRow(XPathEntityProcessor.java:149)
> >> >>at
> >> >>
> >>
> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:286)
> >> >>at
> >> >>
> >>
> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:312)
> >> >>at
> >> >>
> >>
> org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:179)
> >> >>at
> >> >>
> >>
> org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:140)
> >> >>at
> >> >>
> >>
> org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:335)
> >> >>at
> >> >>
> >>
> org.apache.solr.handler.dataimport.DataImporter.rumCmd(DataImporter.java:386)
> >> >>at
> >> >>
> >>
> org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:377)
> >> >> Caused by: java.lang.NullPointerException
> >> >>at java.io.Reader.(Reader.java:61)
> >> >>at java.io.BufferedReader.(BufferedReader.java:76)
> >> >>at
> >> com.bea.xml.stream.MXParser.checkForXMLDecl(MXParser.java:775)
> >> >>at com.bea.xml.stream.MXParser.setInput(MXParser.java:806)
> >> >>at
> >> >>
> >>
> com.bea.xml.stream.MXParserFactory.createXMLStreamReader(MXParserFactory.java:261)
> >> >>at
> >> >>
> >>
> org.apache.solr.handler.dataimport.XPathRecordReader.streamRecords(XPathRecordReader.java:93)
> >> >>... 10 more
> >> >>
> >> >> He

Nutch <-> Solr latest?

2008-06-24 Thread Jon Baer

Hi,

I'm curious, is there a spot / patch for the latest on Nutch / Solr 
integration? I've found a few pages (a few outdated it seems). It would 
be nice (?) if it worked as a DataSource type for DataImportHandler, 
but I'm not sure if that fits with how it works. Either way, a nice contrib 
patch the way the DIH is already set up would be nice to have.


Is there currently work ongoing on this? Seems like it belongs in 
either one project or the other, not both.


Thanks.

- Jon


Re: SpellCheckComponent: No file-based suggestions + Location issue

2008-06-24 Thread Ronald K. Braun
Shalin:

> The index directory location is being created inside the current working
> directory. We should change that. I've opened SOLR-604 and attached a patch
> which fixes this.

I updated from nightly build to incorporate your fix and it works
perfectly, now building the spell indexes in solr/data.  Thanks!

Grant:

> What happens when you open the built index in Luke 
> (http://www.getopt.org/luke)?

Hmm, it looks a bit spacey -- I see the n-grams (n=3,4) but the text
looks interspersed with spaces.  Perhaps this is an artifact of Luke
or n-grams are supposed to be this way, but that would obviously seem
problematic.  Here are some snips:

 " h i s t o r y "
 " p i z z a "
 "i z"
 " i "

> Did you see any exceptions in your log?

Just a warning which I've ignored based on the discussions in SOLR-572:

WARNING: No fieldType: null found for dictionary: external.  Using
WhitespaceAnalzyer.

Oddly, even if I specify a fieldType with a legitimate field type
(e.g., spell) from my schema.xml, this same warning is thrown, so I
assume the parameter is functionless.

WARNING: No fieldType: spell found for dictionary: external.  Using
WhitespaceAnalzyer.










Ron


Re: Wildcard search question

2008-06-24 Thread Jon Drukman

Norberto Meijome wrote:
ok well let's say that i can live without john/jon in the short term. 
what i really need today is a case insensitive wildcard search with 
literal matching (no fancy stemming.  bobby is bobby, not bobbi.)


what are my options?

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters

define your own type (or modify text / string... but I find that it gets 
confusing to have variations of text / string ...) to perform the operations on 
the content as needed.

There are also other tokenizer/analysers available that *may* help in the 
partial searches (ngram , edgengram ), but there isn't much documentation on 
them yet (that I could find) - I am only getting into them myself i'll see 
how it goes..


thanks, that got me on the right track.  i came up with this:


  


  
  


  


now searching for user_name:bobby* works as i wanted.
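
A minimal sketch of a fieldtype that behaves this way -- whitespace tokenizing 
plus lowercasing, no stemming; the name and exact definition are illustrative, 
not necessarily what was posted above:

  <fieldType name="text_literal" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>  <!-- keep tokens literal, no stemming -->
      <filter class="solr.LowerCaseFilterFactory"/>         <!-- case-insensitive matching -->
    </analyzer>
  </fieldType>

(One caveat worth remembering: wildcard terms are not analyzed at query time, 
so the application should lowercase the term before appending the *.)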

my next question: is there a way that i can score matches that are at 
the start of the string higher than matches in the middle?  for example, 
if i search for steve, i get kelly stevenson before steve jobs.  i'd 
like steve jobs to come first.


-jsd-



Re: How to use SOLR1.2

2008-06-24 Thread Chris Hostetter

:  I am new in SOLR 1.2, configured Admin GUI. Facing problem in using 
: this. could you pls help me out to configure the nex.


the admin GUI isn't really a place where you configure Solr.  It's a way 
to see the status of things -- configuration is done via config files.

have you gone through the tutorial?

http://lucene.apache.org/solr/tutorial.html

if you have some specific questions about problems you are having, please 
post more detailed questions.


-Hoss



Re: UnicodeNormalizationFilterFactory

2008-06-24 Thread Chris Hostetter

: I've seen mention of these filters:
: 
:  
:  

Are you asking because you saw these in Robert Haschart's reply to your 
previous question?  I think those are custom Filters that he has in his 
project ... not open source (but i may be wrong)

they are certainly not something that comes out of the box w/ Solr.


-Hoss



Can I add field compression without reindexing?

2008-06-24 Thread Chris Harris
I have an index that I eventually want to rebuild so I can set
compressed=true on a couple of fields. It's not really practical to rebuild
the whole thing right now, though. If I change my schema.xml to set
compressed=true and then keep adding new data to the existing index, will
this corrupt the index, or will the *new* data be stored in compressed
format, even while the old data is not compressed?


Re: Can I specify the default operator at query time ?

2008-06-24 Thread Chris Hostetter

: Subject: Can I specify the default operator at query time ?
: In-Reply-To: <[EMAIL PROTECTED]>

http://people.apache.org/~hossman/#threadhijack
Thread Hijacking on Mailing Lists

When starting a new discussion on a mailing list, please do not reply to 
an existing message, instead start a fresh email.  Even if you change the 
subject line of your email, other mail headers still track which thread 
you replied to and your question is "hidden" in that thread and gets less 
attention.   It makes following discussions in the mailing list archives 
particularly difficult.
See Also:  http://en.wikipedia.org/wiki/Thread_hijacking




-Hoss



DataImportHandler running out of memory

2008-06-24 Thread wojtekpia

I'm trying to load ~10 million records into Solr using the DataImportHandler.
I'm running out of memory (java.lang.OutOfMemoryError: Java heap space) as
soon as I try loading more than about 5 million records.

Here's my configuration:
I'm connecting to a SQL Server database using the sqljdbc driver. I've given
my Solr instance 1.5 GB of memory. I have set the dataSource batchSize to
10000. My SQL query is "select top XXX field1, ... from table1". I have
about 40 fields in my Solr schema.

I thought the DataImportHandler would stream data from the DB rather than
loading it all into memory at once. Is that not the case? Any thoughts on
how to get around this (aside from getting a machine with more memory)? 

-- 
View this message in context: 
http://www.nabble.com/DataImportHandler-running-out-of-memory-tp18102644p18102644.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: DataImportHandler running out of memory

2008-06-24 Thread Grant Ingersoll
This is a bug in MySQL.  Try setting the fetch size on the Statement for  
the connection to Integer.MIN_VALUE.


See http://forums.mysql.com/read.php?39,137457 amongst a host of other  
discussions on the subject.  Basically, it tries to load all the rows  
into memory, the only alternative is to set the fetch size to  
Integer.MIN_VALUE so that it gets it one row at a time.  I've hit this  
one myself and it isn't caused by the DataImportHandler, but by the  
MySQL JDBC handler.
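
The relevant JDBC incantation looks like this (a sketch, assuming an already 
open java.sql.Connection named conn; the table and columns are placeholders):

  import java.sql.*;

  // MySQL Connector/J only streams rows one at a time with exactly this combination;
  // any other fetch size makes the driver buffer the entire result set in memory.
  Statement stmt = conn.createStatement(ResultSet.TYPE_FORWARD_ONLY,
                                        ResultSet.CONCUR_READ_ONLY);
  stmt.setFetchSize(Integer.MIN_VALUE);
  ResultSet rs = stmt.executeQuery("SELECT id, title FROM my_table");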


-Grant



On Jun 24, 2008, at 8:23 PM, wojtekpia wrote:



I'm trying to load ~10 million records into Solr using the  
DataImportHandler.
I'm running out of memory (java.lang.OutOfMemoryError: Java heap  
space) as

soon as I try loading more than about 5 million records.

Here's my configuration:
I'm connecting to a SQL Server database using the sqljdbc driver.  
I've given
my Solr instance 1.5 GB of memory. I have set the dataSource  
batchSize to
1. My SQL query is "select top XXX field1, ... from table1". I  
have

about 40 fields in my Solr schema.

I thought the DataImportHandler would stream data from the DB rather  
than
loading it all into memory at once. Is that not the case? Any  
thoughts on
how to get around this (aside from getting a machine with more  
memory)?


--
View this message in context: 
http://www.nabble.com/DataImportHandler-running-out-of-memory-tp18102644p18102644.html
Sent from the Solr - User mailing list archive at Nabble.com.



--
Grant Ingersoll
http://www.lucidimagination.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ









How to debug ?

2008-06-24 Thread Norberto Meijome
hi,
I'm trying to understand why a search on a field tokenized with the nGram
tokenizer, with minGramSize=n and maxGramSize=m doesn't find any matches for
queries of length (in characters) of n+1..m (n works fine).

analysis.jsp shows that it SHOULD match, but /select doesn't bring anything
back. (For details on this queries, please see my previous post over the last
day or so to this list).

So i figure there is some difference between what analysis.jsp does and the
actual search executed , or what lucene indexes - i imagine analysis.jsp only
parses the input in the page with solr's tokenizers/filters but doesn't
actually do lucene's part of the job.

And I'd like to look into this... What is the suggested approach? Attach a
debugger to Jetty's web app? Are there some pointers on how to debug
at this level? Preferably in Eclipse, but beggars can't be choosers ;)
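
A sketch of what that looks like with the bundled Jetty example (assuming it is
started with "java -jar start.jar"): launch the JVM with the standard
remote-debugging flags,

  java -Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=n,address=8787 -jar start.jar

then create a "Remote Java Application" debug configuration in Eclipse pointing
at localhost:8787 and set breakpoints in the Solr/Lucene source.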

thanks!!
B
_
{Beto|Norberto|Numard} Meijome

"Always do right.  This will gratify some and astonish the rest."
  Mark Twain

I speak for myself, not my employer. Contents may be hot. Slippery when wet.
Reading disclaimers makes you go blind. Writing them is worse. You have been
Warned.


Re: How to debug ?

2008-06-24 Thread Ryan McKinley

also, check the LukeRequestHandler

if there is a document you think *should* match, you can see what  
tokens it has actually indexed...
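
For example (assuming the stock /admin/luke mapping from the example 
solrconfig.xml; artist_ngram and the _test_ core are from the earlier mails, 
and YOUR_UNIQUE_KEY is a placeholder):

  http://localhost:8983/solr/_test_/admin/luke?fl=artist_ngram&numTerms=50
  http://localhost:8983/solr/_test_/admin/luke?id=YOUR_UNIQUE_KEY

The first shows the top indexed terms for the field, the second the indexed 
and stored view of a single document.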



On Jun 24, 2008, at 7:12 PM, Norberto Meijome wrote:

hi,
I'm trying to understand why a search on a field tokenized with the  
nGram
tokenizer, with minGramSize=n and maxGramSize=m doesn't find any  
matches for

queries of length (in characters) of n+1..m (n works fine).

analysis.jsp shows that it SHOULD match, but /select doesn't bring  
anything
back. (For details on this queries, please see my previous post over  
the last

day or so to this list).

So i figure there is some difference between what analysis.jsp does  
and the
actual search executed , or what lucene indexes - i imagine  
analysis.jsp only
parses the input in the page with solr's tokenizers/filters but  
doesn't

actually do lucene's part of the job.

And I'd like to look into this... What is the suggested approach for  
this?
attach a debugger to jetty's web app ? Are there some pointers on  
how to debug

at this level? Preferably in Eclipse, but beggars cant be choosers ;)

thanks!!
B
_
{Beto|Norberto|Numard} Meijome

"Always do right.  This will gratify some and astonish the rest."
 Mark Twain

I speak for myself, not my employer. Contents may be hot. Slippery  
when wet.
Reading disclaimers makes you go blind. Writing them is worse. You  
have been

Warned.




Re: DataImportHandler running out of memory

2008-06-24 Thread Shalin Shekhar Mangar
Setting the batchSize to 10000 would mean that the Jdbc driver will keep
10000 rows in memory *for each entity* which uses that data source (if
correctly implemented by the driver). Not sure how well the Sql Server
driver implements this. Also keep in mind that Solr also needs memory to
index documents. You can probably try setting the batch size to a lower
value.

The regular memory tuning stuff should apply here too -- try disabling
autoCommit and turning off autowarming, and see if it helps.
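
For illustration, a smaller batch size in data-config.xml and unwarmed caches
in solrconfig.xml would look roughly like this (the driver class, URL and
credentials below are just placeholders for your SQL Server setup):

  <dataSource type="JdbcDataSource"
              driver="com.microsoft.sqlserver.jdbc.SQLServerDriver"
              url="jdbc:sqlserver://localhost;databaseName=mydb"
              user="user" password="pass"
              batchSize="500"/>

  <filterCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0"/>
  <queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0"/>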

On Wed, Jun 25, 2008 at 5:53 AM, wojtekpia <[EMAIL PROTECTED]> wrote:

>
> I'm trying to load ~10 million records into Solr using the
> DataImportHandler.
> I'm running out of memory (java.lang.OutOfMemoryError: Java heap space) as
> soon as I try loading more than about 5 million records.
>
> Here's my configuration:
> I'm connecting to a SQL Server database using the sqljdbc driver. I've
> given
> my Solr instance 1.5 GB of memory. I have set the dataSource batchSize to
> 1. My SQL query is "select top XXX field1, ... from table1". I have
> about 40 fields in my Solr schema.
>
> I thought the DataImportHandler would stream data from the DB rather than
> loading it all into memory at once. Is that not the case? Any thoughts on
> how to get around this (aside from getting a machine with more memory)?
>
> --
> View this message in context:
> http://www.nabble.com/DataImportHandler-running-out-of-memory-tp18102644p18102644.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
>


-- 
Regards,
Shalin Shekhar Mangar.


Re: How to debug ?

2008-06-24 Thread Norberto Meijome
On Tue, 24 Jun 2008 19:17:58 -0700
Ryan McKinley <[EMAIL PROTECTED]> wrote:

> also, check the LukeRequestHandler
> 
> if there is a document you think *should* match, you can see what  
> tokens it has actually indexed...

right, I will look into that a bit more. 

I am actually using the lukeall.jar (0.8.1, linked against Lucene 2.4) to look
into what got indexed, but I am a bit wary of how what I select in the
'analyzer' drop-down option in Luke actually affects what I see.

B

_
{Beto|Norberto|Numard} Meijome

"Web2.0 is outsourced R&D from Web1.0 companies."
   The Reverend

I speak for myself, not my employer. Contents may be hot. Slippery when wet.
Reading disclaimers makes you go blind. Writing them is worse. You have been
Warned.


RE: UnicodeNormalizationFilterFactory

2008-06-24 Thread Lance Norskog
ISOLatin1AccentFilterFactory works quite well for us. It solves our basic
euro-text keyboard searching problem, where "protege" should find "protégé"
(i.e. "protege" with two accents).

-Original Message-
From: Chris Hostetter [mailto:[EMAIL PROTECTED]
Sent: Tuesday, June 24, 2008 4:05 PM
To: solr-user@lucene.apache.org
Subject: Re: UnicodeNormalizationFilterFactory


: I've seen mention of these filters:
:
:  
:  

Are you asking because you saw these in Robert Haschart's reply to your
previous question?  I think those are custom Filters that he has in his
project ... not open source (but i may be wrong)

they are certainly not something that comes out of the box w/ Solr.


-Hoss






Re: DataImportHandler running out of memory

2008-06-24 Thread Noble Paul നോബിള്‍ नोब्ळ्
DIH streams rows one by one.
Set fetchSize="-1"; this might help. It may make the indexing a bit
slower, but memory consumption would be low.
The memory is consumed by the JDBC driver. Try tuning the -Xmx value for the VM.
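For example, with the example Jetty setup, something like

  java -Xmx2048m -jar start.jar

(the exact value depends on what your machine can spare) gives the VM more heap.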
--Noble

On Wed, Jun 25, 2008 at 8:05 AM, Shalin Shekhar Mangar
<[EMAIL PROTECTED]> wrote:
> Setting the batchSize to 1 would mean that the Jdbc driver will keep
> 1 rows in memory *for each entity* which uses that data source (if
> correctly implemented by the driver). Not sure how well the Sql Server
> driver implements this. Also keep in mind that Solr also needs memory to
> index documents. You can probably try setting the batch size to a lower
> value.
>
> The regular memory tuning stuff should apply here too -- try disabling
> autoCommit and turn-off autowarming and see if it helps.
>
> On Wed, Jun 25, 2008 at 5:53 AM, wojtekpia <[EMAIL PROTECTED]> wrote:
>
>>
>> I'm trying to load ~10 million records into Solr using the
>> DataImportHandler.
>> I'm running out of memory (java.lang.OutOfMemoryError: Java heap space) as
>> soon as I try loading more than about 5 million records.
>>
>> Here's my configuration:
>> I'm connecting to a SQL Server database using the sqljdbc driver. I've
>> given
>> my Solr instance 1.5 GB of memory. I have set the dataSource batchSize to
>> 1. My SQL query is "select top XXX field1, ... from table1". I have
>> about 40 fields in my Solr schema.
>>
>> I thought the DataImportHandler would stream data from the DB rather than
>> loading it all into memory at once. Is that not the case? Any thoughts on
>> how to get around this (aside from getting a machine with more memory)?
>>
>> --
>> View this message in context:
>> http://www.nabble.com/DataImportHandler-running-out-of-memory-tp18102644p18102644.html
>> Sent from the Solr - User mailing list archive at Nabble.com.
>>
>>
>
>
> --
> Regards,
> Shalin Shekhar Mangar.
>



-- 
--Noble Paul


Re: DataImportHandler running out of memory

2008-06-24 Thread Noble Paul നോബിള്‍ नोब्ळ्
it is batchSize="-1" not fetchSize. Or keep it to a very small value.
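i.e. in data-config.xml, something along these lines (everything except
batchSize is a placeholder for your own setup):

  <dataSource type="JdbcDataSource" driver="..." url="..." user="..." password="..." batchSize="-1"/>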
--Noble

On Wed, Jun 25, 2008 at 9:31 AM, Noble Paul നോബിള്‍ नोब्ळ्
<[EMAIL PROTECTED]> wrote:
> DIH streams rows one by one.
> set the fetchSize="-1" this might help. It may make the indexing a bit
> slower but memory consumption would be low.
> The memory is consumed by the jdbc driver. try tuning the -Xmx value for the 
> VM
> --Noble
>
> On Wed, Jun 25, 2008 at 8:05 AM, Shalin Shekhar Mangar
> <[EMAIL PROTECTED]> wrote:
>> Setting the batchSize to 1 would mean that the Jdbc driver will keep
>> 1 rows in memory *for each entity* which uses that data source (if
>> correctly implemented by the driver). Not sure how well the Sql Server
>> driver implements this. Also keep in mind that Solr also needs memory to
>> index documents. You can probably try setting the batch size to a lower
>> value.
>>
>> The regular memory tuning stuff should apply here too -- try disabling
>> autoCommit and turn-off autowarming and see if it helps.
>>
>> On Wed, Jun 25, 2008 at 5:53 AM, wojtekpia <[EMAIL PROTECTED]> wrote:
>>
>>>
>>> I'm trying to load ~10 million records into Solr using the
>>> DataImportHandler.
>>> I'm running out of memory (java.lang.OutOfMemoryError: Java heap space) as
>>> soon as I try loading more than about 5 million records.
>>>
>>> Here's my configuration:
>>> I'm connecting to a SQL Server database using the sqljdbc driver. I've
>>> given
>>> my Solr instance 1.5 GB of memory. I have set the dataSource batchSize to
>>> 1. My SQL query is "select top XXX field1, ... from table1". I have
>>> about 40 fields in my Solr schema.
>>>
>>> I thought the DataImportHandler would stream data from the DB rather than
>>> loading it all into memory at once. Is that not the case? Any thoughts on
>>> how to get around this (aside from getting a machine with more memory)?
>>>
>>> --
>>> View this message in context:
>>> http://www.nabble.com/DataImportHandler-running-out-of-memory-tp18102644p18102644.html
>>> Sent from the Solr - User mailing list archive at Nabble.com.
>>>
>>>
>>
>>
>> --
>> Regards,
>> Shalin Shekhar Mangar.
>>
>
>
>
> --
> --Noble Paul
>



-- 
--Noble Paul


Re: How to debug ?

2008-06-24 Thread Brian Carmalt
Hello Beto,

There is a plugin for Jetty: http://webtide.com/eclipse. Add this as
an update site and let Eclipse install the plugin for you. You can then
start the Jetty server from Eclipse and debug it.
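
If you'd rather keep launching the example Jetty from the command line,
another option is to start it with the standard JPDA flags and attach
Eclipse's remote debugger, e.g.:

  java -Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=n,address=8000 -jar start.jar

and then create a "Remote Java Application" debug configuration in Eclipse
pointing at localhost:8000.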

Brian. 

Am Mittwoch, den 25.06.2008, 12:48 +1000 schrieb Norberto Meijome:
> On Tue, 24 Jun 2008 19:17:58 -0700
> Ryan McKinley <[EMAIL PROTECTED]> wrote:
> 
> > also, check the LukeRequestHandler
> > 
> > if there is a document you think *should* match, you can see what  
> > tokens it has actually indexed...
> 
> right, I will look into that a bit more. 
> 
> I am actually using the lukeall.jar (0.8.1, linked against lucene 2.4) to look
> into what got indexed, but I am bit wary of how what I select in the the
> 'analyzer' drop down option in Luke actually affects what I see.
> 
> B
> 
> _
> {Beto|Norberto|Numard} Meijome
> 
> "Web2.0 is outsourced R&D from Web1.0 companies."
>The Reverend
> 
> I speak for myself, not my employer. Contents may be hot. Slippery when wet.
> Reading disclaimers makes you go blind. Writing them is worse. You have been
> Warned.



Re: How to debug ?

2008-06-24 Thread Norberto Meijome
On Wed, 25 Jun 2008 08:37:35 +0200
Brian Carmalt <[EMAIL PROTECTED]> wrote:

> There is a plugin for Jetty: http://webtide.com/eclipse. Add this as
> an update site and let Eclipse install the plugin for you. You can then
> start the Jetty server from Eclipse and debug it.

Thanks Brian, good information :)

B

_
{Beto|Norberto|Numard} Meijome

Q. How do you make God laugh?
A. Tell him your plans.

I speak for myself, not my employer. Contents may be hot. Slippery when wet. 
Reading disclaimers makes you go blind. Writing them is worse. You have been 
Warned.