Solr spell-checker

2008-06-25 Thread dudes dudes

Hello all, 

I have Solr 1.2 installed and I was wondering how Solr 1.2 deals with checking
misspelled strings and also how to configure it?
I'd appreciate any docs on this topic.

thanks a lot
ak

Re: Solr spell-checker

2008-06-25 Thread Shalin Shekhar Mangar
Take a look at http://wiki.apache.org/solr/SpellCheckerRequestHandler

If you can use a nightly build of Solr 1.3 then you can use the new and
better http://wiki.apache.org/solr/SpellCheckComponent
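
For what it's worth, a minimal sketch of exercising the 1.2 handler over plain
HTTP (not from the original thread). The handler name and the parameters used
here (qt=spellchecker, cmd=rebuild, q, suggestionCount) are assumptions based
on the example solrconfig of that era -- check the wiki pages above for the
authoritative names.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLEncoder;

public class SpellCheckProbe {
    public static void main(String[] args) throws Exception {
        String word = URLEncoder.encode("misspeled", "UTF-8");
        String[] urls = {
            // one-off build of the spelling index
            "http://localhost:8983/solr/select?qt=spellchecker&cmd=rebuild",
            // ask for up to 5 suggestions for the word
            "http://localhost:8983/solr/select?qt=spellchecker&q=" + word
                + "&suggestionCount=5"
        };
        for (String u : urls) {
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(new URL(u).openStream(), "UTF-8"));
            for (String line; (line = in.readLine()) != null; ) {
                System.out.println(line);   // raw XML response
            }
            in.close();
        }
    }
}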

On Wed, Jun 25, 2008 at 2:36 PM, dudes dudes <[EMAIL PROTECTED]> wrote:

>
> Hello all,
>
> I have Solr 1.2 installed and I was wondering how Solr 1.2 deals with
> checking misspelled strings and also how to configure it?
> I'd appreciate any docs on this topic.
>
> thanks a lot
> ak




-- 
Regards,
Shalin Shekhar Mangar.


Re: DataImportHandler running out of memory

2008-06-25 Thread Grant Ingersoll
I think it's a bit different.  I ran into this exact problem about two
weeks ago on a 13 million record DB.  MySQL doesn't honor the fetch size
for its v5 JDBC driver.


See http://www.databasesandlife.com/reading-row-by-row-into-java-from-mysql/
or do a search for MySQL fetch size.


You actually have to do setFetchSize(Integer.MIN_VALUE) (-1 doesn't work) in
order to get streaming in MySQL.
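
A minimal sketch of that streaming setup (not from the thread; the JDBC URL,
credentials and query are placeholders). MySQL Connector/J only streams when
the statement is forward-only and read-only and the fetch size is
Integer.MIN_VALUE:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class MySqlStreamingExample {
    public static void main(String[] args) throws Exception {
        Class.forName("com.mysql.jdbc.Driver");
        Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost/mydb", "user", "password");
        Statement stmt = conn.createStatement(
                ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY);
        stmt.setFetchSize(Integer.MIN_VALUE);   // -1 is rejected by the driver
        ResultSet rs = stmt.executeQuery("SELECT id, title FROM documents");
        while (rs.next()) {
            // rows are read off the wire one at a time instead of being
            // buffered in memory up front
            System.out.println(rs.getLong("id") + " " + rs.getString("title"));
        }
        rs.close();
        stmt.close();
        conn.close();
    }
}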


-Grant


On Jun 24, 2008, at 10:35 PM, Shalin Shekhar Mangar wrote:

Setting the batchSize to 1 would mean that the Jdbc driver will keep
1 rows in memory *for each entity* which uses that data source (if
correctly implemented by the driver). Not sure how well the Sql Server
driver implements this. Also keep in mind that Solr also needs memory to
index documents. You can probably try setting the batch size to a lower
value.

The regular memory tuning stuff should apply here too -- try disabling
autoCommit and turn-off autowarming and see if it helps.

On Wed, Jun 25, 2008 at 5:53 AM, wojtekpia <[EMAIL PROTECTED]>  
wrote:




I'm trying to load ~10 million records into Solr using the DataImportHandler.
I'm running out of memory (java.lang.OutOfMemoryError: Java heap space) as
soon as I try loading more than about 5 million records.

Here's my configuration:
I'm connecting to a SQL Server database using the sqljdbc driver. I've given
my Solr instance 1.5 GB of memory. I have set the dataSource batchSize to
1. My SQL query is "select top XXX field1, ... from table1". I have
about 40 fields in my Solr schema.

I thought the DataImportHandler would stream data from the DB rather than
loading it all into memory at once. Is that not the case? Any thoughts on
how to get around this (aside from getting a machine with more memory)?


--
View this message in context:
http://www.nabble.com/DataImportHandler-running-out-of-memory-tp18102644p18102644.html
Sent from the Solr - User mailing list archive at Nabble.com.





--
Regards,
Shalin Shekhar Mangar.


--
Grant Ingersoll
http://www.lucidimagination.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ









Re: DataImportHandler running out of memory

2008-06-25 Thread Grant Ingersoll
I'm assuming, of course, that the DIH doesn't automatically modify the  
SQL statement according to the batch size.


-Grant

On Jun 25, 2008, at 7:05 AM, Grant Ingersoll wrote:

I think it's a bit different.  I ran into this exact problem about two
weeks ago on a 13 million record DB.  MySQL doesn't honor the fetch size
for its v5 JDBC driver.


See http://www.databasesandlife.com/reading-row-by-row-into-java-from-mysql/
or do a search for MySQL fetch size.


You actually have to do setFetchSize(Integer.MIN_VALUE) (-1 doesn't work)
in order to get streaming in MySQL.


-Grant


On Jun 24, 2008, at 10:35 PM, Shalin Shekhar Mangar wrote:

Setting the batchSize to 1 would mean that the Jdbc driver will keep
1 rows in memory *for each entity* which uses that data source (if
correctly implemented by the driver). Not sure how well the Sql Server
driver implements this. Also keep in mind that Solr also needs memory to
index documents. You can probably try setting the batch size to a lower
value.

The regular memory tuning stuff should apply here too -- try disabling
autoCommit and turn-off autowarming and see if it helps.

On Wed, Jun 25, 2008 at 5:53 AM, wojtekpia <[EMAIL PROTECTED]>  
wrote:




I'm trying to load ~10 million records into Solr using the DataImportHandler.
I'm running out of memory (java.lang.OutOfMemoryError: Java heap space) as
soon as I try loading more than about 5 million records.

Here's my configuration:
I'm connecting to a SQL Server database using the sqljdbc driver. I've given
my Solr instance 1.5 GB of memory. I have set the dataSource batchSize to
1. My SQL query is "select top XXX field1, ... from table1". I have
about 40 fields in my Solr schema.

I thought the DataImportHandler would stream data from the DB rather than
loading it all into memory at once. Is that not the case? Any thoughts on
how to get around this (aside from getting a machine with more memory)?


--
View this message in context:
http://www.nabble.com/DataImportHandler-running-out-of-memory-tp18102644p18102644.html
Sent from the Solr - User mailing list archive at Nabble.com.





--
Regards,
Shalin Shekhar Mangar.


--
Grant Ingersoll
http://www.lucidimagination.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ









--
Grant Ingersoll
http://www.lucidimagination.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ









RE: Solr spell-checker

2008-06-25 Thread dudes dudes

thanks for your kind reply


> Date: Wed, 25 Jun 2008 14:48:38 +0530
> From: [EMAIL PROTECTED]
> To: solr-user@lucene.apache.org
> Subject: Re: Solr spell-checker
> 
> Take a look at http://wiki.apache.org/solr/SpellCheckerRequestHandler
> 
> If you can use a nightly build of Solr 1.3 then you can use the new and
> better http://wiki.apache.org/solr/SpellCheckComponent
> 
> On Wed, Jun 25, 2008 at 2:36 PM, dudes dudes  wrote:
> 
>>
>> Hello all,
>>
>> I have Solr 1.2 installed and I was wondering how Solr 1.2 deals with
>> checking misspelled strings and also how to configure it?
>> I'd appreciate any docs on this topic.
>>
>> thanks a lot
>> ak
> 
> 
> 
> 
> -- 
> Regards,
> Shalin Shekhar Mangar.


Re: DataImportHandler running out of memory

2008-06-25 Thread Shalin Shekhar Mangar
The OP is actually using Sql Server (not MySql) as per his mail.

On Wed, Jun 25, 2008 at 4:40 PM, Grant Ingersoll <[EMAIL PROTECTED]>
wrote:

> I'm assuming, of course, that the DIH doesn't automatically modify the SQL
> statement according to the batch size.
>
> -Grant
>
>
> On Jun 25, 2008, at 7:05 AM, Grant Ingersoll wrote:
>
>  I think it's a bit different.  I ran into this exact problem about two
>> weeks ago on a 13 million record DB.  MySQL doesn't honor the fetch size for
>> its v5 JDBC driver.
>>
>> See
>> http://www.databasesandlife.com/reading-row-by-row-into-java-from-mysql/ or
>> do a search for MySQL fetch size.
>>
>> You actually have to do setFetchSize(Integer.MIN_VALUE) (-1 doesn't work)
>> in order to get streaming in MySQL.
>>
>> -Grant
>>
>>
>> On Jun 24, 2008, at 10:35 PM, Shalin Shekhar Mangar wrote:
>>
>>  Setting the batchSize to 1 would mean that the Jdbc driver will keep
>>> 1 rows in memory *for each entity* which uses that data source (if
>>> correctly implemented by the driver). Not sure how well the Sql Server
>>> driver implements this. Also keep in mind that Solr also needs memory to
>>> index documents. You can probably try setting the batch size to a lower
>>> value.
>>>
>>> The regular memory tuning stuff should apply here too -- try disabling
>>> autoCommit and turn-off autowarming and see if it helps.
>>>
>>> On Wed, Jun 25, 2008 at 5:53 AM, wojtekpia <[EMAIL PROTECTED]> wrote:
>>>
>>>
 I'm trying to load ~10 million records into Solr using the
 DataImportHandler.
 I'm running out of memory (java.lang.OutOfMemoryError: Java heap space)
 as
 soon as I try loading more than about 5 million records.

 Here's my configuration:
 I'm connecting to a SQL Server database using the sqljdbc driver. I've
 given
 my Solr instance 1.5 GB of memory. I have set the dataSource batchSize
 to
 1. My SQL query is "select top XXX field1, ... from table1". I have
 about 40 fields in my Solr schema.

 I thought the DataImportHandler would stream data from the DB rather
 than
 loading it all into memory at once. Is that not the case? Any thoughts
 on
 how to get around this (aside from getting a machine with more memory)?

 --
 View this message in context:

 http://www.nabble.com/DataImportHandler-running-out-of-memory-tp18102644p18102644.html
 Sent from the Solr - User mailing list archive at Nabble.com.



>>>
>>> --
>>> Regards,
>>> Shalin Shekhar Mangar.
>>>
>>
>> --
>> Grant Ingersoll
>> http://www.lucidimagination.com
>>
>> Lucene Helpful Hints:
>> http://wiki.apache.org/lucene-java/BasicsOfPerformance
>> http://wiki.apache.org/lucene-java/LuceneFAQ
>>
>>
>>
>>
>>
>>
>>
>>
> --
> Grant Ingersoll
> http://www.lucidimagination.com
>
> Lucene Helpful Hints:
> http://wiki.apache.org/lucene-java/BasicsOfPerformance
> http://wiki.apache.org/lucene-java/LuceneFAQ
>
>
>
>
>
>
>
>


-- 
Regards,
Shalin Shekhar Mangar.


Re: DataImportHandler running out of memory

2008-06-25 Thread Noble Paul നോബിള്‍ नोब्ळ्
DIH does not modify SQL. This value is used as a connection property
--Noble

On Wed, Jun 25, 2008 at 4:40 PM, Grant Ingersoll <[EMAIL PROTECTED]> wrote:
> I'm assuming, of course, that the DIH doesn't automatically modify the SQL
> statement according to the batch size.
>
> -Grant
>
> On Jun 25, 2008, at 7:05 AM, Grant Ingersoll wrote:
>
>> I think it's a bit different.  I ran into this exact problem about two
>> weeks ago on a 13 million record DB.  MySQL doesn't honor the fetch size for
>> its v5 JDBC driver.
>>
>> See
>> http://www.databasesandlife.com/reading-row-by-row-into-java-from-mysql/ or
>> do a search for MySQL fetch size.
>>
>> You actually have to do setFetchSize(Integer.MIN_VALUE) (-1 doesn't work)
>> in order to get streaming in MySQL.
>>
>> -Grant
>>
>>
>> On Jun 24, 2008, at 10:35 PM, Shalin Shekhar Mangar wrote:
>>
>>> Setting the batchSize to 1 would mean that the Jdbc driver will keep
>>> 1 rows in memory *for each entity* which uses that data source (if
>>> correctly implemented by the driver). Not sure how well the Sql Server
>>> driver implements this. Also keep in mind that Solr also needs memory to
>>> index documents. You can probably try setting the batch size to a lower
>>> value.
>>>
>>> The regular memory tuning stuff should apply here too -- try disabling
>>> autoCommit and turn-off autowarming and see if it helps.
>>>
>>> On Wed, Jun 25, 2008 at 5:53 AM, wojtekpia <[EMAIL PROTECTED]> wrote:
>>>

 I'm trying to load ~10 million records into Solr using the
 DataImportHandler.
 I'm running out of memory (java.lang.OutOfMemoryError: Java heap space)
 as
 soon as I try loading more than about 5 million records.

 Here's my configuration:
 I'm connecting to a SQL Server database using the sqljdbc driver. I've
 given
 my Solr instance 1.5 GB of memory. I have set the dataSource batchSize
 to
 1. My SQL query is "select top XXX field1, ... from table1". I have
 about 40 fields in my Solr schema.

 I thought the DataImportHandler would stream data from the DB rather
 than
 loading it all into memory at once. Is that not the case? Any thoughts
 on
 how to get around this (aside from getting a machine with more memory)?

 --
 View this message in context:

 http://www.nabble.com/DataImportHandler-running-out-of-memory-tp18102644p18102644.html
 Sent from the Solr - User mailing list archive at Nabble.com.


>>>
>>>
>>> --
>>> Regards,
>>> Shalin Shekhar Mangar.
>>
>> --
>> Grant Ingersoll
>> http://www.lucidimagination.com
>>
>> Lucene Helpful Hints:
>> http://wiki.apache.org/lucene-java/BasicsOfPerformance
>> http://wiki.apache.org/lucene-java/LuceneFAQ
>>
>>
>>
>>
>>
>>
>>
>
> --
> Grant Ingersoll
> http://www.lucidimagination.com
>
> Lucene Helpful Hints:
> http://wiki.apache.org/lucene-java/BasicsOfPerformance
> http://wiki.apache.org/lucene-java/LuceneFAQ
>
>
>
>
>
>
>
>



-- 
--Noble Paul


Re: DataImportHandler running out of memory

2008-06-25 Thread Noble Paul നോബിള്‍ नोब्ळ्
The latest patch sets fetchSize to Integer.MIN_VALUE if -1 is passed.
It is added specifically for the MySQL driver.
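
A hedged sketch of the convention being described (not the actual
DataImportHandler source; the helper class below is made up for illustration):

import java.sql.SQLException;
import java.sql.Statement;

public final class FetchSizeMapping {
    private FetchSizeMapping() {}

    // batchSize="-1" in the dataSource config is translated into the
    // Integer.MIN_VALUE value the MySQL driver needs for streaming;
    // any positive value is passed through as a normal fetch-size hint.
    public static void applyBatchSize(Statement stmt, int batchSize)
            throws SQLException {
        if (batchSize == -1) {
            stmt.setFetchSize(Integer.MIN_VALUE);
        } else if (batchSize > 0) {
            stmt.setFetchSize(batchSize);
        }
    }
}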
--Noble

On Wed, Jun 25, 2008 at 4:35 PM, Grant Ingersoll <[EMAIL PROTECTED]> wrote:
> I think it's a bit different.  I ran into this exact problem about two weeks
> ago on a 13 million record DB.  MySQL doesn't honor the fetch size for its
> v5 JDBC driver.
>
> See
> http://www.databasesandlife.com/reading-row-by-row-into-java-from-mysql/ or
> do a search for MySQL fetch size.
>
> You actually have to do setFetchSize(Integer.MIN_VALUE) (-1 doesn't work) in
> order to get streaming in MySQL.
>
> -Grant
>
>
> On Jun 24, 2008, at 10:35 PM, Shalin Shekhar Mangar wrote:
>
>> Setting the batchSize to 1 would mean that the Jdbc driver will keep
>> 1 rows in memory *for each entity* which uses that data source (if
>> correctly implemented by the driver). Not sure how well the Sql Server
>> driver implements this. Also keep in mind that Solr also needs memory to
>> index documents. You can probably try setting the batch size to a lower
>> value.
>>
>> The regular memory tuning stuff should apply here too -- try disabling
>> autoCommit and turn-off autowarming and see if it helps.
>>
>> On Wed, Jun 25, 2008 at 5:53 AM, wojtekpia <[EMAIL PROTECTED]> wrote:
>>
>>>
>>> I'm trying to load ~10 million records into Solr using the
>>> DataImportHandler.
>>> I'm running out of memory (java.lang.OutOfMemoryError: Java heap space)
>>> as
>>> soon as I try loading more than about 5 million records.
>>>
>>> Here's my configuration:
>>> I'm connecting to a SQL Server database using the sqljdbc driver. I've
>>> given
>>> my Solr instance 1.5 GB of memory. I have set the dataSource batchSize to
>>> 1. My SQL query is "select top XXX field1, ... from table1". I have
>>> about 40 fields in my Solr schema.
>>>
>>> I thought the DataImportHandler would stream data from the DB rather than
>>> loading it all into memory at once. Is that not the case? Any thoughts on
>>> how to get around this (aside from getting a machine with more memory)?
>>>
>>> --
>>> View this message in context:
>>>
>>> http://www.nabble.com/DataImportHandler-running-out-of-memory-tp18102644p18102644.html
>>> Sent from the Solr - User mailing list archive at Nabble.com.
>>>
>>>
>>
>>
>> --
>> Regards,
>> Shalin Shekhar Mangar.
>
> --
> Grant Ingersoll
> http://www.lucidimagination.com
>
> Lucene Helpful Hints:
> http://wiki.apache.org/lucene-java/BasicsOfPerformance
> http://wiki.apache.org/lucene-java/LuceneFAQ


Re: How to debug ?

2008-06-25 Thread Norberto Meijome
On Tue, 24 Jun 2008 19:17:58 -0700
Ryan McKinley <[EMAIL PROTECTED]> wrote:

> also, check the LukeRequestHandler
> 
> if there is a document you think *should* match, you can see what  
> tokens it has actually indexed...
> 

hi Ryan,
I can't see the tokens generated using LukeRequestHandler.

I can get to the document I want : 
http://localhost:8983/solr/_test_/admin/luke/?id=Jay%20Rock

and for the field I am interested in, I get only:
[...]

[response XML stripped by the mail archive -- the field entry shows only:
ngram, ITS--, ITS--, Jay Rock, Jay Rock, 1.0, 0 -- and none of the generated
tokens]

[...]

(all the other fields look pretty much identical; none of them show the
generated tokens).

Using the Luke tool itself (lukeall.jar, source # 0.8.1, linked against
Lucene's 2.4 libs bundled with the nightly build), I see the following tokens
for this document + field:

ja, ay, y ,  r, ro, 
oc, ck, jay, ay , y r, 
 ro, roc, ock, jay , ay r, 
y ro,  roc, rock, jay r, ay ro, 
y roc,  rock, jay ro, ay roc, y rock, 
jay roc, ay rock, jay rock

Which is precisely what I expect, given that my 'ngram' type is defined as:

[fieldType definition stripped by the mail archive]

My question now is: was I supposed to get any more information from
LukeRequestHandler?


Furthermore, if I perform, on this same core with exactly this data:
http://localhost:8983/solr/_test_/select?q=artist_ngram:ro

I get this document returned (and many others).

But if I search for 'roc' instead of 'ro':
http://localhost:8983/solr/_test_/select?q=artist_ngram:roc

[response XML stripped by the mail archive; the header shows status 0,
QTime 48, and the params q=artist_ngram:roc, debugQuery=true; the debug
section shows:]

artist_ngram:roc
artist_ngram:roc
PhraseQuery(artist_ngram:"ro oc roc")
artist_ngram:"ro oc roc"
OldLuceneQParser

[...]

Is searching on nGram-tokenized fields limited to the minGramSize?

Thanks for any pointers you can provide,
B
_
{Beto|Norberto|Numard} Meijome

"I didn't attend the funeral, but I sent a nice letter saying  I approved of 
it."
  Mark Twain

I speak for myself, not my employer. Contents may be hot. Slippery when wet. 
Reading disclaimers makes you go blind. Writing them is worse. You have been 
Warned.


Lucene 2.4-dev source ?

2008-06-25 Thread Norberto Meijome
Hi,
where can I find these sources? I have the binary jars included with the
nightly builds, but I'd like to look at the code of some of the objects.

In particular,
http://svn.apache.org/viewvc/lucene/java/
doesn't have any reference to 2.4, and
http://svn.apache.org/viewvc/lucene/java/trunk/src/java/org/apache/lucene/analysis/
doesn't include org.apache.lucene.analysis.ngram.NGramTokenFilter, which is
one of the things I am after...

thanks!
B
_
{Beto|Norberto|Numard} Meijome

Real Programmers don't comment their code. If it was hard to write, it should 
be hard to understand and even harder to modify.

I speak for myself, not my employer. Contents may be hot. Slippery when wet. 
Reading disclaimers makes you go blind. Writing them is worse. You have been 
Warned.


Re: Lucene 2.4-dev source ?

2008-06-25 Thread Yonik Seeley
trunk is the latest version (which is currently 2.4-dev).
http://svn.apache.org/viewvc/lucene/java/trunk/

There is a contrib directory with things not in lucene-core:
http://svn.apache.org/viewvc/lucene/java/trunk/contrib/

-Yonik


Re: DataImportHandler running out of memory

2008-06-25 Thread wojtekpia

I'm trying with batchSize=-1 now. So far it seems to be working, but very
slowly. I will update when it completes or crashes.

Even with a batchSize of 100 I was running out of memory.

I'm running on a 32-bit Windows machine. I've set the -Xmx to 1.5 GB - I
believe that's the maximum for my environment.

The batchSize parameter doesn't seem to control what happens... when I
select top 5,000,000 with a batchSize of 10,000, it works. When I select top
10,000,000 with the same batchSize, it runs out of memory.

Also, I'm using the 469 patch posted on 2008-06-11 08:41 AM.


Noble Paul നോബിള്‍ नोब्ळ् wrote:
> 
> DIH streams rows one by one.
> set the fetchSize="-1" this might help. It may make the indexing a bit
> slower but memory consumption would be low.
> The memory is consumed by the jdbc driver. try tuning the -Xmx value for
> the VM
> --Noble
> 
> On Wed, Jun 25, 2008 at 8:05 AM, Shalin Shekhar Mangar
> <[EMAIL PROTECTED]> wrote:
>> Setting the batchSize to 1 would mean that the Jdbc driver will keep
>> 1 rows in memory *for each entity* which uses that data source (if
>> correctly implemented by the driver). Not sure how well the Sql Server
>> driver implements this. Also keep in mind that Solr also needs memory to
>> index documents. You can probably try setting the batch size to a lower
>> value.
>>
>> The regular memory tuning stuff should apply here too -- try disabling
>> autoCommit and turn-off autowarming and see if it helps.
>>
>> On Wed, Jun 25, 2008 at 5:53 AM, wojtekpia <[EMAIL PROTECTED]> wrote:
>>
>>>
>>> I'm trying to load ~10 million records into Solr using the
>>> DataImportHandler.
>>> I'm running out of memory (java.lang.OutOfMemoryError: Java heap space)
>>> as
>>> soon as I try loading more than about 5 million records.
>>>
>>> Here's my configuration:
>>> I'm connecting to a SQL Server database using the sqljdbc driver. I've
>>> given
>>> my Solr instance 1.5 GB of memory. I have set the dataSource batchSize
>>> to
>>> 1. My SQL query is "select top XXX field1, ... from table1". I have
>>> about 40 fields in my Solr schema.
>>>
>>> I thought the DataImportHandler would stream data from the DB rather
>>> than
>>> loading it all into memory at once. Is that not the case? Any thoughts
>>> on
>>> how to get around this (aside from getting a machine with more memory)?
>>>
>>> --
>>> View this message in context:
>>> http://www.nabble.com/DataImportHandler-running-out-of-memory-tp18102644p18102644.html
>>> Sent from the Solr - User mailing list archive at Nabble.com.
>>>
>>>
>>
>>
>> --
>> Regards,
>> Shalin Shekhar Mangar.
>>
> 
> 
> 
> -- 
> --Noble Paul
> 
> 

-- 
View this message in context: 
http://www.nabble.com/DataImportHandler-running-out-of-memory-tp18102644p18115900.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: DataImportHandler running out of memory

2008-06-25 Thread Shalin Shekhar Mangar
Hi,

I don't think the problem is within DataImportHandler since it just streams
the resultset. The fetchSize is just passed as a parameter to
Statement#setFetchSize(), and the Jdbc driver is supposed to honor it and
keep only that many rows in memory.

From what I could find about the Sql Server driver -- there's a connection
property called responseBuffering whose default value is "full", which causes
the entire result set to be fetched. See
http://msdn.microsoft.com/en-us/library/ms378988.aspx for more details. You
can set connection properties like this directly in the jdbc url specified
in DataImportHandler's dataSource configuration.
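
A minimal sketch (not from the thread) of that connection property on the
Microsoft JDBC URL; in DataImportHandler the same property string can go
straight into the dataSource url attribute. Host, database and credentials
are placeholders.

import java.sql.Connection;
import java.sql.DriverManager;

public class SqlServerAdaptiveBuffering {
    public static void main(String[] args) throws Exception {
        Class.forName("com.microsoft.sqlserver.jdbc.SQLServerDriver");
        String url = "jdbc:sqlserver://localhost:1433;"
                   + "databaseName=mydb;"
                   + "responseBuffering=adaptive";   // default is "full"
        Connection conn = DriverManager.getConnection(url, "user", "password");
        // ... run the import query here; the driver now fetches rows
        // adaptively instead of buffering the whole result set ...
        conn.close();
    }
}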

On Wed, Jun 25, 2008 at 10:17 PM, wojtekpia <[EMAIL PROTECTED]> wrote:

>
> I'm trying with batchSize=-1 now. So far it seems to be working, but very
> slowly. I will update when it completes or crashes.
>
> Even with a batchSize of 100 I was running out of memory.
>
> I'm running on a 32-bit Windows machine. I've set the -Xmx to 1.5 GB - I
> believe that's the maximum for my environment.
>
> The batchSize parameter doesn't seem to control what happens... when I
> select top 5,000,000 with a batchSize of 10,000, it works. When I select
> top
> 10,000,000 with the same batchSize, it runs out of memory.
>
> Also, I'm using the 469 patch posted on 2008-06-11 08:41 AM.
>
>
> Noble Paul നോബിള്‍ नोब्ळ् wrote:
> >
> > DIH streams rows one by one.
> > set the fetchSize="-1" this might help. It may make the indexing a bit
> > slower but memory consumption would be low.
> > The memory is consumed by the jdbc driver. try tuning the -Xmx value for
> > the VM
> > --Noble
> >
> > On Wed, Jun 25, 2008 at 8:05 AM, Shalin Shekhar Mangar
> > <[EMAIL PROTECTED]> wrote:
> >> Setting the batchSize to 1 would mean that the Jdbc driver will keep
> >> 1 rows in memory *for each entity* which uses that data source (if
> >> correctly implemented by the driver). Not sure how well the Sql Server
> >> driver implements this. Also keep in mind that Solr also needs memory to
> >> index documents. You can probably try setting the batch size to a lower
> >> value.
> >>
> >> The regular memory tuning stuff should apply here too -- try disabling
> >> autoCommit and turn-off autowarming and see if it helps.
> >>
> >> On Wed, Jun 25, 2008 at 5:53 AM, wojtekpia <[EMAIL PROTECTED]>
> wrote:
> >>
> >>>
> >>> I'm trying to load ~10 million records into Solr using the
> >>> DataImportHandler.
> >>> I'm running out of memory (java.lang.OutOfMemoryError: Java heap space)
> >>> as
> >>> soon as I try loading more than about 5 million records.
> >>>
> >>> Here's my configuration:
> >>> I'm connecting to a SQL Server database using the sqljdbc driver. I've
> >>> given
> >>> my Solr instance 1.5 GB of memory. I have set the dataSource batchSize
> >>> to
> >>> 1. My SQL query is "select top XXX field1, ... from table1". I have
> >>> about 40 fields in my Solr schema.
> >>>
> >>> I thought the DataImportHandler would stream data from the DB rather
> >>> than
> >>> loading it all into memory at once. Is that not the case? Any thoughts
> >>> on
> >>> how to get around this (aside from getting a machine with more memory)?
> >>>
> >>> --
> >>> View this message in context:
> >>>
> http://www.nabble.com/DataImportHandler-running-out-of-memory-tp18102644p18102644.html
> >>> Sent from the Solr - User mailing list archive at Nabble.com.
> >>>
> >>>
> >>
> >>
> >> --
> >> Regards,
> >> Shalin Shekhar Mangar.
> >>
> >
> >
> >
> > --
> > --Noble Paul
> >
> >
>
> --
> View this message in context:
> http://www.nabble.com/DataImportHandler-running-out-of-memory-tp18102644p18115900.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
>


-- 
Regards,
Shalin Shekhar Mangar.


Re: "document commit" possible?

2008-06-25 Thread Chris Hostetter

: With the understanding that queries for newly indexed fields in this document
: will not return this newly added document, but a query for the document by its
: id will return any new stored fields. When the "real" commit (read: the commit
: that takes 10 minutes to complete) returns the newly indexed fields will be
: query-able.

This would be ... non-trivial.  Right now stored fields and indexed fields
are all handled using Lucene.  I seem to recall some discussions on
solr-dev a while back about adding some alternate field store mechanisms, in
which case I can *imagine* a field store that could be read immediately
on "add" ... but it's just theoretical.



-Hoss



Re: Nutch <-> Solr latest?

2008-06-25 Thread Chris Hostetter

: Im curious, is there a spot / patch for the latest on Nutch / Solr
: integration, Ive found a few pages (a few outdated it seems), it would be nice
: (?) if it worked as a DataSource type to DataImportHandler, but not sure if
: that fits w/ how it works.  Either way a nice contrib patch the way the DIH is
: already setup would be nice to have.
...
: Is there currently work ongoing on this?  Seems like it belongs in either / or
: project and not both.

My understanding is that previous work on bridging Nutch crawling with Solr
indexing involved patching Nutch and using a Nutch-specific schema.xml and
the client code which has since been committed as "SolrJ".

Most of the discussion seemed to take place on the Nutch list (which makes
sense since Nutch required the patching), so you may want to start there.

I'm not sure if Nutch integration would make sense as a DIH plugin (it
seems like the Nutch crawler could "push" the data much more easily than
DIH could pull it from the crawler), but if there is any advantage to
having plugin code running in Solr to support this then that would
absolutely make sense in the new /contrib area of Solr (which I believe
Otis already created/committed); any Nutch "plugins" or modifications
would obviously need to be made in Nutch.

-Hoss



NGramTokenizer issue

2008-06-25 Thread Jonathan Ariel
Hi,
I've been trying to use the NGramTokenizer and I ran into a problem.
It seems like solr is trying to match documents with all the tokens that the
analyzer returns from the query term. So if I index a document with a title
field with the value "nice dog" and search for "dog" (where the
NGramtokenizer is defined to generate tokens of min 2 and max 2) I won't get
any results.
I can see in the Analysis tool that the tokenizer generates the right
tokens, but then when solr searches it tries to match the exact Phrase
instead of the tokens.

I tried the same in Lucene and it works as expected, so it seems to be a
Solr issue. Any hint on where I should look in order to fix it?

Here you have the lucene code that I used to test the behavior of the lucene
NGramTokenizer:

import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.ngram.NGramTokenizer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.store.LockObtainFailedException;

public static void main(String[] args) throws ParseException,
        CorruptIndexException, LockObtainFailedException, IOException {

    // analyzer producing lowercased 2-character grams
    Analyzer n = new Analyzer() {
        @Override
        public TokenStream tokenStream(String s, Reader reader) {
            TokenStream result = new NGramTokenizer(reader, 2, 2);
            result = new LowerCaseFilter(result);
            return result;
        }
    };

    // index a single document with title "nice dog"
    IndexWriter writer = new IndexWriter("sample_index", n);
    Document doc = new Document();
    Field f = new Field("title", new StringReader("nice dog"));
    doc.add(f);
    writer.addDocument(doc);
    writer.close();

    // query "dog" through the same analyzer
    IndexSearcher is = new IndexSearcher("sample_index");
    QueryParser qp = new QueryParser("", n);
    Query parse = qp.parse("title:dog");
    Hits hits = is.search(parse);

    System.out.println(hits.length());
    System.out.println(parse.toString());
}


Thanks!!!

Jonathan


Re: Can I add field compression without reindexing?

2008-06-25 Thread Mike Klaas


On 24-Jun-08, at 4:26 PM, Chris Harris wrote:


I have an index that I eventually want to rebuild so I can set
compressed=true on a couple of fields. It's not really practical to rebuild
the whole thing right now, though. If I change my schema.xml to set
compressed=true and then keep adding new data to the existing index, will
this corrupt the index, or will the *new* data be stored in compressed
format, even while the old data is not compressed?


Hi Chris,

Yes, this should work without problems.

cheers,
-Mike


Re: DataImportHandler running out of memory

2008-06-25 Thread wojtekpia

It looks like that was the problem. With responseBuffering=adaptive, I'm able
to load all my data using the sqljdbc driver.
-- 
View this message in context: 
http://www.nabble.com/DataImportHandler-running-out-of-memory-tp18102644p18119732.html
Sent from the Solr - User mailing list archive at Nabble.com.



Sorting questions

2008-06-25 Thread Yugang Hu

Hi,

I have the same issue as described in:
http://www.nabble.com/solr-sorting-question-td17498596.html. I am trying
to have some categories before others in search results for different
search terms. For example, for search term "ABC", I want to show
Category "CCC" first, then Category "BBB", "AAA", "DDD", and for
search term "CBA", I want to show Category "DDD" first, then Category
"CCC", "AAA", "BBB"...


Is this possible in Solr? Has someone done this before?

Any help will be appreciated.

Thanks,

Yugang




Re: Lucene 2.4-dev source ?

2008-06-25 Thread Grant Ingersoll
Note, also, that the Manifest file in the JAR has information about the
exact SVN revision so that you can check it out from there.
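
A small sketch (not from the thread) of dumping that manifest; the jar file
name is a placeholder, and since the exact attribute key holding the revision
may vary, all main attributes are printed:

import java.util.Map;
import java.util.jar.JarFile;
import java.util.jar.Manifest;

public class PrintJarManifest {
    public static void main(String[] args) throws Exception {
        JarFile jar = new JarFile("lucene-core-2.4-dev.jar");
        Manifest mf = jar.getManifest();
        for (Map.Entry<Object, Object> e : mf.getMainAttributes().entrySet()) {
            System.out.println(e.getKey() + ": " + e.getValue());
        }
        jar.close();
    }
}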



On Jun 25, 2008, at 12:37 PM, Yonik Seeley wrote:


trunk is the latest version (which is currently 2.4-dev).
http://svn.apache.org/viewvc/lucene/java/trunk/

There is a contrib directory with things not in lucene-core:
http://svn.apache.org/viewvc/lucene/java/trunk/contrib/

-Yonik




Re: Lucene 2.4-dev source ?

2008-06-25 Thread Norberto Meijome
On Wed, 25 Jun 2008 20:22:06 -0400
Grant Ingersoll <[EMAIL PROTECTED]> wrote:

> Note, also, that the Manifest file in the JAR has information about  
> the exact SVN revision so that you can check it out from there.
> 
> 
> On Jun 25, 2008, at 12:37 PM, Yonik Seeley wrote:
> 
> > trunk is the latest version (which is currently 2.4-dev).
> > http://svn.apache.org/viewvc/lucene/java/trunk/
> >
> > There is a contrib directory with things not in lucene-core:
> > http://svn.apache.org/viewvc/lucene/java/trunk/contrib/
> >
> > -Yonik
> 

Great stuff, thanks Yonik, Grant!!

_
{Beto|Norberto|Numard} Meijome

"There is no limit to what a man can do or how far he can go if he doesn't mind 
who gets the credit."
   Robert Woodruff

I speak for myself, not my employer. Contents may be hot. Slippery when wet. 
Reading disclaimers makes you go blind. Writing them is worse. You have been 
Warned.


Re: NGramTokenizer issue

2008-06-25 Thread Norberto Meijome
On Wed, 25 Jun 2008 15:37:09 -0300
"Jonathan Ariel" <[EMAIL PROTECTED]> wrote:

> I've been trying to use the NGramTokenizer and I ran into a problem.
> It seems like solr is trying to match documents with all the tokens that the
> analyzer returns from the query term. So if I index a document with a title
> field with the value "nice dog" and search for "dog" (where the
> NGramtokenizer is defined to generate tokens of min 2 and max 2) I won't get
> any results.

Hi Jonathan,
I don't have the expertise yet to have gone straight into testing code with
lucene, but my 'black box' testing with ngramtokenizer seems to agree with what
you found - see my latest posts over the last couple of days about this.

Have you tried searching for 'do' or 'ni' or any search term with size =
minGramSize ? I've found that Solr matches results just fine then.

> I can see in the Analysis tool that the tokenizer generates the right
> tokens, but then when solr searches it tries to match the exact Phrase
> instead of the tokens.

+1

B

_
{Beto|Norberto|Numard} Meijome

"Some cause happiness wherever they go; others, whenever they go."
  Oscar Wilde

I speak for myself, not my employer. Contents may be hot. Slippery when wet.
Reading disclaimers makes you go blind. Writing them is worse. You have been
Warned.


Solr 1.3 deletes not working?

2008-06-25 Thread Galen Pahlke
Hi everyone,
I'm having trouble deleting documents from my solr 1.3 index.  To delete a
document, I post something like "<delete><id>12345</id></delete>" to the
solr server, then issue a commit.  However, I can still find the document in
the index via the query "id:12345".  The document remains visible even after
I restart the solr server.  I know the server is receiving my delete
commands, since deletesById goes up on the stats page, but docsDeleted stays
at 0.
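
For reference, a minimal sketch (not my actual client code) of the
delete-then-commit round trip described above, posted to the XML update
handler; the host, port and id value are placeholders.

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class DeleteById {
    private static void post(String xml) throws Exception {
        HttpURLConnection con = (HttpURLConnection)
                new URL("http://localhost:8983/solr/update").openConnection();
        con.setDoOutput(true);
        con.setRequestMethod("POST");
        con.setRequestProperty("Content-Type", "text/xml; charset=UTF-8");
        OutputStream out = con.getOutputStream();
        out.write(xml.getBytes("UTF-8"));
        out.close();
        System.out.println(xml + " -> HTTP " + con.getResponseCode());
        con.disconnect();
    }

    public static void main(String[] args) throws Exception {
        post("<delete><id>12345</id></delete>");  // remove the document
        post("<commit/>");                        // make the deletion visible
    }
}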

I've tried this with svn revisions 661499 and 671649, with the same results,
but these steps worked fine in solr 1.2.  Any ideas?

- Galen Pahlke


Re: Solr 1.3 deletes not working?

2008-06-25 Thread Yonik Seeley
On Wed, Jun 25, 2008 at 8:44 PM, Galen Pahlke <[EMAIL PROTECTED]> wrote:
> I'm having trouble deleting documents from my solr 1.3 index.  To delete a
> document, I post something like "<delete><id>12345</id></delete>" to the
> solr server, then issue a commit.  However, I can still find the document in
> the index via the query "id:12345".

That's strange... there are unit tests for this, and I just verified
it worked on the example data.
Perhaps the schema no longer matches what you indexed (or did you re-index?)
Make sure the uniqueKeyField specifies "id".

>  The document remains visible even after
> I restart the solr server.  I know the server is receiving my delete
> commands, since deletesById goes up on the stats page, but docsDeleted stays
> at 0.

docsDeleted is no longer tracked since Lucene now handles the document
overwriting itself.
It should probably be removed.

-Yonik


Re: Sorting questions

2008-06-25 Thread Yonik Seeley
It's not exactly what you want, but putting specific documents first
for certain queries has been done via
http://wiki.apache.org/solr/QueryElevationComponent

-Yonik

On Wed, Jun 25, 2008 at 6:58 PM, Yugang Hu <[EMAIL PROTECTED]> wrote:
> Hi,
>
> I have the same issue as described in:
> http://www.nabble.com/solr-sorting-question-td17498596.html. I am trying to
> have some categories before others in search results for different search
> terms. For example, for search term "ABC", I want to show Category "CCC"
> first, then Category "BBB", "AAA", "DDD", and for search term "CBA", I
> want to show Category "DDD" first, then Category "CCC", "AAA", "BBB"...
> Is this possible in Solr? Has someone done this before?
>
> Any help will be appreciated.
>
> Thanks,
>
> Yugang
>
>
>


Re: Solr 1.3 deletes not working?

2008-06-25 Thread Galen Pahlke
I originally tested with an index generated by solr 1.2, but when that
didn't work, I rebuilt the index from scratch.
From my schema.xml:


<fields>
  ...
  <field name="id" ... required="true"/>
  ...
</fields>

<uniqueKey>id</uniqueKey>


-Galen Pahlke

On Wed, Jun 25, 2008 at 7:00 PM, Yonik Seeley <[EMAIL PROTECTED]> wrote:

> On Wed, Jun 25, 2008 at 8:44 PM, Galen Pahlke <[EMAIL PROTECTED]> wrote:
> > I'm having trouble deleting documents from my solr 1.3 index.  To delete
> a
> > document, I post something like "<delete><id>12345</id></delete>" to the
> > solr server, then issue a commit.  However, I can still find the document
> in
> > the index via the query "id:12345".
>
> That's strange... there are unit tests for this, and I just verified
> it worked on the example data.
> Perhaps the schema no longer matches what you indexed (or did you
> re-index?)
> Make sure the uniqueKeyField specifies "id".
>
> >  The document remains visible even after
> > I restart the solr server.  I know the server is receiving my delete
> > commands, since deletesById goes up on the stats page, but docsDeleted
> stays
> > at 0.
>
> docsDeleted is no longer tracked since Lucene now handles the document
> overwriting itself.
> It should probably be removed.
>
> -Yonik
>


Re: Solr 1.3 deletes not working?

2008-06-25 Thread Yonik Seeley
On Wed, Jun 25, 2008 at 9:34 PM, Galen Pahlke <[EMAIL PROTECTED]> wrote:
> I originally tested with an index generated by solr 1.2, but when that
> didn't work, I rebuilt the index from scratch.
> From my schema.xml:
>
> <fields>
>   ...
>   <field name="id" ... required="true"/>
>   ...
> </fields>
>
> <uniqueKey>id</uniqueKey>

I tried this as well... changing the example schema id type to
integer, adding a document and deleting it.  Everything worked fine.

Something to watch out for: when you indexed the data, could it have
had spaces in the id or something?

If you can't figure it out, try reproducing it in a simple example
that can be added to a JIRA issue.

-Yonik


Re: NGramTokenizer issue

2008-06-25 Thread Jonathan Ariel
Well, it is working if I search just two letters, but that just tells me
that something is wrong somewhere.
The Analysis tool is showing me how "dog" is being tokenized to "do og", so
if, when indexing and querying, I'm using the same tokenizer/filters (which is
my case), I should get results even when searching "dog".

I've just created a small unit test in solr to try that out.

public void testNGram() throws IOException, Exception {
    assertU("adding doc with ngram field",
            adoc("id", "42", "text_ngram", "nice dog"));
    assertU("commiting", commit());

    assertQ("test query, expect one document",
            req("text_ngram:dog"),
            "//result[@numFound='1']");
}

As you can see I'm adding a document with the field text_ngram with the
value "nice dog".
Then I commit it and query it for "text_ngram:dog".

text_ngram is defined in the schema as:

[fieldType definition stripped by the mail archive; the index and query
analyzers both use solr.NGramTokenizerFactory with minGramSize="2" and
maxGramSize="2"]

This test passes. That means that I am able to get results when searching
"dog" on a ngram field, where min and max are set to 2 and where the value
of that field is "nice dog".
So it doesn't seem to be an issue in Solr, although I am having this error
when using Solr outside the unit test. It seems very improbable that this is
an environment issue.

Maybe I am doing something wrong. Any thoughts on that?

Thanks!

Jonathan

On Wed, Jun 25, 2008 at 9:44 PM, Norberto Meijome <[EMAIL PROTECTED]>
wrote:

> On Wed, 25 Jun 2008 15:37:09 -0300
> "Jonathan Ariel" <[EMAIL PROTECTED]> wrote:
>
> > I've been trying to use the NGramTokenizer and I ran into a problem.
> > It seems like solr is trying to match documents with all the tokens that
> the
> > analyzer returns from the query term. So if I index a document with a
> title
> > field with the value "nice dog" and search for "dog" (where the
> > NGramtokenizer is defined to generate tokens of min 2 and max 2) I won't
> get
> > any results.
>
> Hi Jonathan,
> I don't have the expertise yet to have gone straight into testing code with
> lucene, but my 'black box' testing with ngramtokenizer seems to agree with
> what
> you found - see my latest posts over the last couple of days about this.
>
> Have you tried searching for 'do' or 'ni' or any search term with size =
> minGramSize ? I've found that Solr matches results just fine then.
>
> > I can see in the Analysis tool that the tokenizer generates the right
> > tokens, but then when solr searches it tries to match the exact Phrase
> > instead of the tokens.
>
> +1
>
> B
>
> _
> {Beto|Norberto|Numard} Meijome
>
> "Some cause happiness wherever they go; others, whenever they go."
>  Oscar Wilde
>
> I speak for myself, not my employer. Contents may be hot. Slippery when
> wet.
> Reading disclaimers makes you go blind. Writing them is worse. You have
> been
> Warned.
>


Re: NGramTokenizer issue

2008-06-25 Thread Norberto Meijome
On Thu, 26 Jun 2008 10:44:32 +1000
Norberto Meijome <[EMAIL PROTECTED]> wrote:

> On Wed, 25 Jun 2008 15:37:09 -0300
> "Jonathan Ariel" <[EMAIL PROTECTED]> wrote:
> 
> > I've been trying to use the NGramTokenizer and I ran into a problem.
> > It seems like solr is trying to match documents with all the tokens that the
> > analyzer returns from the query term. So if I index a document with a title
> > field with the value "nice dog" and search for "dog" (where the
> > NGramtokenizer is defined to generate tokens of min 2 and max 2) I won't get
> > any results.
> 
> Hi Jonathan,
> I don't have the expertise yet to have gone straight into testing code with
> lucene, but my 'black box' testing with ngramtokenizer seems to agree with 
> what
> you found - see my latest posts over the last couple of days about this.
> 
> Have you tried searching for 'do' or 'ni' or any search term with size =
> minGramSize ? I've found that Solr matches results just fine then.

hi there,
I did some more tests with nGramTokenizer ... 

Summary:
5 tests are shown below; 4 work as expected, 1 fails. In particular, the
failure occurs when searching, on a field using the NGramTokenizerFactory with
minGramSize != maxGramSize, and length(q) > minGramSize. I've reproduced it
with several variations of minGramSize, length(q) and terms, both in the
stored field and the query.

My setup:
1.3 nightly code from 2008-06-25,  FreeBSD 7,  JDK 1.6, Jetty from sample app.

My documents are loaded via CSV, with 1 field copied via copyField to all the
artist_ngram variants.
Relevant data loaded into documents: "nice dog", "the nice dog canine",
"Triumph The Insult Comic Dog".
The id field is the same data as a string.

I am searching directly on the field with q=field:query, qt=standard.
After each schema or solrconfig change, I stop the service, delete the data
directory, start the server and post the docs again.

--

[schema.xml definitions stripped by the mail archive; per the tests below,
artist_ngram uses 4-character grams, artist_ngram2 uses 2-character grams,
and artist_var_ngram uses minGramSize=2 / maxGramSize=10]

-

Test 1: OK
http://localhost:8983/solr/_test_/select?q=artist_ngram2:dog&debugQuery=true&qt=standard

returns all 3 docs as expected.  If I understood your mail correctly,
Jonathan, you aren't getting results?

Test 2 : OK
http://localhost:8983/solr/_test_/select?q=artist_ngram:dog&debugQuery=true&qt=standard

returns 0 documents as expected. artist_ngram has 4 letters per token, we gave
it 3. 

Same result when searching on artist_var_ngram field for same reasons.

Test 3: OK
http://localhost:8983/solr/_test_/select?q=artist_ngram2:insul&debugQuery=true&qt=standard

Returns 1 doc, "Triumph The Insult Comic Dog", as expected. The query gets
tokenized into 2-letter tokens which match tokens in the index.

Same result when searching on the artist_ngram field, for the same reasons
(except that we get 4-char tokens out of the 5-char query).

Test 4 : FAIL!!
http://localhost:8983/solr/_test_/select?q=artist_var_ngram:insul&debugQuery=true&qt=standard

Returns 0 docs. I think it should have matched the same doc as in Test 3,
because the query would be tokenized into 4 and 5 char tokens - all of which
are included in the index as the field is tokenized with all the range between
2 and 10 chars. 
Using Luke (the java app, not the filter), the field shows the tokens shown
after my signature.
Using analysis.jsp, it shows that we should get a match in several tokens.

The query is parsed as follows :
[..]

artist_var_ngram:insul
artist_var_ngram:insul
PhraseQuery(artist_var_ngram:"in ns su ul ins nsu sul insu nsul insul")
artist_var_ngram:"in ns su ul ins nsu sul insu nsul insul"
OldLuceneQParser

[...]


Test 5 : OK
http://localhost:8983/solr/_test_/select?q=artist_var_ngram:ul&debugQuery=true&qt=standard

Searching for a query which won't be tokenized further (ie, its length = 
minGramSize), it works as expected. 


It seems to me there is a problem with matching on fields where minGramSize !=
maxGramSize. I don't know enough to point to the cause.

In the meantime, I am creating multiple n-gram fields, with growing sizes, min 
== max, and using dismax across the lot... not pretty, but it'll do until I 
understand why 'Test 4' isn't working.

Please let me know if any more info / tests are needed. Or if I should open an 
issue in JIRA.

cheers,
B
_
{Beto|Norberto|Numard} Meijome

"A dream you dream together is reality."
  John Lennon

I speak for myself, not my employer. Contents may be hot. Slippery when wet.
Reading disclaimers makes you go blind. Writing them is worse. You have been
Warned.

-
tokens for test4 in index, as per luke, field artist_var_ngram


tr, ri, iu, um, mp, 
ph, h ,  t, th, he, 
e ,  i, in, ns, su, 
ul, lt, t ,  c, co, 
om, mi, ic, c ,  d, 
do, og, tri, riu, ium, 
ump, mph, ph , h t,  th, 
the, he , e i,  in, ins, 
nsu, sul, ult, lt , t c, 
 co, com, omi, mic, ic , 
c d,  do, dog, t

Re: DataImportHandler running out of memory

2008-06-25 Thread Noble Paul നോബിള്‍ नोब्ळ्
We must document this information in the wiki.  We never had a chance
to play with MS SQL Server.
--Noble

On Thu, Jun 26, 2008 at 12:38 AM, wojtekpia <[EMAIL PROTECTED]> wrote:
>
> It looks like that was the problem. With responseBuffering=adaptive, I'm able
> to load all my data using the sqljdbc driver.
> --
> View this message in context: 
> http://www.nabble.com/DataImportHandler-running-out-of-memory-tp18102644p18119732.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
>



-- 
--Noble Paul


Re: NGramTokenizer issue

2008-06-25 Thread Jonathan Ariel
OK, I played a bit more with that.
So I had a difference between my unit test and Solr. In Solr I'm actually
using a solr.RemoveDuplicatesTokenFilterFactory when querying. I tried to add
that to the test, and it fails.
So in my case I think the error is in trying to use a
solr.RemoveDuplicatesTokenFilterFactory with a solr.NGramTokenizerFactory. I
don't know why using solr.RemoveDuplicatesTokenFilterFactory generates "do
og dog" for "dog", while not using it just generates "do og".
Either way, I think that when using ngrams I shouldn't use
RemoveDuplicatesTokenFilterFactory. Removing duplicates might change the
structure of the word.


On Thu, Jun 26, 2008 at 12:25 AM, Jonathan Ariel <[EMAIL PROTECTED]> wrote:

> Well, it is working if I search just two letters, but that just tells me
> that something is wrong somewhere.
> The Analysis tools is showing me how "dog" is being tokenized to "do og",
> so if when indexing and querying I'm using the same tokenizer/filters (which
> is my case) I should get results even when searching "dog".
>
> I've just created a small unit test in solr to try that out.
>
> public void testNGram() throws IOException, Exception {
> assertU("adding doc with ngram field",adoc("id", "42",
> "text_ngram", "nice dog"));
> assertU("commiting",commit());
>
> assertQ("test query, expect one document",
> req("text_ngram:dog")
> ,"//[EMAIL PROTECTED]'1']"
> );
> }
>
> As you can see I'm adding a document with the field text_ngram with the
> value "nice dog".
> Then I commit it and query it for "text_ngram:dog".
>
> text_ngram is defined in the schema as:
> [fieldType definition stripped by the mail archive; the index and query
> analyzers both use solr.NGramTokenizerFactory with minGramSize="2" and
> maxGramSize="2"]
> 
>
> This test passes. That means that I am able to get results when searching
> "dog" on a ngram field, where min and max are set to 2 and where the value
> of that field is "nice dog".
> So it doesn't seems to be a issue in solr, although I am having this error
> when using solr outside the unit test. It seems very improbable to think on
> an environment issue.
>
> Maybe I am doing something wrong. Any thoughts on that?
>
> Thanks!
>
> Jonathan
>
>
> On Wed, Jun 25, 2008 at 9:44 PM, Norberto Meijome <[EMAIL PROTECTED]>
> wrote:
>
>> On Wed, 25 Jun 2008 15:37:09 -0300
>> "Jonathan Ariel" <[EMAIL PROTECTED]> wrote:
>>
>> > I've been trying to use the NGramTokenizer and I ran into a problem.
>> > It seems like solr is trying to match documents with all the tokens that
>> the
>> > analyzer returns from the query term. So if I index a document with a
>> title
>> > field with the value "nice dog" and search for "dog" (where the
>> > NGramtokenizer is defined to generate tokens of min 2 and max 2) I won't
>> get
>> > any results.
>>
>> Hi Jonathan,
>> I don't have the expertise yet to have gone straight into testing code
>> with
>> lucene, but my 'black box' testing with ngramtokenizer seems to agree with
>> what
>> you found - see my latest posts over the last couple of days about this.
>>
>> Have you tried searching for 'do' or 'ni' or any search term with size =
>> minGramSize ? I've found that Solr matches results just fine then.
>>
>> > I can see in the Analysis tool that the tokenizer generates the right
>> > tokens, but then when solr searches it tries to match the exact Phrase
>> > instead of the tokens.
>>
>> +1
>>
>> B
>>
>> _
>> {Beto|Norberto|Numard} Meijome
>>
>> "Some cause happiness wherever they go; others, whenever they go."
>>  Oscar Wilde
>>
>> I speak for myself, not my employer. Contents may be hot. Slippery when
>> wet.
>> Reading disclaimers makes you go blind. Writing them is worse. You have
>> been
>> Warned.
>>
>
>


Re: NGramTokenizer issue

2008-06-25 Thread Norberto Meijome
On Thu, 26 Jun 2008 01:15:34 -0300
"Jonathan Ariel" <[EMAIL PROTECTED]> wrote:

> Ok. Played a bit more with that.
> So I had a difference between my unit test and solr. In solr I'm actually
> using a solr.RemoveDuplicatesTokenFilterFactory when querying. Tried to add
> that to the test, and it fails.
> So in my case I think the error is trying to use a
> solr.RemoveDuplicatesTokenFilterFactory with a solr.NGramTokenizerFactory. I
> don't know why using solr.RemoveDuplicatesTokenFilterFactory generates "do
> og dog" for "dog" when not using it will just generate "do og".
> Either way I think that when using ngram I shouldn't use
> RemoveDuplicatesTokenFilterFactory. Removing duplicates might change the
> structure of the word.

Hi Jonathan,
My apologies, I found the issue with removeDuplicates late last night and I
forgot to mention it.

The 5 tests I included in my other email don't use removeDuplicates for this
reason. I am still interested to know why one of them is failing, when
analysis.jsp + common_sense ;) say it should.

B

_
{Beto|Norberto|Numard} Meijome

Q. How do you make God laugh?
A. Tell him your plans.

I speak for myself, not my employer. Contents may be hot. Slippery when wet. 
Reading disclaimers makes you go blind. Writing them is worse. You have been 
Warned.