Solr spell-checker
Hello all, I have Solr 1.2 installed and I was wondering how Solr 1.2 deals with checking misspelled strings, and also how to configure it? I'd appreciate any docs on this topic. Thanks a lot, ak
Re: Solr spell-checker
Take a look at http://wiki.apache.org/solr/SpellCheckerRequestHandler If you can use a nightly build of Solr 1.3 then you can use the new and better http://wiki.apache.org/solr/SpellCheckComponent On Wed, Jun 25, 2008 at 2:36 PM, dudes dudes <[EMAIL PROTECTED]> wrote: > > Hello all, > > I have Solr 1.2 installed and I was wandering how Solr 1.2 deals with > checking miss spelled strings and also how to configure it ? > appreciate any docs on this topic .. > > thanks a lot > ak > _ > > All new Live Search at Live.com > > http://clk.atdmt.com/UKM/go/msnnkmgl001006ukm/direct/01/ -- Regards, Shalin Shekhar Mangar.
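For illustration, a Solr 1.2-style entry for the SpellCheckerRequestHandler in solrconfig.xml might look roughly like the sketch below; the handler name, source field and index directory here are assumptions, so check the exact parameter names against the wiki page above.

  <requestHandler name="spellchecker" class="solr.SpellCheckerRequestHandler" startup="lazy">
    <!-- directory (relative to the Solr data dir) holding the spelling index -->
    <str name="spellcheckerIndexDir">spell</str>
    <!-- field whose indexed terms feed the spelling dictionary -->
    <str name="termSourceField">word</str>
  </requestHandler>

Queries then go to something like /select?qt=spellchecker&q=misspeled, typically after issuing cmd=rebuild once to build the spelling index.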
Re: DataImportHandler running out of memory
I think it's a bit different. I ran into this exact problem about two weeks ago on a 13 million record DB. MySQL doesn't honor the fetch size for it's v5 JDBC driver. See http://www.databasesandlife.com/reading-row-by-row-into-java-from-mysql/ or do a search for MySQL fetch size. You actually have to do setFetchSize(Integer.MIN_VALUE) (-1 doesn't work) in order to get streaming in MySQL. -Grant On Jun 24, 2008, at 10:35 PM, Shalin Shekhar Mangar wrote: Setting the batchSize to 1 would mean that the Jdbc driver will keep 1 rows in memory *for each entity* which uses that data source (if correctly implemented by the driver). Not sure how well the Sql Server driver implements this. Also keep in mind that Solr also needs memory to index documents. You can probably try setting the batch size to a lower value. The regular memory tuning stuff should apply here too -- try disabling autoCommit and turn-off autowarming and see if it helps. On Wed, Jun 25, 2008 at 5:53 AM, wojtekpia <[EMAIL PROTECTED]> wrote: I'm trying to load ~10 million records into Solr using the DataImportHandler. I'm running out of memory (java.lang.OutOfMemoryError: Java heap space) as soon as I try loading more than about 5 million records. Here's my configuration: I'm connecting to a SQL Server database using the sqljdbc driver. I've given my Solr instance 1.5 GB of memory. I have set the dataSource batchSize to 1. My SQL query is "select top XXX field1, ... from table1". I have about 40 fields in my Solr schema. I thought the DataImportHandler would stream data from the DB rather than loading it all into memory at once. Is that not the case? Any thoughts on how to get around this (aside from getting a machine with more memory)? -- View this message in context: http://www.nabble.com/DataImportHandler-running-out-of-memory-tp18102644p18102644.html Sent from the Solr - User mailing list archive at Nabble.com. -- Regards, Shalin Shekhar Mangar. -- Grant Ingersoll http://www.lucidimagination.com Lucene Helpful Hints: http://wiki.apache.org/lucene-java/BasicsOfPerformance http://wiki.apache.org/lucene-java/LuceneFAQ
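For anyone hitting the same wall with plain JDBC, a minimal sketch of the streaming setup Grant describes for MySQL Connector/J is below; the URL, credentials, table and query are placeholders, not taken from the original post.

  import java.sql.*;

  public class MySqlStreamingExample {
      public static void main(String[] args) throws Exception {
          // Illustrative connection details; substitute your own URL and credentials.
          Connection conn = DriverManager.getConnection(
                  "jdbc:mysql://localhost/mydb", "user", "password");
          // Connector/J only streams when the statement is forward-only, read-only
          // and the fetch size is Integer.MIN_VALUE (-1 does not work).
          Statement stmt = conn.createStatement(
                  ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY);
          stmt.setFetchSize(Integer.MIN_VALUE);
          ResultSet rs = stmt.executeQuery("SELECT id, title FROM big_table");
          while (rs.next()) {
              // process one row at a time instead of buffering the whole result set
          }
          rs.close();
          stmt.close();
          conn.close();
      }
  }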
Re: DataImportHandler running out of memory
I'm assuming, of course, that the DIH doesn't automatically modify the SQL statement according to the batch size. -Grant On Jun 25, 2008, at 7:05 AM, Grant Ingersoll wrote: I think it's a bit different. I ran into this exact problem about two weeks ago on a 13 million record DB. MySQL doesn't honor the fetch size for it's v5 JDBC driver. See http://www.databasesandlife.com/reading-row-by-row-into-java-from-mysql/ or do a search for MySQL fetch size. You actually have to do setFetchSize(Integer.MIN_VALUE) (-1 doesn't work) in order to get streaming in MySQL. -Grant On Jun 24, 2008, at 10:35 PM, Shalin Shekhar Mangar wrote: Setting the batchSize to 1 would mean that the Jdbc driver will keep 1 rows in memory *for each entity* which uses that data source (if correctly implemented by the driver). Not sure how well the Sql Server driver implements this. Also keep in mind that Solr also needs memory to index documents. You can probably try setting the batch size to a lower value. The regular memory tuning stuff should apply here too -- try disabling autoCommit and turn-off autowarming and see if it helps. On Wed, Jun 25, 2008 at 5:53 AM, wojtekpia <[EMAIL PROTECTED]> wrote: I'm trying to load ~10 million records into Solr using the DataImportHandler. I'm running out of memory (java.lang.OutOfMemoryError: Java heap space) as soon as I try loading more than about 5 million records. Here's my configuration: I'm connecting to a SQL Server database using the sqljdbc driver. I've given my Solr instance 1.5 GB of memory. I have set the dataSource batchSize to 1. My SQL query is "select top XXX field1, ... from table1". I have about 40 fields in my Solr schema. I thought the DataImportHandler would stream data from the DB rather than loading it all into memory at once. Is that not the case? Any thoughts on how to get around this (aside from getting a machine with more memory)? -- View this message in context: http://www.nabble.com/DataImportHandler-running-out-of-memory-tp18102644p18102644.html Sent from the Solr - User mailing list archive at Nabble.com. -- Regards, Shalin Shekhar Mangar. -- Grant Ingersoll http://www.lucidimagination.com Lucene Helpful Hints: http://wiki.apache.org/lucene-java/BasicsOfPerformance http://wiki.apache.org/lucene-java/LuceneFAQ -- Grant Ingersoll http://www.lucidimagination.com Lucene Helpful Hints: http://wiki.apache.org/lucene-java/BasicsOfPerformance http://wiki.apache.org/lucene-java/LuceneFAQ
RE: Solr spell-checker
thanks for your kind reply > Date: Wed, 25 Jun 2008 14:48:38 +0530 > From: [EMAIL PROTECTED] > To: solr-user@lucene.apache.org > Subject: Re: Solr spell-checker > > Take a look at http://wiki.apache.org/solr/SpellCheckerRequestHandler > > If you can use a nightly build of Solr 1.3 then you can use the new and > better http://wiki.apache.org/solr/SpellCheckComponent > > On Wed, Jun 25, 2008 at 2:36 PM, dudes dudes wrote: > >> >> Hello all, >> >> I have Solr 1.2 installed and I was wandering how Solr 1.2 deals with >> checking miss spelled strings and also how to configure it ? >> appreciate any docs on this topic .. >> >> thanks a lot >> ak >> _ >> >> All new Live Search at Live.com >> >> http://clk.atdmt.com/UKM/go/msnnkmgl001006ukm/direct/01/ > > > > > -- > Regards, > Shalin Shekhar Mangar. _ http://clk.atdmt.com/UKM/go/msnnkmgl001002ukm/direct/01/
Re: DataImportHandler running out of memory
The OP is actually using Sql Server (not MySql) as per his mail. On Wed, Jun 25, 2008 at 4:40 PM, Grant Ingersoll <[EMAIL PROTECTED]> wrote: > I'm assuming, of course, that the DIH doesn't automatically modify the SQL > statement according to the batch size. > > -Grant > > > On Jun 25, 2008, at 7:05 AM, Grant Ingersoll wrote: > > I think it's a bit different. I ran into this exact problem about two >> weeks ago on a 13 million record DB. MySQL doesn't honor the fetch size for >> it's v5 JDBC driver. >> >> See >> http://www.databasesandlife.com/reading-row-by-row-into-java-from-mysql/ or >> do a search for MySQL fetch size. >> >> You actually have to do setFetchSize(Integer.MIN_VALUE) (-1 doesn't work) >> in order to get streaming in MySQL. >> >> -Grant >> >> >> On Jun 24, 2008, at 10:35 PM, Shalin Shekhar Mangar wrote: >> >> Setting the batchSize to 1 would mean that the Jdbc driver will keep >>> 1 rows in memory *for each entity* which uses that data source (if >>> correctly implemented by the driver). Not sure how well the Sql Server >>> driver implements this. Also keep in mind that Solr also needs memory to >>> index documents. You can probably try setting the batch size to a lower >>> value. >>> >>> The regular memory tuning stuff should apply here too -- try disabling >>> autoCommit and turn-off autowarming and see if it helps. >>> >>> On Wed, Jun 25, 2008 at 5:53 AM, wojtekpia <[EMAIL PROTECTED]> wrote: >>> >>> I'm trying to load ~10 million records into Solr using the DataImportHandler. I'm running out of memory (java.lang.OutOfMemoryError: Java heap space) as soon as I try loading more than about 5 million records. Here's my configuration: I'm connecting to a SQL Server database using the sqljdbc driver. I've given my Solr instance 1.5 GB of memory. I have set the dataSource batchSize to 1. My SQL query is "select top XXX field1, ... from table1". I have about 40 fields in my Solr schema. I thought the DataImportHandler would stream data from the DB rather than loading it all into memory at once. Is that not the case? Any thoughts on how to get around this (aside from getting a machine with more memory)? -- View this message in context: http://www.nabble.com/DataImportHandler-running-out-of-memory-tp18102644p18102644.html Sent from the Solr - User mailing list archive at Nabble.com. >>> >>> -- >>> Regards, >>> Shalin Shekhar Mangar. >>> >> >> -- >> Grant Ingersoll >> http://www.lucidimagination.com >> >> Lucene Helpful Hints: >> http://wiki.apache.org/lucene-java/BasicsOfPerformance >> http://wiki.apache.org/lucene-java/LuceneFAQ >> >> >> >> >> >> >> >> > -- > Grant Ingersoll > http://www.lucidimagination.com > > Lucene Helpful Hints: > http://wiki.apache.org/lucene-java/BasicsOfPerformance > http://wiki.apache.org/lucene-java/LuceneFAQ > > > > > > > > -- Regards, Shalin Shekhar Mangar.
Re: DataImportHandler running out of memory
DIH does not modify SQL. This value is used as a connection property --Noble On Wed, Jun 25, 2008 at 4:40 PM, Grant Ingersoll <[EMAIL PROTECTED]> wrote: > I'm assuming, of course, that the DIH doesn't automatically modify the SQL > statement according to the batch size. > > -Grant > > On Jun 25, 2008, at 7:05 AM, Grant Ingersoll wrote: > >> I think it's a bit different. I ran into this exact problem about two >> weeks ago on a 13 million record DB. MySQL doesn't honor the fetch size for >> it's v5 JDBC driver. >> >> See >> http://www.databasesandlife.com/reading-row-by-row-into-java-from-mysql/ or >> do a search for MySQL fetch size. >> >> You actually have to do setFetchSize(Integer.MIN_VALUE) (-1 doesn't work) >> in order to get streaming in MySQL. >> >> -Grant >> >> >> On Jun 24, 2008, at 10:35 PM, Shalin Shekhar Mangar wrote: >> >>> Setting the batchSize to 1 would mean that the Jdbc driver will keep >>> 1 rows in memory *for each entity* which uses that data source (if >>> correctly implemented by the driver). Not sure how well the Sql Server >>> driver implements this. Also keep in mind that Solr also needs memory to >>> index documents. You can probably try setting the batch size to a lower >>> value. >>> >>> The regular memory tuning stuff should apply here too -- try disabling >>> autoCommit and turn-off autowarming and see if it helps. >>> >>> On Wed, Jun 25, 2008 at 5:53 AM, wojtekpia <[EMAIL PROTECTED]> wrote: >>> I'm trying to load ~10 million records into Solr using the DataImportHandler. I'm running out of memory (java.lang.OutOfMemoryError: Java heap space) as soon as I try loading more than about 5 million records. Here's my configuration: I'm connecting to a SQL Server database using the sqljdbc driver. I've given my Solr instance 1.5 GB of memory. I have set the dataSource batchSize to 1. My SQL query is "select top XXX field1, ... from table1". I have about 40 fields in my Solr schema. I thought the DataImportHandler would stream data from the DB rather than loading it all into memory at once. Is that not the case? Any thoughts on how to get around this (aside from getting a machine with more memory)? -- View this message in context: http://www.nabble.com/DataImportHandler-running-out-of-memory-tp18102644p18102644.html Sent from the Solr - User mailing list archive at Nabble.com. >>> >>> >>> -- >>> Regards, >>> Shalin Shekhar Mangar. >> >> -- >> Grant Ingersoll >> http://www.lucidimagination.com >> >> Lucene Helpful Hints: >> http://wiki.apache.org/lucene-java/BasicsOfPerformance >> http://wiki.apache.org/lucene-java/LuceneFAQ >> >> >> >> >> >> >> > > -- > Grant Ingersoll > http://www.lucidimagination.com > > Lucene Helpful Hints: > http://wiki.apache.org/lucene-java/BasicsOfPerformance > http://wiki.apache.org/lucene-java/LuceneFAQ > > > > > > > > -- --Noble Paul
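In other words, batchSize only travels from data-config.xml into the JDBC statement's fetch size; the SQL itself is untouched. A sketch of where the attribute sits (the driver class, URL, credentials and query below are illustrative, not taken from the original post):

  <dataConfig>
    <dataSource type="JdbcDataSource"
                driver="com.microsoft.sqlserver.jdbc.SQLServerDriver"
                url="jdbc:sqlserver://localhost;databaseName=mydb"
                user="solr" password="secret"
                batchSize="-1"/>
    <document>
      <entity name="item" query="select top 10000000 field1 from table1">
        <field column="field1" name="field1"/>
      </entity>
    </document>
  </dataConfig>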
Re: DataImportHandler running out of memory
The latest patch sets fetchSize as Integer.MIN_VALUE if -1 is passed. It is added specifically for mysql driver --Noble On Wed, Jun 25, 2008 at 4:35 PM, Grant Ingersoll <[EMAIL PROTECTED]> wrote: > I think it's a bit different. I ran into this exact problem about two weeks > ago on a 13 million record DB. MySQL doesn't honor the fetch size for it's > v5 JDBC driver. > > See > http://www.databasesandlife.com/reading-row-by-row-into-java-from-mysql/ or > do a search for MySQL fetch size. > > You actually have to do setFetchSize(Integer.MIN_VALUE) (-1 doesn't work) in > order to get streaming in MySQL. > > -Grant > > > On Jun 24, 2008, at 10:35 PM, Shalin Shekhar Mangar wrote: > >> Setting the batchSize to 1 would mean that the Jdbc driver will keep >> 1 rows in memory *for each entity* which uses that data source (if >> correctly implemented by the driver). Not sure how well the Sql Server >> driver implements this. Also keep in mind that Solr also needs memory to >> index documents. You can probably try setting the batch size to a lower >> value. >> >> The regular memory tuning stuff should apply here too -- try disabling >> autoCommit and turn-off autowarming and see if it helps. >> >> On Wed, Jun 25, 2008 at 5:53 AM, wojtekpia <[EMAIL PROTECTED]> wrote: >> >>> >>> I'm trying to load ~10 million records into Solr using the >>> DataImportHandler. >>> I'm running out of memory (java.lang.OutOfMemoryError: Java heap space) >>> as >>> soon as I try loading more than about 5 million records. >>> >>> Here's my configuration: >>> I'm connecting to a SQL Server database using the sqljdbc driver. I've >>> given >>> my Solr instance 1.5 GB of memory. I have set the dataSource batchSize to >>> 1. My SQL query is "select top XXX field1, ... from table1". I have >>> about 40 fields in my Solr schema. >>> >>> I thought the DataImportHandler would stream data from the DB rather than >>> loading it all into memory at once. Is that not the case? Any thoughts on >>> how to get around this (aside from getting a machine with more memory)? >>> >>> -- >>> View this message in context: >>> >>> http://www.nabble.com/DataImportHandler-running-out-of-memory-tp18102644p18102644.html >>> Sent from the Solr - User mailing list archive at Nabble.com. >>> >>> >> >> >> -- >> Regards, >> Shalin Shekhar Mangar. > > -- > Grant Ingersoll > http://www.lucidimagination.com > > Lucene Helpful Hints: > http://wiki.apache.org/lucene-java/BasicsOfPerformance > http://wiki.apache.org/lucene-java/LuceneFAQ
Re: How to debug ?
On Tue, 24 Jun 2008 19:17:58 -0700 Ryan McKinley <[EMAIL PROTECTED]> wrote: > also, check the LukeRequestHandler > > if there is a document you think *should* match, you can see what > tokens it has actually indexed... > hi Ryan, I can't see the tokens generated using LukeRequestHandler. I can get to the document I want : http://localhost:8983/solr/_test_/admin/luke/?id=Jay%20Rock and for the field I am interested , i get only : [...] ngram ITS-- ITS-- Jay Rock Jay Rock 1.0 0 [...] ( all the other fields look pretty much identical , none of them show the tokens generated). using the luke tool itself ( lukeall.jar ,source # 0.8.1, linked against Lucene's 2.4 libs bundled with the nightly build), I see the following tokens, for this document + field: ja, ay, y , r, ro, oc, ck, jay, ay , y r, ro, roc, ock, jay , ay r, y ro, roc, rock, jay r, ay ro, y roc, rock, jay ro, ay roc, y rock, jay roc, ay rock, jay rock Which is precisely what I expect, given that my 'ngram' type is defined as : My question now is, was I supposed to get any more information from LukeRequestHandler ? furthermore, if I perform , on this same core with exactly this data : http://localhost:8983/solr/_test_/select?q=artist_ngram:ro I get this document returned (and many others). but, if I search for 'roc' instead of 'ro' : http://localhost:8983/solr/_test_/select?q=artist_ngram:roc − − 0 48 − artist_ngram:roc true − artist_ngram:roc artist_ngram:roc PhraseQuery(artist_ngram:"ro oc roc") artist_ngram:"ro oc roc" OldLuceneQParser − .[...] Is searching on nGram tokenized fields limited to the minGramSize ? Thanks for any pointers you can provide, B _ {Beto|Norberto|Numard} Meijome "I didn't attend the funeral, but I sent a nice letter saying I approved of it." Mark Twain I speak for myself, not my employer. Contents may be hot. Slippery when wet. Reading disclaimers makes you go blind. Writing them is worse. You have been Warned.
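(The field type definition was stripped from the message above; judging from the 2-character-and-longer grams listed, it was probably something along these lines, with the exact min/max gram sizes being a guess.)

  <fieldType name="ngram" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.NGramTokenizerFactory" minGramSize="2" maxGramSize="8"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>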
Lucene 2.4-dev source ?
Hi, where can I find these sources? I have the binary jars included with the nightly builds,but I'd like to look @ the code of some of the objects. In particular, http://svn.apache.org/viewvc/lucene/java/ doesnt have any reference to 2.4, and http://svn.apache.org/viewvc/lucene/java/trunk/src/java/org/apache/lucene/analysis/ doesn't include org.apache.lucene.analysis.ngram.NGramTokenFilter ,which is one of what I am after... thanks! B _ {Beto|Norberto|Numard} Meijome Real Programmers don't comment their code. If it was hard to write, it should be hard to understand and even harder to modify. I speak for myself, not my employer. Contents may be hot. Slippery when wet. Reading disclaimers makes you go blind. Writing them is worse. You have been Warned.
Re: Lucene 2.4-dev source ?
trunk is the latest version (which is currently 2.4-dev). http://svn.apache.org/viewvc/lucene/java/trunk/ There is a contrib directory with things not in lucene-core: http://svn.apache.org/viewvc/lucene/java/trunk/contrib/ -Yonik
Re: DataImportHandler running out of memory
I'm trying with batchSize=-1 now. So far it seems to be working, but very slowly. I will update when it completes or crashes. Even with a batchSize of 100 I was running out of memory. I'm running on a 32-bit Windows machine. I've set the -Xmx to 1.5 GB - I believe that's the maximum for my environment. The batchSize parameter doesn't seem to control what happens... when I select top 5,000,000 with a batchSize of 10,000, it works. When I select top 10,000,000 with the same batchSize, it runs out of memory. Also, I'm using the 469 patch posted on 2008-06-11 08:41 AM. Noble Paul നോബിള് नोब्ळ् wrote: > > DIH streams rows one by one. > set the fetchSize="-1" this might help. It may make the indexing a bit > slower but memory consumption would be low. > The memory is consumed by the jdbc driver. try tuning the -Xmx value for > the VM > --Noble > > On Wed, Jun 25, 2008 at 8:05 AM, Shalin Shekhar Mangar > <[EMAIL PROTECTED]> wrote: >> Setting the batchSize to 1 would mean that the Jdbc driver will keep >> 1 rows in memory *for each entity* which uses that data source (if >> correctly implemented by the driver). Not sure how well the Sql Server >> driver implements this. Also keep in mind that Solr also needs memory to >> index documents. You can probably try setting the batch size to a lower >> value. >> >> The regular memory tuning stuff should apply here too -- try disabling >> autoCommit and turn-off autowarming and see if it helps. >> >> On Wed, Jun 25, 2008 at 5:53 AM, wojtekpia <[EMAIL PROTECTED]> wrote: >> >>> >>> I'm trying to load ~10 million records into Solr using the >>> DataImportHandler. >>> I'm running out of memory (java.lang.OutOfMemoryError: Java heap space) >>> as >>> soon as I try loading more than about 5 million records. >>> >>> Here's my configuration: >>> I'm connecting to a SQL Server database using the sqljdbc driver. I've >>> given >>> my Solr instance 1.5 GB of memory. I have set the dataSource batchSize >>> to >>> 1. My SQL query is "select top XXX field1, ... from table1". I have >>> about 40 fields in my Solr schema. >>> >>> I thought the DataImportHandler would stream data from the DB rather >>> than >>> loading it all into memory at once. Is that not the case? Any thoughts >>> on >>> how to get around this (aside from getting a machine with more memory)? >>> >>> -- >>> View this message in context: >>> http://www.nabble.com/DataImportHandler-running-out-of-memory-tp18102644p18102644.html >>> Sent from the Solr - User mailing list archive at Nabble.com. >>> >>> >> >> >> -- >> Regards, >> Shalin Shekhar Mangar. >> > > > > -- > --Noble Paul > > -- View this message in context: http://www.nabble.com/DataImportHandler-running-out-of-memory-tp18102644p18115900.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: DataImportHandler running out of memory
Hi, I don't think the problem is within DataImportHandler since it just streams resultset. The fetchSize is just passed as a parameter passed to Statement#setFetchSize() and the Jdbc driver is supposed to honor it and keep only that many rows in memory. From what I could find about the Sql Server driver -- there's a connection property called responseBuffering whose default value is "full" which causes the entire result set is fetched. See http://msdn.microsoft.com/en-us/library/ms378988.aspx for more details. You can set connection properties like this directly in the jdbc url specified in DataImportHandler's dataSource configuration. On Wed, Jun 25, 2008 at 10:17 PM, wojtekpia <[EMAIL PROTECTED]> wrote: > > I'm trying with batchSize=-1 now. So far it seems to be working, but very > slowly. I will update when it completes or crashes. > > Even with a batchSize of 100 I was running out of memory. > > I'm running on a 32-bit Windows machine. I've set the -Xmx to 1.5 GB - I > believe that's the maximum for my environment. > > The batchSize parameter doesn't seem to control what happens... when I > select top 5,000,000 with a batchSize of 10,000, it works. When I select > top > 10,000,000 with the same batchSize, it runs out of memory. > > Also, I'm using the 469 patch posted on 2008-06-11 08:41 AM. > > > Noble Paul നോബിള് नोब्ळ् wrote: > > > > DIH streams rows one by one. > > set the fetchSize="-1" this might help. It may make the indexing a bit > > slower but memory consumption would be low. > > The memory is consumed by the jdbc driver. try tuning the -Xmx value for > > the VM > > --Noble > > > > On Wed, Jun 25, 2008 at 8:05 AM, Shalin Shekhar Mangar > > <[EMAIL PROTECTED]> wrote: > >> Setting the batchSize to 1 would mean that the Jdbc driver will keep > >> 1 rows in memory *for each entity* which uses that data source (if > >> correctly implemented by the driver). Not sure how well the Sql Server > >> driver implements this. Also keep in mind that Solr also needs memory to > >> index documents. You can probably try setting the batch size to a lower > >> value. > >> > >> The regular memory tuning stuff should apply here too -- try disabling > >> autoCommit and turn-off autowarming and see if it helps. > >> > >> On Wed, Jun 25, 2008 at 5:53 AM, wojtekpia <[EMAIL PROTECTED]> > wrote: > >> > >>> > >>> I'm trying to load ~10 million records into Solr using the > >>> DataImportHandler. > >>> I'm running out of memory (java.lang.OutOfMemoryError: Java heap space) > >>> as > >>> soon as I try loading more than about 5 million records. > >>> > >>> Here's my configuration: > >>> I'm connecting to a SQL Server database using the sqljdbc driver. I've > >>> given > >>> my Solr instance 1.5 GB of memory. I have set the dataSource batchSize > >>> to > >>> 1. My SQL query is "select top XXX field1, ... from table1". I have > >>> about 40 fields in my Solr schema. > >>> > >>> I thought the DataImportHandler would stream data from the DB rather > >>> than > >>> loading it all into memory at once. Is that not the case? Any thoughts > >>> on > >>> how to get around this (aside from getting a machine with more memory)? > >>> > >>> -- > >>> View this message in context: > >>> > http://www.nabble.com/DataImportHandler-running-out-of-memory-tp18102644p18102644.html > >>> Sent from the Solr - User mailing list archive at Nabble.com. > >>> > >>> > >> > >> > >> -- > >> Regards, > >> Shalin Shekhar Mangar. 
> >> > > > > > > > > -- > > --Noble Paul > > > > > > -- > View this message in context: > http://www.nabble.com/DataImportHandler-running-out-of-memory-tp18102644p18115900.html > Sent from the Solr - User mailing list archive at Nabble.com. > > -- Regards, Shalin Shekhar Mangar.
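A DataImportHandler dataSource URL carrying the responseBuffering property Shalin describes might look roughly like this; the server, database and credentials are placeholders.

  <dataSource type="JdbcDataSource"
              driver="com.microsoft.sqlserver.jdbc.SQLServerDriver"
              url="jdbc:sqlserver://localhost;databaseName=mydb;responseBuffering=adaptive"
              user="solr" password="secret"/>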
Re: "document commit" possible?
: With the understanding that queries for newly indexed fields in this document : will not return this newly added document, but a query for the document by its : id will return any new stored fields. When the "real" commit (read: the commit : that takes 10 minutes to complete) returns the newly indexed fields will be : query-able. this would be ... non-trivial. Right now stored fields and indexed fields are all handled using Lucene. I seem to recall some discussions on solr-dev a while back of adding some alternate field store mechanisms, in which case i can *imagine* a field store that could be read immediately on "add" ... but it's just theoretical. -Hoss
Re: Nutch <-> Solr latest?
: Im curious, is there a spot / patch for the latest on Nutch / Solr : integration, Ive found a few pages (a few outdated it seems), it would be nice : (?) if it worked as a DataSource type to DataImportHandler, but not sure if : that fits w/ how it works. Either way a nice contrib patch the way the DIH is : already setup would be nice to have. ... : Is there currently work ongoing on this? Seems like it belongs in either / or : project and not both. My understanding is that previous work on bridging Nutch crawling with Solr indexing involved patching Nutch and using a Nutch-specific schema.xml and the client code which has since been committed as "SolrJ". Most of the discussion seemed to take place on the Nutch list (which makes sense since Nutch required the patching), so you may want to start there. I'm not sure if Nutch integration would make sense as a DIH plugin (it seems like the Nutch crawler could "push" the data much more easily than DIH could pull it from the crawler), but if there is any advantage to having plugin code running in Solr to support this then that would absolutely make sense in the new /contrib area of Solr (that i believe Otis already created/committed), but any Nutch "plugins" or modifications would obviously need to be made in Nutch. -Hoss
NGramTokenizer issue
Hi, I've been trying to use the NGramTokenizer and I ran into a problem. It seems like Solr is trying to match documents with all the tokens that the analyzer returns from the query term. So if I index a document with a title field with the value "nice dog" and search for "dog" (where the NGramTokenizer is defined to generate tokens of min 2 and max 2) I won't get any results. I can see in the Analysis tool that the tokenizer generates the right tokens, but then when Solr searches it tries to match the exact phrase instead of the tokens. I tried the same in Lucene and it works as expected, so it seems to be a Solr issue. Any hint on where I should look in order to fix it?

Here is the Lucene code that I used to test the behavior of the Lucene NGramTokenizer:

  public static void main(String[] args) throws ParseException, CorruptIndexException, LockObtainFailedException, IOException {
      Analyzer n = new Analyzer() {
          @Override
          public TokenStream tokenStream(String s, Reader reader) {
              TokenStream result = new NGramTokenizer(reader, 2, 2);
              result = new LowerCaseFilter(result);
              return result;
          }
      };

      IndexWriter writer = new IndexWriter("sample_index", n);
      Document doc = new Document();
      Field f = new Field("title", new StringReader("nice dog"));
      doc.add(f);
      writer.addDocument(doc);
      writer.close();

      IndexSearcher is = new IndexSearcher("sample_index");
      QueryParser qp = new QueryParser("", n);
      Query parse = qp.parse("title:dog");
      Hits hits = is.search(parse);
      System.out.println(hits.length());
      System.out.println(parse.toString());
  }

Thanks!!! Jonathan
Re: Can I add field compression without reindexing?
On 24-Jun-08, at 4:26 PM, Chris Harris wrote: I have an index that I eventually want to rebuild so I can set compressed=true on a couple of fields. It's not really practical to rebuild the whole thing right now, though. If I change my schema.xml to set compressed=true and then keep adding new data to the existing index, will this corrupt the index, or will the *new* data be stored in compressed format, even while the old data is not compressed? Hi Chris, Yes, this should work without problems. cheers, -Mike
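For reference, the schema.xml attribute in question looks like this (the field name and type here are only examples); as Mike notes, only documents added after the change will actually be stored compressed.

  <field name="body" type="text" indexed="true" stored="true" compressed="true"/>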
Re: DataImportHandler running out of memory
It looks like that was the problem. With responseBuffering=adaptive, I'm able to load all my data using the sqljdbc driver.
Sorting questions
Hi, I have the same issue as described in: http://www.nabble.com/solr-sorting-question-td17498596.html. I am trying to have some categories before others in search results for different search terms. For example, for search term "ABC", I want to show Category "CCC" first, then Category "BBB", "AAA", "DDD"; and for search term "CBA", I want to show Category "DDD" first, then Category "CCC", "AAA", "BBB"... Is this possible in Solr? Has someone done this before? Any help will be appreciated. Thanks, Yugang
Re: Lucene 2.4-dev source ?
Note, also, that the Manifest file in the JAR has information about the exact SVN revision so that you can check it out from there. On Jun 25, 2008, at 12:37 PM, Yonik Seeley wrote: trunk is the latest version (which is currently 2.4-dev). http://svn.apache.org/viewvc/lucene/java/trunk/ There is a contrib directory with things not in lucene-core: http://svn.apache.org/viewvc/lucene/java/trunk/contrib/ -Yonik
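If you'd rather read that Manifest programmatically than unzip the jar, a small sketch follows; the jar path is whatever nightly build you downloaded, and the revision Grant mentions should appear among the printed attributes (typically an Implementation-* entry).

  import java.util.Map;
  import java.util.jar.JarFile;
  import java.util.jar.Manifest;

  public class ShowManifest {
      public static void main(String[] args) throws Exception {
          // e.g. java ShowManifest lucene-core-2.4-dev.jar
          JarFile jar = new JarFile(args[0]);
          Manifest mf = jar.getManifest();
          // Print every main manifest attribute, including version/revision info.
          for (Map.Entry<Object, Object> e : mf.getMainAttributes().entrySet()) {
              System.out.println(e.getKey() + ": " + e.getValue());
          }
          jar.close();
      }
  }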
Re: Lucene 2.4-dev source ?
On Wed, 25 Jun 2008 20:22:06 -0400 Grant Ingersoll <[EMAIL PROTECTED]> wrote: > Note, also, that the Manifest file in the JAR has information about > the exact SVN revision so that you can check it out from there. > > > On Jun 25, 2008, at 12:37 PM, Yonik Seeley wrote: > > > trunk is the latest version (which is currently 2.4-dev). > > http://svn.apache.org/viewvc/lucene/java/trunk/ > > > > There is a contrib directory with things not in lucene-core: > > http://svn.apache.org/viewvc/lucene/java/trunk/contrib/ > > > > -Yonik > Great stuff, thanks Yonik, Grant!! _ {Beto|Norberto|Numard} Meijome "There is no limit to what a man can do or how far he can go if he doesn't mind who gets the credit." Robert Woodruff I speak for myself, not my employer. Contents may be hot. Slippery when wet. Reading disclaimers makes you go blind. Writing them is worse. You have been Warned.
Re: NGramTokenizer issue
On Wed, 25 Jun 2008 15:37:09 -0300 "Jonathan Ariel" <[EMAIL PROTECTED]> wrote: > I've been trying to use the NGramTokenizer and I ran into a problem. > It seems like solr is trying to match documents with all the tokens that the > analyzer returns from the query term. So if I index a document with a title > field with the value "nice dog" and search for "dog" (where the > NGramtokenizer is defined to generate tokens of min 2 and max 2) I won't get > any results. Hi Jonathan, I don't have the expertise yet to have gone straight into testing code with lucene, but my 'black box' testing with ngramtokenizer seems to agree with what you found - see my latest posts over the last couple of days about this. Have you tried searching for 'do' or 'ni' or any search term with size = minGramSize ? I've found that Solr matches results just fine then. > I can see in the Analysis tool that the tokenizer generates the right > tokens, but then when solr searches it tries to match the exact Phrase > instead of the tokens. +1 B _ {Beto|Norberto|Numard} Meijome "Some cause happiness wherever they go; others, whenever they go." Oscar Wilde I speak for myself, not my employer. Contents may be hot. Slippery when wet. Reading disclaimers makes you go blind. Writing them is worse. You have been Warned.
Solr 1.3 deletes not working?
Hi everyone, I'm having trouble deleting documents from my solr 1.3 index. To delete a document, I post something like "<delete><id>12345</id></delete>" to the solr server, then issue a commit. However, I can still find the document in the index via the query "id:12345". The document remains visible even after I restart the solr server. I know the server is receiving my delete commands, since deletesById goes up on the stats page, but docsDeleted stays at 0. I've tried this with svn revisions 661499 and 671649, with the same results, but these steps worked fine in solr 1.2. Any ideas? - Galen Pahlke
Re: Solr 1.3 deletes not working?
On Wed, Jun 25, 2008 at 8:44 PM, Galen Pahlke <[EMAIL PROTECTED]> wrote: > I'm having trouble deleting documents from my solr 1.3 index. To delete a > document, I post something like "12345" to the > solr server, then issue a commit. However, I can still find the document in > the index via the query "id:12345". That's strange there are unit tests for this, and I just verified it worked on the example data. Perhaps the schema no longer matches what you indexed (or did you re-index?) Make sure the uniqueKeyField specifies "id". > The document remains visible even after > I restart the solr server. I know the server is receiving my delete > commands, since deletesById goes up on the stats page, but docsDeleted stays > at 0. docsDeleted is no longer tracked since Lucene now handles the document overwriting itself. It should probably be removed. -Yonik
Re: Sorting questions
It's not exactly what you want, but putting specific documents first for certain queries has been done via http://wiki.apache.org/solr/QueryElevationComponent -Yonik On Wed, Jun 25, 2008 at 6:58 PM, Yugang Hu <[EMAIL PROTECTED]> wrote: > Hi, > > I have the same issue as described in: > http://www.nabble.com/solr-sorting-question-td17498596.html. I am trying to > have some categories before others in search results for different search > terms. For example, for search team "ABC", I want to show Category "CCC" > first, then Category "BBB", "AAA", "DDD" and for search team "CBA", I > want to show Category "DDD" first, then Category "CCC", "AAA", "BBB"... > Is this possible in Solr? Has someone done this before? > > Any help will be appreciated. > > Thanks, > > Yugang > > >
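For a rough idea of how that component is wired, its elevation file maps query text to the documents to force to the top; the ids below are placeholders for whatever uniqueKey values the Category "CCC"/"DDD" documents actually have.

  <elevate>
    <query text="ABC">
      <doc id="ccc-doc-1"/>
      <doc id="ccc-doc-2"/>
    </query>
    <query text="CBA">
      <doc id="ddd-doc-1"/>
    </query>
  </elevate>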
Re: Solr 1.3 deletes not working?
I originally tested with an index generated by solr 1.2, but when that didn't work, I rebuilt the index from scratch. >From my schema.xml: . . id -Galen Pahlke On Wed, Jun 25, 2008 at 7:00 PM, Yonik Seeley <[EMAIL PROTECTED]> wrote: > On Wed, Jun 25, 2008 at 8:44 PM, Galen Pahlke <[EMAIL PROTECTED]> wrote: > > I'm having trouble deleting documents from my solr 1.3 index. To delete > a > > document, I post something like "12345" to the > > solr server, then issue a commit. However, I can still find the document > in > > the index via the query "id:12345". > > That's strange there are unit tests for this, and I just verified > it worked on the example data. > Perhaps the schema no longer matches what you indexed (or did you > re-index?) > Make sure the uniqueKeyField specifies "id". > > > The document remains visible even after > > I restart the solr server. I know the server is receiving my delete > > commands, since deletesById goes up on the stats page, but docsDeleted > stays > > at 0. > > docsDeleted is no longer tracked since Lucene now handles the document > overwriting itself. > It should probably be removed. > > -Yonik >
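(The XML markup in the schema excerpt above was stripped in transit; for an integer id it presumably read roughly like the following, with the attribute values being a guess rather than the poster's exact configuration.)

  <field name="id" type="integer" indexed="true" stored="true" required="true"/>
  <!-- ... other fields ... -->
  <uniqueKey>id</uniqueKey>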
Re: Solr 1.3 deletes not working?
On Wed, Jun 25, 2008 at 9:34 PM, Galen Pahlke <[EMAIL PROTECTED]> wrote: > I originally tested with an index generated by solr 1.2, but when that > didn't work, I rebuilt the index from scratch. > From my schema.xml: > > > . >required="true"/> > . > > > id I tried this as well... changing the example schema id type to integer, adding a document and deleting it. Everything worked fine. Something to watch out for: when you indexed the data, could it have had spaces in the id or something? If you can't figure it out, try reproducing it in a simple example that can be added to a JIRA issue. -Yonik
Re: NGramTokenizer issue
Well, it is working if I search just two letters, but that just tells me that something is wrong somewhere. The Analysis tools is showing me how "dog" is being tokenized to "do og", so if when indexing and querying I'm using the same tokenizer/filters (which is my case) I should get results even when searching "dog". I've just created a small unit test in solr to try that out. public void testNGram() throws IOException, Exception { assertU("adding doc with ngram field",adoc("id", "42", "text_ngram", "nice dog")); assertU("commiting",commit()); assertQ("test query, expect one document", req("text_ngram:dog") ,"//[EMAIL PROTECTED]'1']" ); } As you can see I'm adding a document with the field text_ngram with the value "nice dog". Then I commit it and query it for "text_ngram:dog". text_ngram is defined in the schema as: This test passes. That means that I am able to get results when searching "dog" on a ngram field, where min and max are set to 2 and where the value of that field is "nice dog". So it doesn't seems to be a issue in solr, although I am having this error when using solr outside the unit test. It seems very improbable to think on an environment issue. Maybe I am doing something wrong. Any thoughts on that? Thanks! Jonathan On Wed, Jun 25, 2008 at 9:44 PM, Norberto Meijome <[EMAIL PROTECTED]> wrote: > On Wed, 25 Jun 2008 15:37:09 -0300 > "Jonathan Ariel" <[EMAIL PROTECTED]> wrote: > > > I've been trying to use the NGramTokenizer and I ran into a problem. > > It seems like solr is trying to match documents with all the tokens that > the > > analyzer returns from the query term. So if I index a document with a > title > > field with the value "nice dog" and search for "dog" (where the > > NGramtokenizer is defined to generate tokens of min 2 and max 2) I won't > get > > any results. > > Hi Jonathan, > I don't have the expertise yet to have gone straight into testing code with > lucene, but my 'black box' testing with ngramtokenizer seems to agree with > what > you found - see my latest posts over the last couple of days about this. > > Have you tried searching for 'do' or 'ni' or any search term with size = > minGramSize ? I've found that Solr matches results just fine then. > > > I can see in the Analysis tool that the tokenizer generates the right > > tokens, but then when solr searches it tries to match the exact Phrase > > instead of the tokens. > > +1 > > B > > _ > {Beto|Norberto|Numard} Meijome > > "Some cause happiness wherever they go; others, whenever they go." > Oscar Wilde > > I speak for myself, not my employer. Contents may be hot. Slippery when > wet. > Reading disclaimers makes you go blind. Writing them is worse. You have > been > Warned. >
Re: NGramTokenizer issue
On Thu, 26 Jun 2008 10:44:32 +1000 Norberto Meijome <[EMAIL PROTECTED]> wrote: > On Wed, 25 Jun 2008 15:37:09 -0300 > "Jonathan Ariel" <[EMAIL PROTECTED]> wrote: > > > I've been trying to use the NGramTokenizer and I ran into a problem. > > It seems like solr is trying to match documents with all the tokens that the > > analyzer returns from the query term. So if I index a document with a title > > field with the value "nice dog" and search for "dog" (where the > > NGramtokenizer is defined to generate tokens of min 2 and max 2) I won't get > > any results. > > Hi Jonathan, > I don't have the expertise yet to have gone straight into testing code with > lucene, but my 'black box' testing with ngramtokenizer seems to agree with > what > you found - see my latest posts over the last couple of days about this. > > Have you tried searching for 'do' or 'ni' or any search term with size = > minGramSize ? I've found that Solr matches results just fine then. hi there, I did some more tests with nGramTokenizer ... Summary : 5 tests are shown below, 4 work as expected, 1 fails. In particular, this failure is when searching , on a field using the NGramTokenizerFactory, with minGramSize != maxGramSize ,and length(q) > minGramSize. I've reproduced it with many several variations of minGramSize and length(q) and terms, both in stored field and query.. My setup: 1.3 nightly code from 2008-06-25, FreeBSD 7, JDK 1.6, Jetty from sample app. my documents are loaded via csv, 1 field copied with fieldCopy to all the artist_ngram variants. Relevant data loaded into documents : "nice dog" "the nice dog canine" "Triumph The Insult Comic Dog". the id field is the same data as string. I am searching directly on the field with q=field:query , qt=standard. After each schema or solrconfig change, i stop the service, delete data directory, start server and post the docs again. -- - Test 1: OK http://localhost:8983/solr/_test_/select?q=artist_ngram2:dog&debugQuery=true&qt=standard returns all 3 docs as expected. If i understood your mail correctly Jonathan, you aren't getting results ? Test 2 : OK http://localhost:8983/solr/_test_/select?q=artist_ngram:dog&debugQuery=true&qt=standard returns 0 documents as expected. artist_ngram has 4 letters per token, we gave it 3. Same result when searching on artist_var_ngram field for same reasons. Test 3: OK http://localhost:8983/solr/_test_/select?q=artist_ngram2:insul&debugQuery=true&qt=standard Returns 1 doc , "Triumph The Insult Comic Dog" as expected. query gets tokenized into 2-letter tokens and match tokens in index. same result when searching on artist_ngram field , same reasons (except that we get 4 char tokens out of the 5 char query) Test 4 : FAIL!! http://localhost:8983/solr/_test_/select?q=artist_var_ngram:insul&debugQuery=true&qt=standard Returns 0 docs. I think it should have matched the same doc as in Test 3, because the query would be tokenized into 4 and 5 char tokens - all of which are included in the index as the field is tokenized with all the range between 2 and 10 chars. Using Luke (the java app, not the filter), the field shows the tokens shown after my signature. Using analysis.jsp, it shows that we should get a match in several tokens. The query is parsed as follows : [..] artist_var_ngram:insul artist_var_ngram:insul − PhraseQuery(artist_var_ngram:"in ns su ul ins nsu sul insu nsul insul") − artist_var_ngram:"in ns su ul ins nsu sul insu nsul insul" OldLuceneQParser [...] 
Test 5 : OK http://localhost:8983/solr/_test_/select?q=artist_var_ngram:ul&debugQuery=true&qt=standard Searching for a query which won't be tokenized further (ie, its length = minGramSize), it works as expected. It seems to me there is a problem with matching on fields where minGramSize != maxGramSize . I don't know enough to point to the cause. In the meantime, I am creating multiple n-gram fields, with growing sizes, min == max, and using dismax across the lot... not pretty, but it'll do until I understand why 'Test 4' isn't working. Please let me know if any more info / tests are needed. Or if I should open an issue in JIRA. cheers, B _ {Beto|Norberto|Numard} Meijome "A dream you dream together is reality." John Lennon I speak for myself, not my employer. Contents may be hot. Slippery when wet. Reading disclaimers makes you go blind. Writing them is worse. You have been Warned. - tokens for test4 in index, as per luke, field artist_var_ngram tr, ri, iu, um, mp, ph, h , t, th, he, e , i, in, ns, su, ul, lt, t , c, co, om, mi, ic, c , d, do, og, tri, riu, ium, ump, mph, ph , h t, th, the, he , e i, in, ins, nsu, sul, ult, lt , t c, co, com, omi, mic, ic , c d, do, dog, t
Re: DataImportHandler running out of memory
We must document this information in the wiki. We never had a chance to play w/ ms sql server --Noble On Thu, Jun 26, 2008 at 12:38 AM, wojtekpia <[EMAIL PROTECTED]> wrote: > > It looks like that was the problem. With responseBuffering=adaptive, I'm able > to load all my data using the sqljdbc driver. > -- > View this message in context: > http://www.nabble.com/DataImportHandler-running-out-of-memory-tp18102644p18119732.html > Sent from the Solr - User mailing list archive at Nabble.com. > > -- --Noble Paul
Re: NGramTokenizer issue
Ok. Played a bit more with that. So I had a difference between my unit test and solr. In solr I'm actually using a solr.RemoveDuplicatesTokenFilterFactory when querying. Tried to add that to the test, and it fails. So in my case I think the error is trying to use a solr.RemoveDuplicatesTokenFilterFactory with a solr.NGramTokenizerFactory. I don't know why using solr.RemoveDuplicatesTokenFilterFactory generates "do og dog" for "dog" when not using it will just generate "do og". Either way I think that when using ngram I shouldn't use RemoveDuplicatesTokenFilterFactory. Removing duplicates might change the structure of the word. On Thu, Jun 26, 2008 at 12:25 AM, Jonathan Ariel <[EMAIL PROTECTED]> wrote: > Well, it is working if I search just two letters, but that just tells me > that something is wrong somewhere. > The Analysis tools is showing me how "dog" is being tokenized to "do og", > so if when indexing and querying I'm using the same tokenizer/filters (which > is my case) I should get results even when searching "dog". > > I've just created a small unit test in solr to try that out. > > public void testNGram() throws IOException, Exception { > assertU("adding doc with ngram field",adoc("id", "42", > "text_ngram", "nice dog")); > assertU("commiting",commit()); > > assertQ("test query, expect one document", > req("text_ngram:dog") > ,"//[EMAIL PROTECTED]'1']" > ); > } > > As you can see I'm adding a document with the field text_ngram with the > value "nice dog". > Then I commit it and query it for "text_ngram:dog". > > text_ngram is defined in the schema as: > > > minGramSize="2" /> > > > > minGramSize="2" /> > > > > > This test passes. That means that I am able to get results when searching > "dog" on a ngram field, where min and max are set to 2 and where the value > of that field is "nice dog". > So it doesn't seems to be a issue in solr, although I am having this error > when using solr outside the unit test. It seems very improbable to think on > an environment issue. > > Maybe I am doing something wrong. Any thoughts on that? > > Thanks! > > Jonathan > > > On Wed, Jun 25, 2008 at 9:44 PM, Norberto Meijome <[EMAIL PROTECTED]> > wrote: > >> On Wed, 25 Jun 2008 15:37:09 -0300 >> "Jonathan Ariel" <[EMAIL PROTECTED]> wrote: >> >> > I've been trying to use the NGramTokenizer and I ran into a problem. >> > It seems like solr is trying to match documents with all the tokens that >> the >> > analyzer returns from the query term. So if I index a document with a >> title >> > field with the value "nice dog" and search for "dog" (where the >> > NGramtokenizer is defined to generate tokens of min 2 and max 2) I won't >> get >> > any results. >> >> Hi Jonathan, >> I don't have the expertise yet to have gone straight into testing code >> with >> lucene, but my 'black box' testing with ngramtokenizer seems to agree with >> what >> you found - see my latest posts over the last couple of days about this. >> >> Have you tried searching for 'do' or 'ni' or any search term with size = >> minGramSize ? I've found that Solr matches results just fine then. >> >> > I can see in the Analysis tool that the tokenizer generates the right >> > tokens, but then when solr searches it tries to match the exact Phrase >> > instead of the tokens. >> >> +1 >> >> B >> >> _ >> {Beto|Norberto|Numard} Meijome >> >> "Some cause happiness wherever they go; others, whenever they go." >> Oscar Wilde >> >> I speak for myself, not my employer. Contents may be hot. Slippery when >> wet. 
>> Reading disclaimers makes you go blind. Writing them is worse. You have >> been >> Warned. >> > >
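To make the failing combination concrete, the query-side analyzer chain being described presumably looked something like the sketch below (the surrounding fieldType declaration is omitted); dropping the last filter is what made the queries match again in these tests.

  <analyzer type="query">
    <tokenizer class="solr.NGramTokenizerFactory" minGramSize="2" maxGramSize="2"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- removing this filter restored matches in the tests described above -->
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>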
Re: NGramTokenizer issue
On Thu, 26 Jun 2008 01:15:34 -0300 "Jonathan Ariel" <[EMAIL PROTECTED]> wrote: > Ok. Played a bit more with that. > So I had a difference between my unit test and solr. In solr I'm actually > using a solr.RemoveDuplicatesTokenFilterFactory when querying. Tried to add > that to the test, and it fails. > So in my case I think the error is trying to use a > solr.RemoveDuplicatesTokenFilterFactory with a solr.NGramTokenizerFactory. I > don't know why using solr.RemoveDuplicatesTokenFilterFactory generates "do > og dog" for "dog" when not using it will just generate "do og". > Either way I think that when using ngram I shouldn't use > RemoveDuplicatesTokenFilterFactory. Removing duplicates might change the > structure of the word. Hi Jonathan, My apologies, i found the issue with removeDuplicates late last night and I forgot to mention it. The 5 tests i included in my other email don't use removeDuplicate for this reason. I am still interested to know why one of them is failing, when analysis.jsp + common_sense ;) say it should. B _ {Beto|Norberto|Numard} Meijome Q. How do you make God laugh? A. Tell him your plans. I speak for myself, not my employer. Contents may be hot. Slippery when wet. Reading disclaimers makes you go blind. Writing them is worse. You have been Warned.