Strange missing docs when reindexing with threads.
Hi all! I'm using Solr 1.3 and currently testing reindexing. In my client app, I am sending 17494 requests to add documents, in 3 different scenarios: a) not using threads, b) using 1 thread, c) using 2 threads.

In scenario a), everything seems to work fine. In my client log, I see 17494 requests sent to Solr; in Solr's log, I see the same number of 'add' requests received; and if I search the index, I can see the same number of documents.

However, if I use 1 thread, I see the right number of requests in the logs, but I only find 15k or so documents (this varies a bit every time I run this scenario). It gets way worse if I use 2 threads: I can see the right number of requests in both logs, but I end up with ~600 docs in the index! In all scenarios, I don't see any errors in the logs.

As you can imagine, I need to be able to use multiple threads to speed up the process. It is also very concerning that I don't get any errors anywhere. Looking at Solr's admin stats, I also see 17494 cumulative adds, but only a tiny fraction of the actual documents can be found.

Any clues? BTW, these indexers work fine if I use Lucene directly. Thanks in advance for all your help!
Re: Strange missing docs when reindexing with threads.
Right after I sent the email I went on and checked for uniqueness of documents. In theory they were all supposed to be unique. But I've realized that the platform I'm using to reindex is delaying sending the requests. This, in combination with my reindexers reusing document fields (instead of creating new instances, to save on GC), led to the same document being sent many times with invalid data. I am fairly sure now that this is the source of my problem.

My reindexers originally used LuceneWriter directly, which blocks thread execution until the document is added to the index. The new framework I'm using uses messaging, which releases control back to the thread before the documents are actually sent to be indexed. My threads update the document fields in the meantime, so the data written to the index is in transition and invalid.

I've adjusted my reindexing threads to ensure new instances of everything are used. I will test it shortly. But you point out exactly why I have fewer documents than 'add' requests. Thanks!

Shalin Shekhar Mangar wrote (in reply to Alexander Wallace's message of Fri, Jun 12, 2009, 11:40 PM): What is the uniqueKey in your schema.xml? Is it possible that those 17494 documents have a common uniqueKey and are therefore getting overwritten?
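To illustrate the question about uniqueKey: when several adds carry the same key, each add is counted, but a later one replaces the earlier one, so the document count ends up lower than the add count. The following is only a toy model of that behaviour using a plain map. It is not Solr code, and the class name, keys, and counts are made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy model of an index keyed by uniqueKey. This is NOT Solr code.
final class UniqueKeyOverwriteSketch {
    private final Map<String, String> docsByKey = new LinkedHashMap<>();
    private int cumulativeAdds = 0;

    // Every call counts as an add, but an existing key is replaced, not duplicated.
    void add(String uniqueKey, String body) {
        cumulativeAdds++;
        docsByKey.put(uniqueKey, body);
    }

    int cumulativeAdds() {
        return cumulativeAdds;
    }

    int numDocs() {
        return docsByKey.size();
    }

    public static void main(String[] args) {
        UniqueKeyOverwriteSketch index = new UniqueKeyOverwriteSketch();
        for (int i = 0; i < 10; i++) {
            // Hypothetical bug: only 3 distinct keys across 10 adds.
            index.add("doc-" + (i % 3), "body " + i);
        }
        // Prints: adds=10 docs=3
        System.out.println("adds=" + index.cumulativeAdds() + " docs=" + index.numDocs());
    }
}
```

This matches the symptom in the thread: the add counter and the request logs agree, no error is raised, and only the number of searchable documents is off.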
Re: Strange missing docs when reindexing with threads.
That was exactly my issue... I changed my code to not reuse document/fields and it is all good now! Thanks for your support!
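For anyone hitting the same thing, here is a minimal sketch of why reusing a mutable document breaks once sending is deferred. It assumes a simple in-memory outbox standing in for the messaging framework. No Solr or messaging library is involved, and all names are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

// Toy model of deferred sending. The outbox stands in for a messaging layer.
final class DeferredSendSketch {

    // Queues n documents, "sends" them afterwards, and returns how many
    // distinct ids are seen at send time.
    static int distinctIdsSent(int n, boolean reuseInstance) {
        Queue<Map<String, String>> outbox = new ArrayDeque<>();
        Map<String, String> shared = new HashMap<>();

        for (int i = 0; i < n; i++) {
            // Reuse mutates the single shared map; otherwise a fresh map is built.
            Map<String, String> doc = reuseInstance ? shared : new HashMap<>();
            doc.put("id", "doc-" + i);
            outbox.add(doc); // control returns before anything is actually sent
        }

        // The deferred send reads each queued document only now.
        Set<String> ids = new HashSet<>();
        for (Map<String, String> doc : outbox) {
            ids.add(doc.get("id"));
        }
        return ids.size();
    }

    public static void main(String[] args) {
        System.out.println("reused instance: " + distinctIdsSent(1000, true));
        System.out.println("fresh instances: " + distinctIdsSent(1000, false));
    }
}
```

In this sketch the reused instance always collapses to a single id because everything is sent after the loop finishes. In the real setup the sends interleaved with the updates, which would explain why the surviving document count varied from run to run.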
Popular keywords statistics.
Hi all! I'd like to know if there is anything built into Solr that keeps track of the keywords being searched and keeps statistics on them. If not, and/or in any case, I'd like to hear what approaches users are taking to find out what people are searching for in their apps. Thanks in advance!
Re: Popular keywords statistics.
Indeed, that was one of the first approaches... Thanks a lot!

Michael Ludwig wrote:

Wallace wrote: "I'd like to hear what approaches are being used by users to know what people are searching for in their apps."

You could process the access log. You could write a filter servlet logging the relevant part of the query string to a dedicated location.

Michael Ludwig
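As a rough sketch of the access-log approach: pull the q parameter out of each request line, decode it, and count occurrences. The log line format, the sample paths, and the class name below are assumptions for illustration. A real access log would need its own parsing, and malformed percent-escapes would make URLDecoder throw.

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Counts search queries found in request lines of an (assumed) access log.
final class QueryLogStatsSketch {
    // Matches the q parameter, e.g. in "GET /solr/select?q=ipod&rows=10 HTTP/1.1".
    private static final Pattern Q_PARAM = Pattern.compile("[?&]q=([^&\\s]*)");

    static Map<String, Integer> countQueries(List<String> logLines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : logLines) {
            Matcher m = Q_PARAM.matcher(line);
            if (!m.find()) {
                continue; // not a search request
            }
            // Decode "+" and percent-escapes, then normalize case and whitespace.
            String query = URLDecoder.decode(m.group(1), StandardCharsets.UTF_8)
                    .trim()
                    .toLowerCase(Locale.ROOT);
            if (!query.isEmpty()) {
                counts.merge(query, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up sample lines, not taken from a real log.
        List<String> sample = List.of(
                "GET /solr/select?q=ipod&rows=10 HTTP/1.1",
                "GET /solr/select?q=solr+cluster HTTP/1.1",
                "GET /solr/select?rows=5&q=iPod HTTP/1.1",
                "GET /solr/admin/ping HTTP/1.1");
        // Prints: {ipod=2, solr cluster=1}
        System.out.println(countQueries(sample));
    }
}
```

The same counting logic could sit behind a filter servlet instead, writing each query to a dedicated location as it arrives rather than parsing the log afterwards.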
Solr cluster topology.
Hi All! I just started reading about Solr a couple of days ago (not full time, of course) and it looks like a pretty impressive set of technologies. I still have a few questions whose answers I have not clearly found.

On a cluster, as I understand it, one and only one machine is a master, and N servers can be slaves. Do the clients all talk to the master for indexing and to a load balancer for searching?

Is one particular machine configured to know it is the master? Or is it only the settings for replicating the index that matter? Or does one post reindex requests to any of the slaves, which then forward them to the master?

How can we have failover for the master?

Is it a correct assumption that slaves could always be a bit out of sync with the master? A matter of minutes, perhaps...

Thanks in advance for your responses!
Re: Solr cluster topology.
Thanks for the response! Interesting, this ALL MASTERS mode... I guess you don't do any replication then... In the single master, several slaves mode, I'm assuming the client still writes to one and reads from the others... right?

On Nov 20, 2007, at 12:54 PM, Matthew Runo wrote (in reply to Alexander Wallace's message of Nov 20, 2007, 7:43 AM): Yes. The clients will always be a minute or two behind the master. I like the way some people are doing it - make them all masters! Just post your updates to each of them - you lose a bit of performance perhaps, but it doesn't matter if a server bombs out or you have to upgrade them, since they're all exactly the same. --Matthew
Re: Solr cluster topology.
Thanks a lot for your responses! They were all very helpful!

On Nov 20, 2007, at 5:52 PM, Norberto Meijome wrote:

On Tue, 20 Nov 2007 16:26:27 -0600, Alexander Wallace wrote: "Interesting, this ALL MASTERS mode... I guess you don't do any replication then..."

Correct.

"In the single master, several slaves mode, I'm assuming the client still writes to one and reads from the others... right?"

Correct again. There is also another approach, which I think in Solr is called FederatedSearch, where a front end queries a number of index servers (each with overlapping or non-overlapping data sets) and puts together one result stream for the answer. There was some discussion on the list; http://www.mail-archive.com/solr- [EMAIL PROTECTED]/msg06081.html is the earliest link in the archive I can find.
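To make the single-master, several-slaves arrangement concrete, here is a small client-side sketch: updates always go to the master, and queries are spread over the slaves in round-robin order. The class, method names, and host URLs are hypothetical. In practice the read side would more likely be a load balancer in front of the slaves, as mentioned in the original question.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Client-side routing sketch: writes go to the master, reads rotate over the slaves.
final class MasterSlaveRouterSketch {
    private final String master;
    private final List<String> slaves;
    private final AtomicInteger next = new AtomicInteger();

    MasterSlaveRouterSketch(String master, List<String> slaves) {
        if (slaves.isEmpty()) {
            throw new IllegalArgumentException("need at least one slave");
        }
        this.master = master;
        this.slaves = List.copyOf(slaves);
    }

    // All add/commit requests are posted to the single master.
    String updateTarget() {
        return master;
    }

    // Searches rotate over the slaves; floorMod keeps the index valid after int overflow.
    String queryTarget() {
        int i = Math.floorMod(next.getAndIncrement(), slaves.size());
        return slaves.get(i);
    }

    public static void main(String[] args) {
        // Hypothetical host names.
        MasterSlaveRouterSketch router = new MasterSlaveRouterSketch(
                "http://master.example:8983/solr",
                List.of("http://slave1.example:8983/solr", "http://slave2.example:8983/solr"));
        System.out.println("update -> " + router.updateTarget());
        for (int i = 0; i < 3; i++) {
            System.out.println("query  -> " + router.queryTarget());
        }
    }
}
```

With the "all masters" variant described earlier in the thread, the update side would instead post each update to every server, and any of them could serve queries.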