from:"Alexander Wallace"

Strange missing docs when reindexing with threads.

2009-06-12 Thread Alexander Wallace


Hi all!

I'm using Solr 1.3 and currently testing reindexing...

In my client app, I am sending 17494 requests to add documents, in 3 
different scenarios:


a) not using threads
b) using 1 thread
c) using 2 threads

In scenario a), everything seems to work fine... In my client log, I 
see 17494 requests sent to Solr; in Solr's log, I see the same number of 
'add' requests received; and if I search the index, I can see the same 
number of documents.


However, if I use 1 thread, I see the right number of requests in the 
logs, but I only find 15k or so documents (this varies a bit every time 
I run this scenario).


It gets way worse if I use 2 threads... I can see the right number of 
requests in both logs, but I end up with ~600 docs in the index!


In all scenarios, I don't see any errors in the logs...

As you can imagine, I need to be able to use multiple threads to speed 
up the process... It is also very concerning that I don't get any 
errors anywhere...


Looking at Solr's admin stats, I also see 17494 cumulative adds, but 
only a tiny fraction of actual documents can be found...


Any clues?

BTW, these indexers work fine if I use Lucene directly...

Thanks in advance for all your help!

Re: Strange missing docs when reindexing with threads.

2009-06-12 Thread Alexander Wallace

Right after I sent the email I went on and checked for uniqueness of 
documents...


In theory they were all supposed to be unique... But I've realized that 
the platform I'm using to reindex is delaying sending the requests. 
This, in combination with my reindexers reusing document fields (instead 
of creating new instances, to save on GC), leads to the same document 
being sent many times with invalid data...


I am fairly sure now that this is the source of my problem... My 
reindexers originally used LuceneWriter directly, which blocks thread 
execution until the document is added to the index. The new framework 
I'm using uses messaging, which releases control back to the thread 
before the documents are actually sent to be indexed. My threads update 
the document fields in the meantime, so the data written to the index 
is in transition and invalid...


I've adjusted my reindexing threads to ensure new instances of 
everything are used... I will test it shortly...
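
The effect described above can be reproduced without Solr. The following 
is a minimal sketch, not the actual reindexer: DeferredIndexer is a 
hypothetical stand-in for a messaging layer that only queues a reference 
to the document and writes it later, and a plain map keyed by "id" stands 
in for the index. Both names and the "id" field are assumptions made for 
illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for an asynchronous messaging layer: add() only
// queues a reference; nothing is "indexed" until flush() runs later.
class DeferredIndexer {
    private final List<Map<String, String>> pending = new ArrayList<>();
    private final Map<String, Map<String, String>> index = new HashMap<>();

    void add(Map<String, String> doc) {
        pending.add(doc); // keeps the caller's object, does not copy it
    }

    // Writes the queued docs keyed by "id"; a repeated id overwrites the
    // earlier entry. Returns the number of documents in the index.
    int flush() {
        for (Map<String, String> doc : pending) {
            index.put(doc.get("id"), new HashMap<>(doc));
        }
        pending.clear();
        return index.size();
    }
}

class ReuseDemo {
    // Reuses one mutable document for every add (the problematic pattern).
    static int indexWithReuse(int count) {
        DeferredIndexer indexer = new DeferredIndexer();
        Map<String, String> doc = new HashMap<>();
        for (int i = 0; i < count; i++) {
            doc.put("id", "doc-" + i); // mutates the object already queued
            indexer.add(doc);
        }
        return indexer.flush();
    }

    // Creates a new document per add, so queued entries stay independent.
    static int indexWithFreshInstances(int count) {
        DeferredIndexer indexer = new DeferredIndexer();
        for (int i = 0; i < count; i++) {
            Map<String, String> doc = new HashMap<>();
            doc.put("id", "doc-" + i);
            indexer.add(doc);
        }
        return indexer.flush();
    }

    public static void main(String[] args) {
        System.out.println("reused instance: " + indexWithReuse(1000) + " docs");
        System.out.println("fresh instances: " + indexWithFreshInstances(1000) + " docs");
    }
}
```

In this sketch every add is accepted in both cases, yet the reused 
instance leaves a single document in the map because all queued entries 
point at the same object. A real multi-threaded run would be less extreme 
and less predictable, which would be consistent with counts that vary 
from run to run.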


But you point out exactly why I have fewer documents than 'add' requests...

Thanks!

Shalin Shekhar Mangar wrote:


What is the uniqueKey in your schema.xml? Is it possible that those 17494
documents have a common uniqueKey and are therefore getting overwritten?
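
One way to check this on the client side, before anything is sent, is to 
count the distinct key values in a batch. The sketch below assumes 
documents are held as simple field maps and that the uniqueKey field is 
called "id"; both are assumptions, and the class and method names are 
hypothetical.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class UniqueKeyCheck {
    // Counts distinct values of keyField across docs. keyField should be
    // whatever <uniqueKey> names in schema.xml ("id" is assumed below).
    static int distinctKeys(List<Map<String, String>> docs, String keyField) {
        Set<String> seen = new HashSet<>();
        for (Map<String, String> doc : docs) {
            seen.add(doc.get(keyField));
        }
        return seen.size();
    }

    public static void main(String[] args) {
        List<Map<String, String>> batch = List.of(
            Map.of("id", "a", "title", "first"),
            Map.of("id", "b", "title", "second"),
            Map.of("id", "a", "title", "third")); // same key as the first doc
        System.out.println(batch.size() + " adds, "
            + distinctKeys(batch, "id") + " distinct keys");
    }
}
```

If the number of distinct keys is lower than the number of adds, the 
difference is a plausible upper bound on how many documents could have 
been overwritten rather than added.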

Re: Strange missing docs when reindexing with threads.

2009-06-12 Thread Alexander Wallace

That was exactly my issue... I changed my code to not reuse 
document/fields and it is all good now!


Thanks for your support!


Popular keywords statistics .

2009-07-03 Thread Alexander Wallace


Hi all!

I'd like to know if there is anything built into Solr that keeps track 
of keywords being searched and has statistics of those?


If not, or in any case, I'd like to hear what approaches users are 
taking to find out what people are searching for in their apps.


Thanks in advance!

Re: Popular keywords statistics .

2009-07-06 Thread Alexander Wallace


Indeed that was one of the first  approaches...

Thanks a lot!

Michael Ludwig wrote:

Wallace wrote:

I'd like to hear what approaches are being used by users to know what
people are searching for in their apps.


You could process the access log.

You could write a filter servlet logging the relevant part of the query
string to a dedicated location.
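
As a rough illustration of the access-log approach, the sketch below 
pulls the q parameter out of request lines and counts how often each 
query appears. The log format, the /solr/select path, and the class and 
method names are assumptions for the example; lowercasing queries before 
counting is a design choice, not a requirement.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QueryLogStats {
    // Matches the q parameter in a request line such as
    // "GET /solr/select?q=ipod&rows=10 HTTP/1.1" (assumed log format).
    // Requiring '?' or '&' right before "q=" avoids matching "fq=".
    private static final Pattern Q_PARAM =
        Pattern.compile("[?&]q=([^&\\s\"]*)");

    // Returns query -> number of occurrences, sorted by query text.
    static Map<String, Integer> countQueries(List<String> logLines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : logLines) {
            Matcher m = Q_PARAM.matcher(line);
            if (!m.find()) {
                continue; // not a search request
            }
            String query = decode(m.group(1)).trim().toLowerCase(Locale.ROOT);
            if (!query.isEmpty()) {
                counts.merge(query, 1, Integer::sum);
            }
        }
        return counts;
    }

    // URL-decodes a parameter value; falls back to the raw text if the
    // value contains a malformed escape sequence.
    private static String decode(String raw) {
        try {
            return URLDecoder.decode(raw, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        } catch (IllegalArgumentException e) {
            return raw;
        }
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
            "GET /solr/select?q=ipod&rows=10 HTTP/1.1",
            "GET /solr/select?fq=cat:x&q=iPod+nano HTTP/1.1",
            "GET /solr/select?q=ipod HTTP/1.1",
            "GET /solr/admin/ping HTTP/1.1");
        countQueries(lines).forEach((q, n) -> System.out.println(n + "\t" + q));
    }
}
```

The same extraction logic could sit inside a servlet filter instead of a 
log-processing job; that variant needs the servlet API and is not shown.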

Michael Ludwig

Solr cluster topology.

2007-11-20 Thread Alexander Wallace


Hi All!

I just started reading about Solr a couple of days ago (not full time 
of course) and it looks like a pretty impressive set of 
technologies... I still have a few questions I have not found clear 
answers to:


Q: On a cluster, as I understand it, one and only one machine is a 
master, and N servers could be slaves... Do the clients all talk to 
the master for indexing and to a load balancer for searching? Is one 
particular machine configured to know it is the master, or is it only 
the settings for replicating the index that matter? Or does one post 
reindex requests to any of the slaves, which then forward them to the 
master?


How can we have failover in the master?

Is it correct to assume that slaves could always be a bit out of 
sync with the master? A matter of minutes perhaps...


Thanks in advance for your responses!

Re: Solr cluster topology.

2007-11-20 Thread Alexander Wallace


Thanks for the response!

Interesting, this ALL MASTERS mode... I guess you don't do any  
replication then...


In the single master, several slaves mode, I'm assuming the client  
still writes to one and reads from the others... right?


On Nov 20, 2007, at 12:54 PM, Matthew Runo wrote:


Yes. The clients will always be a minute or two behind the master.

I like the way some people are doing it: make them all masters! 
Just post your updates to each of them. You lose a bit of 
performance perhaps, but it doesn't matter if a server bombs out or 
you have to upgrade them, since they're all exactly the same.


--Matthew
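
A minimal sketch of the "post to every server" idea follows. The 
UpdateTarget interface is a hypothetical abstraction over "send this 
update to one Solr server"; the in-memory targets exist only so the 
example runs without a network, and the server names are made up.

```java
import java.util.ArrayList;
import java.util.List;

class AllMastersUpdater {
    // Hypothetical abstraction over posting an update to one server.
    interface UpdateTarget {
        String name();
        void post(String updateXml) throws Exception;
    }

    // Sends the same update to every target and returns the names of the
    // targets that failed, so the caller can retry or re-sync them.
    static List<String> postToAll(List<UpdateTarget> targets, String updateXml) {
        List<String> failed = new ArrayList<>();
        for (UpdateTarget target : targets) {
            try {
                target.post(updateXml);
            } catch (Exception e) {
                failed.add(target.name());
            }
        }
        return failed;
    }

    // In-memory target for demonstration: records updates in sink, or
    // throws on every post when healthy is false.
    static UpdateTarget inMemoryTarget(String name, List<String> sink, boolean healthy) {
        return new UpdateTarget() {
            public String name() { return name; }
            public void post(String updateXml) throws Exception {
                if (!healthy) {
                    throw new Exception(name + " is down");
                }
                sink.add(updateXml);
            }
        };
    }

    public static void main(String[] args) {
        List<String> a = new ArrayList<>();
        List<String> b = new ArrayList<>();
        List<UpdateTarget> targets = List.of(
            inMemoryTarget("solr-a", a, true),
            inMemoryTarget("solr-b", b, false));
        String update = "<add><doc><field name=\"id\">1</field></doc></add>";
        System.out.println("failed: " + postToAll(targets, update));
    }
}
```

The sketch also shows the cost of this layout: a server that misses an 
update is no longer identical to the others, so the client has to track 
failures and replay them.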


Re: Solr cluster topology.

2007-11-21 Thread Alexander Wallace


Thanks a lot for your responses! They were all very helpful!

On Nov 20, 2007, at 5:52 PM, Norberto Meijome wrote:


On Tue, 20 Nov 2007 16:26:27 -0600
Alexander Wallace <[EMAIL PROTECTED]> wrote:


Interesting, this ALL MASTERS mode... I guess you don't do any
replication then...


correct


In the single master, several slaves mode, I'm assuming the client
still writes to one and reads from the others... right?


Correct again.

There is also another approach, which I think in SOLR is called 
FederatedSearch, where a front end queries a number of index 
servers (each with overlapping or non-overlapping data sets) and 
puts together 1 result stream for the answer. There was some 
discussion on the list; 
http://www.mail-archive.com/solr-[EMAIL PROTECTED]/msg06081.html 
is the earliest link in the archive I can find.
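
The merging step of such a front end can be sketched as below. This is 
an illustration of the general idea only, not how Solr implements it: 
the Hit class and merge method are hypothetical, and the sketch assumes 
scores from different servers are directly comparable, which may not 
hold when each index has its own term statistics.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class FederatedMerge {
    // One search result: a document id and its relevance score.
    static final class Hit {
        final String id;
        final double score;

        Hit(String id, double score) {
            this.id = id;
            this.score = score;
        }
    }

    // Combines per-server result lists into one list ordered by score
    // (highest first). When data sets overlap, only the highest-scoring
    // hit for each id is kept. At most limit hits are returned.
    static List<Hit> merge(List<List<Hit>> perServerHits, int limit) {
        List<Hit> all = new ArrayList<>();
        for (List<Hit> hits : perServerHits) {
            all.addAll(hits);
        }
        all.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());

        List<Hit> merged = new ArrayList<>();
        Set<String> seenIds = new HashSet<>();
        for (Hit hit : all) {
            if (merged.size() == limit) {
                break;
            }
            if (seenIds.add(hit.id)) { // false if this id was already emitted
                merged.add(hit);
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<Hit> serverA = List.of(new Hit("a", 0.9), new Hit("b", 0.5));
        List<Hit> serverB = List.of(new Hit("c", 0.7), new Hit("a", 0.6));
        for (Hit hit : merge(List.of(serverA, serverB), 3)) {
            System.out.println(hit.id + " " + hit.score);
        }
    }
}
```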


B
_
{Beto|Norberto|Numard} Meijome

"People demand freedom of speech to make up for the freedom of  
thought which they avoid. "

  Soren Aabye Kierkegaard

I speak for myself, not my employer. Contents may be hot. Slippery  
when wet. Reading disclaimers makes you go blind. Writing them is  
worse. You have been Warned.
