analyzer index vs query vs {missing}

2008-06-30 Thread Norberto Meijome
hi there,
when defining a field type, I understand the meaning of 'analyzer type="index"' 
or type="query". What does it mean when the type is missing? Does it apply at 
both index and query?
This can be found in the example's schema.xml:
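For illustration, the two forms I mean (a sketch only, with made-up field names and filter chains, not the exact snippet from the example schema): an <analyzer> with no type attribute, versus a type="index" / type="query" pair:

```xml
<!-- Hypothetical sketch, not the exact snippet from the example schema.xml. -->

<!-- No type attribute: this one analyzer chain is used at BOTH
     index time and query time. -->
<fieldType name="text_both" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Separate chains: e.g. expand synonyms only on the query side. -->
<fieldType name="text_split" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"/>
  </analyzer>
</fieldType>
```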
thanks!
B

_
{Beto|Norberto|Numard} Meijome

"Humans die and turn to dust, but writing makes us remembered"
  4000-year-old words of an Egyptian scribe

I speak for myself, not my employer. Contents may be hot. Slippery when wet. 
Reading disclaimers makes you go blind. Writing them is worse. You have been 
Warned.


Solr Master Slave Architecture over NFS

2008-06-30 Thread Nico Heid
Hey, I'm looking for some feedback on the following setup.
Due to the architect's decision, I will be working with NFS rather than Solr's 
own distribution scripts.

A few Solr indexing machines use multicore to divide the 300,000 users into 
1000 shards.
For several reasons we have to go with per-user sharding (so, as you can see, 
300 users per shard). Updates come in at about 166 updates per hour on each 
shard, so not a problem.

The question lies more in this concept: I set up a few query slaves using 
read-only NFS mounts.
I do not use the index directory for the read-only slaves. I patched the 
slaves to use the most recent snapshot directory to avoid all the nasty NFS 
issues (only a quick and dirty hack for testing). On a not-yet-defined 
interval I take a snapshot on the masters and send an HTTP commit to the 
slaves, so a new reader on the fresh snapshot is opened.
This seems to work without trouble so far, but I've not done extensive 
testing.

To take this a step further (only an idea yet): I let the slaves work on the 
real index as long as I do not optimize. Because the directory structure does 
not change as long as I do not optimize, I can send commits to the slaves. 
Before I optimize, I take a snapshot, send the slaves a special "commit" to 
make them fall back to the most recent snapshot dir, optimize the index, and 
send them a real commit when done.
Even though it is a little trickier, the query slaves would be more up to date.
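A sketch of one refresh cycle, printed as a dry run (the host names and paths below are hypothetical; Solr 1.x ships a snapshooter script for taking snapshots on the master, and a commit is just an HTTP POST to /update):

```shell
# Dry-run sketch of one refresh cycle: snapshot on the master, then an
# HTTP <commit/> to each query slave so it opens a reader on the fresh
# snapshot. Host names and paths are hypothetical.
MASTER_BIN=/opt/solr/master/bin   # hypothetical snapshooter location
SLAVES="slave1 slave2"            # hypothetical query-slave hosts

plan="$MASTER_BIN/snapshooter
"
for s in $SLAVES; do
  plan="${plan}curl -s http://$s:8983/solr/update -H 'Content-Type: text/xml' --data-binary '<commit/>'
"
done
# In a real cron job you would execute these commands instead of printing them.
printf '%s' "$plan"
```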

So if you have any design comments or see major or minor flaws, feedback would 
be very welcome.

I do not use live data yet; this is the experimental stage. But I'll give 
feedback on how it performs and what issues I run into. There's also a faint 
chance of letting this setup (or a "fixed" version of it) run on the real user 
data, which would be roughly 20 TB of usable data for indexing. That would be 
really interesting :-)

Have a nice week
Nico




RE: Benchmarking tools?

2008-06-30 Thread Nico Heid
Hi,
I did some trivial tests with JMeter.
I set up JMeter to increase the number of threads steadily.
For requests I either use a random word or a combination of words from a
wordlist, or some sample data from the test system. (This is described in the
JMeter manual.)

In my case the system works fine as long as I don't exceed the max number of
requests per second it can handle. But that's not a big surprise. More
interesting is the fact that, to a certain degree, after exceeding the max
number of requests, response time seems to rise linearly for a little while
and then exponentially. But that might also be an artifact of my test scenario.

Nico
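A sketch of the wordlist idea: generate random keyword queries up front and write them to a urls.txt that siege can replay (or that JMeter can read as a data file). The word list and host below are made up:

```python
import random

# Hypothetical word list; in practice, pull real terms from your index or logs.
WORDS = ["solr", "lucene", "facet", "shard", "snapshot", "commit", "index"]

def make_query_urls(n, host="localhost", port=8983, seed=42):
    """Build n search URLs, each querying 1-3 random words from WORDS."""
    rng = random.Random(seed)  # fixed seed: every run replays the same queries
    urls = []
    for _ in range(n):
        terms = rng.sample(WORDS, rng.randint(1, 3))
        urls.append(f"http://{host}:{port}/solr/select?q={'+'.join(terms)}")
    return urls

if __name__ == "__main__":
    with open("urls.txt", "w") as f:  # replay with: siege -f urls.txt
        f.write("\n".join(make_query_urls(100)))
```

The fixed seed matters: it keeps every test run hitting the same queries, which is the same point made above about comparing settings between runs.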


> -Original Message-
> From: Jacob Singh [mailto:[EMAIL PROTECTED]
> Sent: Sunday, June 29, 2008 6:04 PM
> To: solr-user@lucene.apache.org
> Subject: Benchmarking tools?
>
> Hi folks,
>
> Does anyone have any bright ideas on how to benchmark solr?
> Unless someone has something better, here is what I am thinking:
>
> 1. Have a config file where one can specify info like how
> many docs, how large, how many facets, and how many updates /
> searches per minute
>
> 2. Use one of the various client APIs to generate XML files
> for updates using some kind of lorem ipsum text as a base and
> store them in a dir.
>
> 3. Use siege to set the update run at whatever interval is
> specified in the config, sending an update every x seconds
> and removing it from the directory
>
> 4. Generate a list of search queries based upon the facets
> created, and build a urls.txt with all of these search urls
>
> 5. Run the searches through siege
>
> 6. Monitor the output using nagios to see where load kicks in.
>
> This is not that sophisticated, and feels like it won't
> really pinpoint bottlenecks, but would approximately tell us
> where a server will start to bail.
>
> Does anyone have any better ideas?
>
> Best,
> Jacob Singh
>




Limit Porter stemmer to plural stemming only?

2008-06-30 Thread climbingrose
Hi all,
The Porter stemmer is in general really good. However, there are some cases
where it doesn't work. For example, "accountant" matches "Accountant" as
well as "Account Manager", which isn't desirable. Is it possible to use this
analyser for plural words only? For example:
+Accountant -> accountant
+Accountants -> accountant
+Account -> Account
+Accounts -> account

Thanks.

-- 
Regards,

Cuong Hoang


Re: Benchmarking tools?

2008-06-30 Thread Jacob Singh
Hi Nico,

Thanks for the info. Do you have your scripts available for this?

Also, is it configurable to give variable numbers of facets and facet-based
searches? I have a feeling this will be the limiting factor, and much slower
than keyword searches, but I could be (and usually am) wrong.

Best,

Jacob

Nico Heid wrote:
> Hi,
> I did some trivial Tests with Jmeter.
> I set up Jmeter to increase the number of threads steadily.
> For requests I either usa a random word or combination of words in a
> wordlist or some sample date from the test system. (this is described in the
> JMeter manual)
> 
> In my case the System works fine as long as I don't exceed the max number of
> requests per second it can handel. But thats not a big surprise. More
> interesting seems the fact, that to a certain degree, after exceeding the
> max nr of requests response time seems to rise linear for a little while and
> then exponentially. But that might also be the result of my test szenario.
> 
> Nico
> 
> 
>> -Original Message-
>> From: Jacob Singh [mailto:[EMAIL PROTECTED]
>> Sent: Sunday, June 29, 2008 6:04 PM
>> To: solr-user@lucene.apache.org
>> Subject: Benchmarking tools?
>>
>> Hi folks,
>>
>> Does anyone have any bright ideas on how to benchmark solr?
>> Unless someone has something better, here is what I am thinking:
>>
>> 1. Have a config file where one can specify info like how
>> many docs, how large, how many facets, and how many updates /
>> searches per minute
>>
>> 2. Use one of the various client APIs to generate XML files
>> for updates using some kind of lorem ipsum text as a base and
>> store them in a dir.
>>
>> 3. Use siege to set the update run at whatever interval is
>> specified in the config, sending an update every x seconds
>> and removing it from the directory
>>
>> 4. Generate a list of search queries based upon the facets
>> created, and build a urls.txt with all of these search urls
>>
>> 5. Run the searches through siege
>>
>> 6. Monitor the output using nagios to see where load kicks in.
>>
>> This is not that sophisticated, and feels like it won't
>> really pinpoint bottlenecks, but would aproximately tell us
>> where a server will start to bail.
>>
>> Does anyone have any better ideas?
>>
>> Best,
>> Jacob Singh
>>
> 
> 



Re: Limit Porter stemmer to plural stemming only?

2008-06-30 Thread climbingrose
Ok, it looks like step 1a in the Porter algorithm does what I need.
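For reference, step 1a is the only part of the Porter algorithm that touches plurals; a minimal standalone sketch of just that step (not Solr's actual filter code):

```python
def porter_step_1a(word):
    """Porter step 1a only: strip plural suffixes, leave everything else alone."""
    w = word.lower()
    if w.endswith("sses"):
        return w[:-2]      # SSES -> SS   (caresses -> caress)
    if w.endswith("ies"):
        return w[:-2]      # IES  -> I    (ponies -> poni)
    if w.endswith("ss"):
        return w           # SS   -> SS   (caress -> caress)
    if w.endswith("s"):
        return w[:-1]      # S    ->      (accounts -> account)
    return w
```

With only this step applied, "accountants" stems to "accountant" but "accountant" itself is left intact, which is exactly the behaviour asked for above.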
On Mon, Jun 30, 2008 at 6:39 PM, climbingrose <[EMAIL PROTECTED]>
wrote:

> Hi all,
> Porter stemmer in general is really good. However, there are some cases
> where it doesn't work. For example, "accountant" matches "Accountant" as
> well as "Account Manager" which isn't desirable. Is it possible to use this
> analyser for plural words only? For example:
> +Accountant -> accountant
> +Accountants -> accountant
> +Account -> Account
> +Accounts -> account
>
> Thanks.
>
> --
> Regards,
>
> Cuong Hoang
>



-- 
Regards,

Cuong Hoang


Re: analyzer index vs query vs {missing}

2008-06-30 Thread Erik Hatcher

Yes, that's exactly what it means.

Erik


On Jun 30, 2008, at 3:01 AM, Norberto Meijome wrote:


hi there,
when defining a field type, i understand the meaning of 'analyzer  
type="index"' , or type="query". What does it mean when the type is  
missing? does it apply at both index and query ?

This can be found in the example's schema.xml :

   <fieldType ... positionIncrementGap="100">
     <analyzer>
       ...
     </analyzer>
   </fieldType>

thanks!
B

_
{Beto|Norberto|Numard} Meijome

"Humans die and turn to dust, but writing makes us remembered"
 4000-year-old words of an Egyptian scribe

I speak for myself, not my employer. Contents may be hot. Slippery  
when wet. Reading disclaimers makes you go blind. Writing them is  
worse. You have been Warned.




1.3 maven artifact

2008-06-30 Thread Stefan Oestreicher
Hi,

I just wanted to ask whether Solr 1.3 is already available as a Maven
artifact. If it is not, could you give me an estimate of when it will be?

TIA,
 
Stefan Oestreicher
 
--
Dr. Maté GmbH
Stefan Oestreicher / Entwicklung
[EMAIL PROTECTED]
http://www.netdoktor.at
Tel Buero: + 43 1 405 55 75 24
Fax Buero: + 43 1 405 55 75 55
Alser Str. 4 1090 Wien Altes AKH Hof 1 1.6.6



Re: Benchmarking tools?

2008-06-30 Thread Nico Heid

Hi,
I basically followed this:
http://wiki.apache.org/jakarta-jmeter/JMeterFAQ#head-1680863678257fbcb85bd97351860eb0049f19ae

I basically put all my queries in a flat text file. You could either use 
two parameters or put them in one file.
The good point of this is that each test uses the same queries, so you 
can compare the settings better afterwards.

If you use varying facets, you might just go with two text files. If the 
facet stays the same within one test, you can hardcode it into the test case.

I polished the result a little, if you want to take a look: 
http://i31.tinypic.com/28c2blk.jpg (JMeter itself does not plot such 
nice graphs).
Green is the max results delivered; above 66 "active users" per second 
the response time increases (orange/yellow are the average and median of 
the response times).
(I know the scales and descriptions are missing :-) but you should get 
the picture.)
I manually reduced the machine's capacity; otherwise Solr would serve 
more than 12000 requests per second (the whole index fit into RAM).

I can send you my saved test case if this would help you.

Nico


Jacob Singh wrote:

Hi Nico,

Thanks for the info. Do you have you scripts available for this?

Also, is it configurable to give variable numbers of facets and facet
based searches?  I have a feeling this will be the limiting factor, and
much slower than keyword searches but I could be (and usually am) wrong.

Best,

Jacob

Nico Heid wrote:
  

Hi,
I did some trivial Tests with Jmeter.
I set up Jmeter to increase the number of threads steadily.
For requests I either usa a random word or combination of words in a
wordlist or some sample date from the test system. (this is described in the
JMeter manual)

In my case the System works fine as long as I don't exceed the max number of
requests per second it can handel. But thats not a big surprise. More
interesting seems the fact, that to a certain degree, after exceeding the
max nr of requests response time seems to rise linear for a little while and
then exponentially. But that might also be the result of my test szenario.

Nico




-Original Message-
From: Jacob Singh [mailto:[EMAIL PROTECTED]
Sent: Sunday, June 29, 2008 6:04 PM
To: solr-user@lucene.apache.org
Subject: Benchmarking tools?

Hi folks,

Does anyone have any bright ideas on how to benchmark solr?
Unless someone has something better, here is what I am thinking:

1. Have a config file where one can specify info like how
many docs, how large, how many facets, and how many updates /
searches per minute

2. Use one of the various client APIs to generate XML files
for updates using some kind of lorem ipsum text as a base and
store them in a dir.

3. Use siege to set the update run at whatever interval is
specified in the config, sending an update every x seconds
and removing it from the directory

4. Generate a list of search queries based upon the facets
created, and build a urls.txt with all of these search urls

5. Run the searches through siege

6. Monitor the output using nagios to see where load kicks in.

This is not that sophisticated, and feels like it won't
really pinpoint bottlenecks, but would aproximately tell us
where a server will start to bail.

Does anyone have any better ideas?

Best,
Jacob Singh

  





Re: analyzer index vs query vs {missing}

2008-06-30 Thread Norberto Meijome
On Mon, 30 Jun 2008 05:52:33 -0400
Erik Hatcher <[EMAIL PROTECTED]> wrote:

> Yes, that's exactly what it means.
> 
>   Erik

great, thanks for the clarification.
B

_
{Beto|Norberto|Numard} Meijome

"A dream you dream together is reality."
  John Lennon

I speak for myself, not my employer. Contents may be hot. Slippery when wet. 
Reading disclaimers makes you go blind. Writing them is worse. You have been 
Warned.


Re: Benchmarking tools?

2008-06-30 Thread Jacob Singh
nice stuff. Please send me the test case, I'd love to see it.

Thanks,
Jacob
Nico Heid wrote:
> Hi,
> I basically followed this:
> http://wiki.apache.org/jakarta-jmeter/JMeterFAQ#head-1680863678257fbcb85bd97351860eb0049f19ae
> 
> 
> I basically put all my queries in a flat text file. you could either use
> two parameters or put them in one file.
> The good point of this is, that each test uses the same queries, so you
> can compare the settings better afterwards.
> 
> If you use varying facets, you might just go with 2 text files. If it
> stays the same in one test you can hardcode it into the test case.
> 
> I polished the result a little, if you want to take a look:
> http://i31.tinypic.com/28c2blk.jpg , JMeter itself does not plot such
> nice graphs.
> (green is the max results delivered, upon 66 "active users" per second
> the response time increases (orange/yellow, average and median of the
> response times)
> (i know the scales and descriptions are missing :-) but you should get
> the picture)
> I manually reduced the machines capacity, elsewise solr would server
> more than 12000 requests per second. (the whole index did fit into ram)
> I can send you my saved test case if this would help you.
> 
> Nico
> 
> 
> Jacob Singh wrote:
>> Hi Nico,
>>
>> Thanks for the info. Do you have you scripts available for this?
>>
>> Also, is it configurable to give variable numbers of facets and facet
>> based searches?  I have a feeling this will be the limiting factor, and
>> much slower than keyword searches but I could be (and usually am) wrong.
>>
>> Best,
>>
>> Jacob
>>
>> Nico Heid wrote:
>>  
>>> Hi,
>>> I did some trivial Tests with Jmeter.
>>> I set up Jmeter to increase the number of threads steadily.
>>> For requests I either usa a random word or combination of words in a
>>> wordlist or some sample date from the test system. (this is described
>>> in the
>>> JMeter manual)
>>>
>>> In my case the System works fine as long as I don't exceed the max
>>> number of
>>> requests per second it can handel. But thats not a big surprise. More
>>> interesting seems the fact, that to a certain degree, after exceeding
>>> the
>>> max nr of requests response time seems to rise linear for a little
>>> while and
>>> then exponentially. But that might also be the result of my test
>>> szenario.
>>>
>>> Nico
>>>
>>>
>>>
 -Original Message-
 From: Jacob Singh [mailto:[EMAIL PROTECTED]
 Sent: Sunday, June 29, 2008 6:04 PM
 To: solr-user@lucene.apache.org
 Subject: Benchmarking tools?

 Hi folks,

 Does anyone have any bright ideas on how to benchmark solr?
 Unless someone has something better, here is what I am thinking:

 1. Have a config file where one can specify info like how
 many docs, how large, how many facets, and how many updates /
 searches per minute

 2. Use one of the various client APIs to generate XML files
 for updates using some kind of lorem ipsum text as a base and
 store them in a dir.

 3. Use siege to set the update run at whatever interval is
 specified in the config, sending an update every x seconds
 and removing it from the directory

 4. Generate a list of search queries based upon the facets
 created, and build a urls.txt with all of these search urls

 5. Run the searches through siege

 6. Monitor the output using nagios to see where load kicks in.

 This is not that sophisticated, and feels like it won't
 really pinpoint bottlenecks, but would aproximately tell us
 where a server will start to bail.

 Does anyone have any better ideas?

 Best,
 Jacob Singh

   
>>> 
> 



Minimum JDK for SolrJ?

2008-06-30 Thread Todd Breiholz
What is the minimum JDK that can be used for developing clients that use
SolrJ? I am stuck on JDK 1.4.2 at the moment and am wondering if SolrJ is an
option for me.

Thanks!

Todd


Re: Minimum JDK for SolrJ?

2008-06-30 Thread Noble Paul നോബിള്‍ नोब्ळ्
SolrJ needs a minimum of Java 5.
--Noble

On Mon, Jun 30, 2008 at 8:00 PM, Todd Breiholz <[EMAIL PROTECTED]> wrote:
> What is the minimum JDK that can be used for developing clients that use
> SolrJ? I am stuck on JDK 1.4.2 at the moment and am wondering if SolrJ is an
> option for me.
>
> Thanks!
>
> Todd
>



-- 
--Noble Paul


Re: Benchmarking tools?

2008-06-30 Thread Yugang Hu

Me too. Thanks.

Jacob Singh wrote:

nice stuff. Please send me the test case, I'd love to see it.

Thanks,
Jacob
Nico Heid wrote:
  

Hi,
I basically followed this:
http://wiki.apache.org/jakarta-jmeter/JMeterFAQ#head-1680863678257fbcb85bd97351860eb0049f19ae


I basically put all my queries in a flat text file. you could either use
two parameters or put them in one file.
The good point of this is, that each test uses the same queries, so you
can compare the settings better afterwards.

If you use varying facets, you might just go with 2 text files. If it
stays the same in one test you can hardcode it into the test case.

I polished the result a little, if you want to take a look:
http://i31.tinypic.com/28c2blk.jpg , JMeter itself does not plot such
nice graphs.
(green is the max results delivered, upon 66 "active users" per second
the response time increases (orange/yellow, average and median of the
response times)
(i know the scales and descriptions are missing :-) but you should get
the picture)
I manually reduced the machines capacity, elsewise solr would server
more than 12000 requests per second. (the whole index did fit into ram)
I can send you my saved test case if this would help you.

Nico


Jacob Singh wrote:


Hi Nico,

Thanks for the info. Do you have you scripts available for this?

Also, is it configurable to give variable numbers of facets and facet
based searches?  I have a feeling this will be the limiting factor, and
much slower than keyword searches but I could be (and usually am) wrong.

Best,

Jacob

Nico Heid wrote:
 
  

Hi,
I did some trivial Tests with Jmeter.
I set up Jmeter to increase the number of threads steadily.
For requests I either usa a random word or combination of words in a
wordlist or some sample date from the test system. (this is described
in the
JMeter manual)

In my case the System works fine as long as I don't exceed the max
number of
requests per second it can handel. But thats not a big surprise. More
interesting seems the fact, that to a certain degree, after exceeding
the
max nr of requests response time seems to rise linear for a little
while and
then exponentially. But that might also be the result of my test
szenario.

Nico


   


-Original Message-
From: Jacob Singh [mailto:[EMAIL PROTECTED]
Sent: Sunday, June 29, 2008 6:04 PM
To: solr-user@lucene.apache.org
Subject: Benchmarking tools?

Hi folks,

Does anyone have any bright ideas on how to benchmark solr?
Unless someone has something better, here is what I am thinking:

1. Have a config file where one can specify info like how
many docs, how large, how many facets, and how many updates /
searches per minute

2. Use one of the various client APIs to generate XML files
for updates using some kind of lorem ipsum text as a base and
store them in a dir.

3. Use siege to set the update run at whatever interval is
specified in the config, sending an update every x seconds
and removing it from the directory

4. Generate a list of search queries based upon the facets
created, and build a urls.txt with all of these search urls

5. Run the searches through siege

6. Monitor the output using nagios to see where load kicks in.

This is not that sophisticated, and feels like it won't
really pinpoint bottlenecks, but would aproximately tell us
where a server will start to bail.

Does anyone have any better ideas?

Best,
Jacob Singh

  
  




  




Re: Solr Master Slave Architecture over NFS

2008-06-30 Thread Bill Au
Isn't using Lucene over NFS *not* recommended?

Bill

On Mon, Jun 30, 2008 at 4:27 AM, Nico Heid <[EMAIL PROTECTED]> wrote:

> Hey, I'm looking for some feedback on the following setup.
> Due to the architects decision I will be working with NFS not Solr's own
> distribution scripts.
>
> A few Solr indexing machines use Multicore to divide the 300.000 Users to
> 1000
> shards.
> For several reasons we have to go with per user sharding (as you can see
> 300
> per shard) Updates come in with about 166 updates per hour on each shard.
> So
> not a problem.
>
> The question lies more in this concept: I set up a few Query Slaves, using
> NFS
> readonly mounts.
> I do not use the index directory for the readonly slaves. I patched the
> slaves
> to use the most recent snapshot directory to avoid all the nasty nfs
> issues.
> (only a quick and dirty hack for testing) On a not yet defined interval I
> do a
> snapshot on the masters and send a http commit to the slave, so a new
> reader
> on the fresh snapshot is opened.
> This seems to work without trouble so far, but I've not done extensive
> testing.
>
> To take this a step further (only an idea yet). I let the slaves work on
> the
> real index, as long as I do not optimize. Because the directory structure
> is
> not changing as long as I do not optimize, I can send commits to the
> slaves.
> Before I optimize I take a snapshot, send them a special "commit" to make
> them
> fall back to the most recent snapshot dir, optimize the index and send them
> a
> real commit when done.
> Even though a little trickier I would be more up to date with the query
> slaves.
>
> So if you have any design comments or see major or minor flaws, feedback
> would
> be very welcome.
>
> I do not use live data yet, this is the experimental stage. But I'll give
> feedback on how it performs and what issues I run into. There's also the
> faint
> chance of letting this setup (or a "fixed" one) run on the real user data,
> which would be roughly 20TB of usable data for indexing. This would be
> really
> interesting :-)
>
> Have a nice week
> Nico
>
>
>


RE: UnicodeNormalizationFilterFactory

2008-06-30 Thread Steven A Rowe
Hi Robert,

Could you create a JIRA issue and attach your code to it?  That makes it easier 
for people to evaluate it (rather than just a binary distribution).

This sounds general enough to me that it would be a useful addition to Lucene 
itself.  Solr's factory could then just be sugar on top.

Thanks,
Steve

On 06/26/2008 at 4:41 PM, Robert Haschart wrote:
> Lance Norskog wrote:
> 
> > ISOLatin1AccentFilterFactory works quite well for us. It solves our
> > basic euro-text keyboard searching problem, where "protege" should find
> > protégé. ("protege" with two accents.)
> > 
> > -Original Message-
> > From: Chris Hostetter [mailto:[EMAIL PROTECTED]
> > Sent: Tuesday, June 24, 2008 4:05 PM
> > To: solr-user@lucene.apache.org
> > Subject: Re: UnicodeNormalizationFilterFactory
> > 
> > 
> > > I've seen mention of these filters:
> > > 
> > >  
> > >  
> > 
> > Are you asking because you saw these in Robert Haschart's reply to your
> > previous question?  I think those are custom Filters that he has in his
> > project ... not open source (but i may be wrong)
> > 
> > they are certainly not something that comes out of the box w/ Solr.
> > 
> > 
> > -Hoss
> > 
> > 
> The ISOLatin1AccentFilter works well in the case above described by
> Lance Norskog, ie. for words containing characters with accents where
> the accented character is a single unicode character for the
> letter with
> the accent mark as in protégé. However in the data that we work with,
> often accented characters will be represented by a plain unaccented
> character followed by the Unicode combining character for the accent
> mark, roughly like this: prote'ge' which emerge from the
> ISOLatin1AccentFilter unchanged.
> 
> After some research I found the UnicodeNormalizationFilter mentioned
> above, which did not work on my development system (because it relies
> on features only available in Java 6), and which, when combined with the
> DiacriticsFilter also mentioned above, would remove diacritics from
> characters but also discard any Chinese or Russian characters, or
> anything else outside the 0x0--0x7f range. Which is bad.
> 
> I first modified the filter to normalize the characters to the composed
> normalized form (changing prote'ge' to protégé) and then pass the
> results through the ISOLatin1AccentFilter. However, for accented
> characters for which there is no composed normalized form (such as the
> n and s in Zarin̦š) the accents are not removed.
> 
> So I took the approach of decomposing the accented characters, and then
> only removing the valid diacritics and zero-width composing characters
> from the result, and the resulting filter works quite well. And since it
> was developed as a part of the blacklight project at the University of
> Virginia it is Open Source under the Apache License.
> 
> If anyone is interested in evaluating or using the
> UnicodeNormalizationFilter in conjunction with their Solr installation,
> get the UnicodeNormalizeFilter.jar from:
> 
> http://blacklight.rubyforge.org/svn/trunk/solr/lib/
> 
> and place it in a lib directory next to the conf directory in
> your Solr
> home directory.
> 
> Robert Haschart
> 
> 
> 
> 
> 
> 
> 
>
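The decompose-then-strip approach Robert describes can be sketched in a few lines (this is the general technique, not the code in UnicodeNormalizeFilter.jar): normalize to NFD so each accent becomes a separate combining character, then drop only the combining marks. Chinese or Cyrillic text passes through untouched:

```python
import unicodedata

def strip_diacritics(text):
    """Decompose to NFD, then drop combining marks; keep all other characters."""
    nfd = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in nfd if not unicodedata.combining(ch))

print(strip_diacritics("protégé"))  # protege
```

This handles both the composed form (protégé) and the already-decomposed form (prote'ge' with combining accents) that trips up ISOLatin1AccentFilter.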

 



Re: Solr Master Slave Architecture over NFS

2008-06-30 Thread Grant Ingersoll
I think it comes with some caveats, but is now workable (although it may 
not give great performance), assuming you're using 2.3 (2.2) or later. 
I would definitely do a search in the Lucene archives about NFS, 
especially paying attention to Mike McCandless' comments.



On Jun 30, 2008, at 1:08 PM, Bill Au wrote:


Isn't using Lucene over NFS *not* recommended?

Bill

On Mon, Jun 30, 2008 at 4:27 AM, Nico Heid <[EMAIL PROTECTED]> wrote:


Hey, I'm looking for some feedback on the following setup.
Due to the architects decision I will be working with NFS not  
Solr's own

distribution scripts.

A few Solr indexing machines use Multicore to divide the 300.000  
Users to

1000
shards.
For several reasons we have to go with per user sharding (as you  
can see

300
per shard) Updates come in with about 166 updates per hour on each  
shard.

So
not a problem.

The question lies more in this concept: I set up a few Query  
Slaves, using

NFS
readonly mounts.
I do not use the index directory for the readonly slaves. I patched  
the

slaves
to use the most recent snapshot directory to avoid all the nasty nfs
issues.
(only a quick and dirty hack for testing) On a not yet defined  
interval I

do a
snapshot on the masters and send a http commit to the slave, so a new
reader
on the fresh snapshot is opened.
This seems to work without trouble so far, but I've not done  
extensive

testing.

To take this a step further (only an idea yet). I let the slaves  
work on

the
real index, as long as I do not optimize. Because the directory  
structure

is
not changing as long as I do not optimize, I can send commits to the
slaves.
Before I optimize I take a snapshot, send them a special "commit"  
to make

them
fall back to the most recent snapshot dir, optimize the index and  
send them

a
real commit when done.
Even though a little trickier I would be more up to date with the  
query

slaves.

So if you have any design comments or see major or minor flaws,  
feedback

would
be very welcome.

I do not use live data yet, this is the experimental stage. But  
I'll give
feedback on how it performs and what issues I run into. There's  
also the

faint
chance of letting this setup (or a "fixed" one) run on the real  
user data,
which would be roughly 20TB of usable data for indexing. This would  
be

really
interesting :-)

Have a nice week
Nico





--
Grant Ingersoll
http://www.lucidimagination.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ









Re: Limit Porter stemmer to plural stemming only?

2008-06-30 Thread Mike Klaas
If you find a solution that works well, I encourage you to contribute  
it back to Solr.  Plural-only stemming is probably a common need (I've  
definitely wanted to use it before).


cheers,
-Mike

On 30-Jun-08, at 2:25 AM, climbingrose wrote:


Ok, it looks like step 1a in Porter algo does what I need.
On Mon, Jun 30, 2008 at 6:39 PM, climbingrose <[EMAIL PROTECTED]>
wrote:


Hi all,
Porter stemmer in general is really good. However, there are some  
cases
where it doesn't work. For example, "accountant" matches  
"Accountant" as
well as "Account Manager" which isn't desirable. Is it possible to  
use this

analyser for plural words only? For example:
+Accountant -> accountant
+Accountants -> accountant
+Account -> Account
+Accounts -> account

Thanks.

--
Regards,

Cuong Hoang





--
Regards,

Cuong Hoang




Re: Efficient date-based results sorting

2008-06-30 Thread Chris Hostetter


: Subject: Efficient date-based results sorting

Sorting on anything but score is done pretty much the exact same way 
regardless of data type. The one thing you can do to make sorting on any 
field more efficient is to try to reduce the cardinality of the field 
-- i.e., reduce the number of unique indexed terms.

With date-based fields, that means that if you don't care about 
millisecond granularity when you sort by date, round to the nearest second 
when you index that field.  If you don't care about second granularity, 
round to the nearest minute, etc.
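Client-side, the rounding is trivial; a sketch (assuming you format the dates yourself before sending documents to Solr):

```python
from datetime import datetime, timezone

def round_to_minute(dt):
    """Drop seconds and sub-seconds so far fewer unique date terms get indexed."""
    return dt.replace(second=0, microsecond=0)

d = datetime(2008, 6, 30, 12, 34, 56, tzinfo=timezone.utc)
print(round_to_minute(d).strftime("%Y-%m-%dT%H:%M:%SZ"))  # 2008-06-30T12:34:00Z
```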

I suppose there is also this issue...

http://issues.apache.org/jira/browse/SOLR-440

...if someone implements a new DateField class that uses SortableLong as 
the underlying format instead of a string, sorting can be more *memory* 
efficient, but the speed of sorted queries will be about the same.

-Hoss



Re: Search query optimization

2008-06-30 Thread wojtekpia

If I know that condition C will eliminate more results than either A or B,
does specifying the query as: "C AND A AND B" make it any faster (than the
original "A AND B AND C")?
-- 
View this message in context: 
http://www.nabble.com/Search-query-optimization-tp17544667p18205504.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Search query optimization

2008-06-30 Thread Chris Hostetter

: If I know that condition C will eliminate more results than either A or B,
: does specifying the query as: "C AND A AND B" make it any faster (than the
: original "A AND B AND C")?

Nope.  Lucene takes care of that for you.



-Hoss



Re: Limit Porter stemmer to plural stemming only?

2008-06-30 Thread climbingrose
I modified the original English stemmer written in the Snowball language and
regenerated the Java implementation using the Snowball compiler. It's been
working for me so far. I can certainly share the modified Snowball English
stemmer if anyone wants to use it.

Cheers,
Cuong

On Tue, Jul 1, 2008 at 4:12 AM, Mike Klaas <[EMAIL PROTECTED]> wrote:

> If you find a solution that works well, I encourage you to contribute it
> back to Solr.  Plural-only stemming is probably a common need (I've
> definitely wanted to use it before).
>
> cheers,
> -Mike
>
>
> On 30-Jun-08, at 2:25 AM, climbingrose wrote:
>
>  Ok, it looks like step 1a in Porter algo does what I need.
>> On Mon, Jun 30, 2008 at 6:39 PM, climbingrose <[EMAIL PROTECTED]>
>> wrote:
>>
>>  Hi all,
>>> Porter stemmer in general is really good. However, there are some cases
>>> where it doesn't work. For example, "accountant" matches "Accountant" as
>>> well as "Account Manager" which isn't desirable. Is it possible to use
>>> this
>>> analyser for plural words only? For example:
>>> +Accountant -> accountant
>>> +Accountants -> accountant
>>> +Account -> Account
>>> +Accounts -> account
>>>
>>> Thanks.
>>>
>>> --
>>> Regards,
>>>
>>> Cuong Hoang
>>>
>>>
>>
>>
>> --
>> Regards,
>>
>> Cuong Hoang
>>
>
>


-- 
Regards,

Cuong Hoang