Re: Some advice on scalability

2008-05-18 Thread Norberto Meijome
On Thu, 15 May 2008 12:54:25 -0700 (PDT)
Otis Gospodnetic <[EMAIL PROTECTED]> wrote:

> 5) Hardware recommendations are hard to do.  While people may make 
> suggestions, the only way to know how *your* hardware works with *your* data 
> and *your* shards and *your* type of queries is by benchmarking.

Hi Otis,
is there a recommended set of benchmarks, or suggestions on the best way to 
benchmark, other than replaying log files to mimic your users' behaviour? 

cheers,
B
_
{Beto|Norberto|Numard} Meijome

" An invasion of armies can be resisted, 
  but not an idea whose time has come."
  Victor Hugo

I speak for myself, not my employer. Contents may be hot. Slippery when wet. 
Reading disclaimers makes you go blind. Writing them is worse. You have been 
Warned.


Re: Some advice on scalability

2008-05-18 Thread Norberto Meijome
On Thu, 15 May 2008 09:23:03 -0700
"William Pierce" <[EMAIL PROTECTED]> wrote:

[...]
> 
> Our app in brief:  We get merchant sku files (in either xml/csv) which we 
> process and index and make available to our site visitors to search.   Our 
> current plan calls for us to support approx 10,000 merchants each with an 
> average of 50,000 sku's.   This will make a total of approx 500 Million SKUs. 
>  In addition,  we assume that on a daily basis approx 5-10% of the SKUs need 
> to be updated (either added/deleted/modified).   (Assume each sku will be 
> approx 4K)

[...]

> 
> b) Or, should we partition the 10,000 merchants into N buckets and have a 
> master index for each of the N buckets?   We could partition the merchants 
> depending on their type or some other simple algorithm.   Then,  we could 
> have slaves setup for each of the N masters.  The trick here will be to 
> partition the merchants carefully.  Ideally we would like a search for any 
> product type to hit only one index but this may not be possible always.   For 
> example, a search for "Harry Potter" may result in hits in "books", "dvds", 
> "memorabilia", etc etc.  
> 
> With N masters we will have to plan for having a distributed search across 
> the N indices (and then some mechanism for weighting the results across the 
> results that come back).   Any recommendations for a distributed search 
> solution?   

SOLR 1.3 supports distributed search across multiple indices out of the box.
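
For example (just a sketch - the host names, ports and core paths below are 
made up, use whatever your deployment looks like), a distributed query in 1.3 
is an ordinary request with a 'shards' parameter listing the servers to fan 
out to:

  http://search1:8983/solr/select?q=harry+potter&start=0&rows=10&shards=search1:8983/solr,search2:8983/solr,search3:8983/solr

Solr queries each shard, merges the partial results by score, and returns a 
single ranked list, so the client doesn't need its own weighting logic.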

> I saw some references to Katta.  Is this viable?

I was going to suggest that a MapReduce approach may be able to help -> Hadoop 
(or possibly even some other implementation of distributed computing). I didn't 
know of Katta, thanks for the reference. It seems that Katta is a full-fledged 
integration between a Lucene index and Hadoop - I am not sure where SOLR would 
sit in this solution. No idea how well developed Katta is. 
 
> In the extreme case, we could have one master for each of the merchants (if 
> there are 10,000 merchants there will be 10,000 master indices).   The 
> advantage here is that indices will have to be updated only for every 
> merchant who submits a new data file.  The others remain unchanged.

Not sure about this... gut feel tells me you'd be wasting lots of resources on 
containers rather than on data...

Let us know what design you come up with :)

Cheers,
B
_
{Beto|Norberto|Numard} Meijome

"When the Paris Exhibition closes electric light will close with it and no more
be heard of." Erasmus Wilson (1878) Professor at Oxford University

I speak for myself, not my employer. Contents may be hot. Slippery when wet.
Reading disclaimers makes you go blind. Writing them is worse. You have been
Warned.


solr feed problem

2008-05-18 Thread Cam Bazz
hello,

I am trying to feed solr with xml files of my own schema, and I am getting:

SEVERE: org.xmlpull.v1.XmlPullParserException: entity reference names can
not start with character '\ufffd'

My XML is UTF-8 for sure, as is the text inside, but for some reason I
get this exception and then Solr crashes.

Any ideas?

Best Regards,
-C.B.


Re: solr feed problem

2008-05-18 Thread Yonik Seeley
\ufffd (U+FFFD) isn't a character you'd ever expect in real data - it's the
Unicode replacement character, substituted when bytes can't be decoded.
http://www.fileformat.info/info/unicode/char/fffd/index.html
Your XML document or data probably had some kind of encoding issue
along the way somewhere.

-Yonik

On Sun, May 18, 2008 at 7:59 PM, Cam Bazz <[EMAIL PROTECTED]> wrote:
> hello,
>
> I am trying to feed solr with xml files of my own schema, and I am getting:
>
> SEVERE: org.xmlpull.v1.XmlPullParserException: entity reference names can
> not start with character '\ufffd'
>
> my xml is utf8 for sure, as well as the text inside. but for some reason I
> get this exception and then solr crashes.
>
> Any ideas?
>
> Best Regards,
> -C.B.
>


RE: solr feed problem

2008-05-18 Thread Steven A Rowe
Hi Cam,

On 05/18/2008 at 7:59 PM, Cam Bazz wrote:
> SEVERE: org.xmlpull.v1.XmlPullParserException: entity
> reference names can not start with character '\ufffd'

You likely have the sequence "&\ufffd" in a parsed character data section of a 
document, and the parser, seeing the ampersand, knows that either an entity 
reference name or a character reference beginning with '#' must follow.

XML entity reference names can only start with what the XML spec designates as 
Letter characters[1]; U+FFFD is not one of these.

If I'm right, then the document in question is not well-formed XML, and the XML 
spec requires parsers to fail under this circumstance.  Try using xmllint (part 
of the Gnome Libxml2 project[2]) or some other XML well-formedness checker, and 
you should see the same error.
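
A minimal check (assuming the problem file is called feed.xml - substitute your 
own file name) would be:

  xmllint --noout feed.xml

With --noout, xmllint prints nothing when the document is well-formed; otherwise 
it reports the line and column of the first error, which should land on the 
stray '\ufffd'.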

Steve

[1] The XML spec Letter definition: 
[2] Gnome Libxml2 project: 


Auto commit and optimize settings

2008-05-18 Thread Vaijanath N. Rao

Hi Solr-Users,

I have gone through the solrconfig.xml file in the example directory of 
the Solr build (nightly build). I wanted to know whether there is a way to tell 
Solr to optimize the index after a certain number of seconds has elapsed, or 
after a certain number of records has been indexed, as we can do with auto-commit.
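
For reference, the auto-commit block I mean looks roughly like this in the 
example solrconfig.xml (the thresholds are just placeholders); I am asking 
whether an equivalent exists for optimize:

  <updateHandler class="solr.DirectUpdateHandler2">
    <autoCommit>
      <maxDocs>1000</maxDocs>     <!-- commit after this many added docs -->
      <maxTime>60000</maxTime>    <!-- or after this many milliseconds -->
    </autoCommit>
  </updateHandler>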


--Thanks and Regards
Vaijanath




Re: Auto commit and optimize settings

2008-05-18 Thread Otis Gospodnetic
Hi,

There is no such option currently, and it is not likely that such a feature will 
be added, because index optimization is not a quick and lightweight 
operation; one typically optimizes only after the index is fully built and 
one knows the index will remain unchanged for a while.  If you do need to 
optimize periodically for some reason, just send optimize commands to Solr from 
your own application.
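
For example (a minimal sketch - host, port and core path are whatever your 
deployment uses), an optimize is just another update message:

  curl http://localhost:8983/solr/update \
       -H 'Content-Type: text/xml' --data-binary '<optimize/>'

By default the call doesn't return until the optimize has finished, so it's 
easy to run from a cron job or from your indexing code after a full build.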


Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


- Original Message 
> From: Vaijanath N. Rao <[EMAIL PROTECTED]>
> To: solr-user@lucene.apache.org
> Cc: [EMAIL PROTECTED]
> Sent: Monday, May 19, 2008 1:13:03 AM
> Subject: Auto commit and optimize settings
> 
> Hi Solr-Users,
> 
> I have gone through the solrConfig.xml file in the example directory of 
> the solr build (nightly build). I wanted to know is there a way to tell 
> solr to optimize the index after certain number of seconds elapsed or 
> number of records indexed as we do in case of auto-commit.
> 
> --Thanks and Regards
> Vaijanath



Re: Some advice on scalability

2008-05-18 Thread Otis Gospodnetic
Hi,

Not that I can think of at the moment.


Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


- Original Message 
> From: Norberto Meijome <[EMAIL PROTECTED]>
> To: solr-user@lucene.apache.org
> Sent: Sunday, May 18, 2008 10:31:19 AM
> Subject: Re: Some advice on scalability
> 
> On Thu, 15 May 2008 12:54:25 -0700 (PDT)
> Otis Gospodnetic wrote:
> 
> > 5) Hardware recommendations are hard to do.  While people may make 
> suggestions, the only way to know how *your* hardware works with *your* data 
> and 
> *your* shards and *your* type of queries is by benchmarking.
> 
> Hi Otis,
> is there a recommended set of benchmarks, or suggestions on best way to 
> benchmark, other than replaying log files to mimic you users' behaviour? 
> 
> cheers,
> B
> _
> {Beto|Norberto|Numard} Meijome
> 
> " An invasion of armies can be resisted, 
>   but not an idea whose time has come."
>   Victor Hugo
> 
> I speak for myself, not my employer. Contents may be hot. Slippery when wet. 
> Reading disclaimers makes you go blind. Writing them is worse. You have been 
> Warned.



Re: Auto commit and optimize settings

2008-05-18 Thread Vaijanath N. Rao

Hi Otis and Solr-users,

I was under the impression that when one calls optimize, all the index segments 
created so far get merged. Hence my question about optimize.


The reason I want to optimize is that I have the autoCommit feature enabled in 
solrconfig.xml to commit after every 1000 documents. Once I do that, I get a 
"too many open files" error after some time while crawling and indexing a 
large number of sites.


Is there a way I can avoid the "too many open files" issue altogether and 
yet have the index committed after every 1000 docs?


--Thanks and Regards
Vaijanath

Otis Gospodnetic wrote:

Hi,

There is no such option currently and it is not likely that such feature will 
be added because index optimization is not really a quick and lightweight 
operation, so one typically optimized only after the index is fully built and 
one knows the index will remain unchanged for a while.  If you do need to 
optimize periodically for some reason, just send optimize commands to Solr from 
your own application.


Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


- Original Message 
  

From: Vaijanath N. Rao <[EMAIL PROTECTED]>
To: solr-user@lucene.apache.org
Cc: [EMAIL PROTECTED]
Sent: Monday, May 19, 2008 1:13:03 AM
Subject: Auto commit and optimize settings

Hi Solr-Users,

I have gone through the solrConfig.xml file in the example directory of 
the solr build (nightly build). I wanted to know is there a way to tell 
solr to optimize the index after certain number of seconds elapsed or 
number of records indexed as we do in case of auto-commit.


--Thanks and Regards
Vaijanath