Re: Solr feasibility with terabyte-scale data

2008-05-09 Thread James Brady
Hi, we have an index of ~300GB, which is at least approaching the  
ballpark you're in.


Lucky for us, to coin a phrase, we have an 'embarrassingly partitionable'
index, so we can just scale out horizontally across commodity hardware with
no problems at all. We're also using the multicore features available in the
development Solr version to reduce the granularity of core size by an order
of magnitude: this makes for lots of small commits, rather than a few long ones.


There was mention somewhere in the thread of document collections: if  
you're going to be filtering by collection, I'd strongly recommend  
partitioning too. It makes scaling so much less painful!


James
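For illustration, the partition-by-collection idea can be as small as a hash over the collection value, so every document of a collection lands in the same core and a collection-filtered search only has to touch that one core. Everything in this sketch (class names, the list of shard URLs, the field) is hypothetical, not part of James's actual setup:

import java.util.List;

// Hypothetical router: all documents of one collection always land in the
// same shard/core, so a search filtered by collection touches a single index.
public class CollectionRouter {
    private final List<String> shardUrls;   // e.g. one Solr core URL per shard

    public CollectionRouter(List<String> shardUrls) {
        this.shardUrls = shardUrls;
    }

    public String shardFor(String collection) {
        // Mask the sign bit instead of Math.abs (which overflows for MIN_VALUE).
        int bucket = (collection.hashCode() & 0x7fffffff) % shardUrls.size();
        return shardUrls.get(bucket);
    }
}

Posting each document's add request to shardFor(collection) is all the "partitioning" amounts to on the write path; queries filtered by collection then go to the same single shard.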

On 8 May 2008, at 23:37, marcusherou wrote:



Hi.

I will also head down a path like yours within some months from now.
Currently I have an index of ~10M docs and only store ids in the index, for
performance and distribution reasons. When we enter a new market I'm
assuming we will soon hit 100M and quite soon after that 1G documents. Each
document has on average about 3-5 KB of data.

We will use a GlusterFS installation with RAID1 (or RAID10) SATA enclosures
as shared storage (think of it as a SAN, or shared storage at least: one
mount point). Hope this will be the right choice; only the future can tell.


Since we are developing a search engine, I frankly don't think even having
100's of SOLR instances serving the index will cut it performance-wise if we
have one big index. I totally agree with the others claiming that you will
most definitely go OOM or hit some other constraint of SOLR if you must
have the whole result set in memory, sort it, and create an XML response. I
did hit such constraints when I couldn't afford to give the instances enough
memory, and I had only 1M docs back then. And think of it... optimizing a TB
index will take a long, long time, and you really want an optimized
index if you want to reduce search time.

I am thinking of a sharding solution where I fragment the index over the
disk(s) and let each SOLR instance have only a little piece of the total
index. This will require a master database or namenode (or, simpler, just a
properties file in each index dir) of some sort to know which docs are located
on which machine, or at least how many docs each shard has. This is to
ensure that whenever you introduce a new SOLR instance with a new shard, the
master indexer will know which shard to prioritize. This is probably not
enough either, since all new docs will go to the new shard until it is filled
(has the same size as the others); only then will all shards receive docs in
a load-balanced fashion. So whenever you want to add a new indexer you
probably need to initiate a "stealing" process where it steals docs from the
others until it reaches some sort of threshold (10 servers = each shard
should have 1/10 of the docs or such).
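A rough sketch of that "properties file in each index dir" idea: each shard directory keeps a small docs.properties with its current document count, and the master indexer sends new documents to the emptiest shard until the shards even out. The class, file name, and property key here are invented for illustration:

import java.io.*;
import java.util.*;

// Toy shard registry: one docs.properties per index directory records how many
// documents that shard currently holds; new documents go to the smallest shard.
public class ShardRegistry {
    private final List<File> indexDirs;

    public ShardRegistry(List<File> indexDirs) {
        this.indexDirs = indexDirs;
    }

    public File pickShardForNewDoc() throws IOException {
        File smallest = null;
        long smallestCount = Long.MAX_VALUE;
        for (File dir : indexDirs) {
            long count = readCount(dir);
            if (count < smallestCount) {
                smallestCount = count;
                smallest = dir;
            }
        }
        writeCount(smallest, smallestCount + 1);   // claim the slot
        return smallest;
    }

    private long readCount(File dir) throws IOException {
        File f = new File(dir, "docs.properties");
        if (!f.exists()) return 0;
        Properties p = new Properties();
        FileInputStream in = new FileInputStream(f);
        try { p.load(in); } finally { in.close(); }
        return Long.parseLong(p.getProperty("numDocs", "0"));
    }

    private void writeCount(File dir, long count) throws IOException {
        Properties p = new Properties();
        p.setProperty("numDocs", Long.toString(count));
        FileOutputStream out = new FileOutputStream(new File(dir, "docs.properties"));
        try { p.store(out, "shard document count"); } finally { out.close(); }
    }
}

The "stealing" process would then be the same comparison run in reverse: move docs out of any shard whose count sits far above total/numShards.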

I think this will cut it, enabling us to grow with the data. I think
doing distributed reindexing will also be a good thing when it comes to
cutting both indexing and optimizing time. Probably each indexer should
buffer its shard locally on RAID1 SCSI disks, optimize it and then just
copy it to the main index to minimize the burden on the shared storage.


Let's say the indexing part will be all fancy and working at TB scale; now we
come to searching. After talking to other guys who have built big search
engines, I personally believe you need to introduce a controller-like
searcher on the client side which itself searches all of the shards and
merges the responses. Perhaps Distributed Solr solves this; I will love to
test it whenever my new installation of servers and enclosures is finished.


Currently my idea is something like this.
public Page search(SearchDocumentCommand sdc)
    {
    Set<Integer> ids = documentIndexers.keySet();
    int nrOfSearchers = ids.size();
    int totalItems = 0;
    Page docs = new Page(sdc.getPage(), sdc.getPageSize());

    for (Iterator<Integer> iterator = ids.iterator(); iterator.hasNext();)
        {
        Integer id = iterator.next();
        // pick one replica of this shard at random
        List<DocumentIndexer> indexers = documentIndexers.get(id);
        DocumentIndexer indexer = indexers.get(random.nextInt(indexers.size()));
        // rescale the requested page number for the individual shard
        SearchDocumentCommand sdc2 = copy(sdc);
        sdc2.setPage(sdc.getPage() / nrOfSearchers);
        Page res = indexer.search(sdc2);   // search with the per-shard copy, not the original command
        totalItems += res.getTotalItems();
        docs.addAll(res);
        }

    if (sdc.getComparator() != null)
        {
        Collections.sort(docs, sdc.getComparator());
        }

    docs.setTotalItems(totalItems);

    return docs;
    }

This is my RaidedDocumentIndexer, which wraps a set of DocumentIndexers. I
switch back and forth between Solr and raw Lucene, benchmarking and comparing
stuff, so I have two implementations of DocumentIndexer (SolrDocumentIndexer
and LuceneDocumentIndexer) to make the switch easy.

I think this approach is

Re: How Special Character '&' used in indexing

2008-05-09 Thread Shalin Shekhar Mangar
You need to XML encode special characters. Use &amp; instead of &.
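For example, a client that builds the update XML by hand could escape the field value first; this is just plain string replacement (any XML library's built-in escaping would do the same job):

// Minimal XML escaping for text placed inside a <field> element.
public class XmlEscape {
    public static String escape(String s) {
        return s.replace("&", "&amp;")   // must come first, or it re-escapes the others
                .replace("<", "&lt;")
                .replace(">", "&gt;");
    }

    public static void main(String[] args) {
        // "A & K Inc" becomes "A &amp; K Inc" before it is posted to Solr.
        System.out.println("<field name=\"name\">" + escape("A & K Inc") + "</field>");
    }
}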

On Fri, May 9, 2008 at 12:07 PM, Ricky Martin <[EMAIL PROTECTED]>
wrote:

> Hello,
>
> I have a field name A & K Inc, which i
> cannot
> parse using XML to POST data to solr. When i search using A & K, i should
> be
> getting the exactly this field name.
>
> Please someone help me with this ASAP.
>
> Thanks,
> Ricky.
>



-- 
Regards,
Shalin Shekhar Mangar.


Re: Solr feasibility with terabyte-scale data

2008-05-09 Thread Marcus Herou
Cool.

Since you must certainly already have a good partitioning scheme, could you
elaborate on high level how you set this up ?

I'm certain that I will shoot myself in the foot both once and twice before
getting it right but this is what I'm good at; to never stop trying :)
However it is nice to start playing at least on the right side of the
football field so a little push in the back would be really helpful.

Kindly

//Marcus



On Fri, May 9, 2008 at 9:36 AM, James Brady <[EMAIL PROTECTED]>
wrote:

> Hi, we have an index of ~300GB, which is at least approaching the ballpark
> you're in.
>
> Lucky for us, to coin a phrase we have an 'embarassingly partitionable'
> index so we can just scale out horizontally across commodity hardware with
> no problems at all. We're also using the multicore features available in
> development Solr version to reduce granularity of core size by an order of
> magnitude: this makes for lots of small commits, rather than few long ones.
>
> There was mention somewhere in the thread of document collections: if
> you're going to be filtering by collection, I'd strongly recommend
> partitioning too. It makes scaling so much less painful!
>
> James
>
>
> On 8 May 2008, at 23:37, marcusherou wrote:
>
>
>> Hi.
>>
>> I will as well head into a path like yours within some months from now.
>> Currently I have an index of ~10M docs and only store id's in the index
>> for
>> performance and distribution reasons. When we enter a new market I'm
>> assuming we will soon hit 100M and quite soon after that 1G documents.
>> Each
>> document have in average about 3-5k data.
>>
>> We will use a GlusterFS installation with RAID1 (or RAID10) SATA
>> enclosures
>> as shared storage (think of it as a SAN or shared storage at least, one
>> mount point). Hope this will be the right choice, only future can tell.
>>
>> Since we are developing a search engine I frankly don't think even having
>> 100's of SOLR instances serving the index will cut it performance wise if
>> we
>> have one big index. I totally agree with the others claiming that you most
>> definitely will go OOE or hit some other constraints of SOLR if you must
>> have the whole result in memory sort it and create a xml response. I did
>> hit
>> such constraints when I couldn't afford the instances to have enough
>> memory
>> and I had only 1M of docs back then. And think of it... Optimizing a TB
>> index will take a long long time and you really want to have an optimized
>> index if you want to reduce search time.
>>
>> I am thinking of a sharding solution where I fragment the index over the
>> disk(s) and let each SOLR instance only have little piece of the total
>> index. This will require a master database or namenode (or simpler just a
>> properties file in each index dir) of some sort to know what docs is
>> located
>> on which machine or at least how many docs each shard have. This is to
>> ensure that whenever you introduce a new SOLR instance with a new shard
>> the
>> master indexer will know what shard to prioritize. This is probably not
>> enough either since all new docs will go to the new shard until it is
>> filled
>> (have the same size as the others) only then will all shards receive docs
>> in
>> a loadbalanced fashion. So whenever you want to add a new indexer you
>> probably need to initiate a "stealing" process where it steals docs from
>> the
>> others until it reaches some sort of threshold (10 servers = each shard
>> should have 1/10 of the docs or such).
>>
>> I think this will cut it and enabling us to grow with the data. I think
>> doing a distributed reindexing will as well be a good thing when it comes
>> to
>> cutting both indexing and optimizing speed. Probably each indexer should
>> buffer it's shard locally on RAID1 SCSI disks, optimize it and then just
>> copy it to the main index to minimize the burden of the shared storage.
>>
>> Let's say the indexing part will be all fancy and working i TB scale now
>> we
>> come to searching. I personally believe after talking to other guys which
>> have built big search engines that you need to introduce a controller like
>> searcher on the client side which itself searches in all of the shards and
>> merges the response. Perhaps Distributed Solr solves this and will love to
>> test it whenever my new installation of servers and enclosures is
>> finished.
>>
>> Currently my idea is something like this.
>> public Page search(SearchDocumentCommand sdc)
>>   {
>>   Set ids = documentIndexers.keySet();
>>   int nrOfSearchers = ids.size();
>>   int totalItems = 0;
>>   Page docs = new Page(sdc.getPage(), sdc.getPageSize());
>>   for (Iterator iterator = ids.iterator();
>> iterator.hasNext();)
>>   {
>>   Integer id = iterator.next();
>>   List indexers = documentIndexers.get(id);
>>   DocumentIndexer indexer =
>> indexers.get(random.nextInt(indexers.size()));
>>   SearchDocumentCommand sdc2 = copy(sdc);
>>   

Weird problems with document size

2008-05-09 Thread Andrew Savory
Hi,

I'm trying to debug a misbehaving solr search setup. Here's the scenario:

- custom index client that posts insert/delete events to solr via http;
- custom content handlers in solr;
- tcpmon in the middle to see what's going on

When I post an add event to solr of less than about 5k, everything works:





When I post a larger event, it goes wrong. The response from solr is a
500 server error (text of which is below).

The content should be good - it's lorem ipsum.
The tomcat server has maxPostSize disabled
The solr config has field size set to a large number (and we've tested
with several big fields less than the limit, as well as one big field
- anything over 5k trips it regardless of how the data is stored in
the fields)

I've also tried pushing the same content using the command line and
curl - with the same result.

At this point I'm baffled - any suggestions?


Those pesky errors:

java.io.EOFException: no more data available - expected end tags
 to close start tag
 from line 1 and start tag  from line 1 and
start tag  from line 1, parser stopped on START_TAG seen
...ipsum\tDolor sit amet\tlorem ipsum\tfoo\tbar... @1:8192
at org.xmlpull.mxp1.MXParser.fillBuf(MXParser.java:3015)
at org.xmlpull.mxp1.MXParser.more(MXParser.java:3026)
at org.xmlpull.mxp1.MXParser.nextImpl(MXParser.java:1384)
at org.xmlpull.mxp1.MXParser.next(MXParser.java:1093)
at org.xmlpull.mxp1.MXParser.nextText(MXParser.java:1058)
at 
org.apache.solr.handler.XmlUpdateRequestHandler.readDoc(XmlUpdateRequestHandler.java:332)
at 
org.apache.solr.handler.XmlUpdateRequestHandler.update(XmlUpdateRequestHandler.java:162)
at 
org.apache.solr.handler.XmlUpdateRequestHandler.doLegacyUpdate(XmlUpdateRequestHandler.java:355)
at 
org.apache.solr.servlet.SolrUpdateServlet.doPost(SolrUpdateServlet.java:58)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:710)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:803)
at 
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:290)
at 
org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:185)
at 
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
at 
org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
at 
org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
at 
org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:175)
at 
org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128)
at 
org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
at 
org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
at 
org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:286)
at 
org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:844)
at 
org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:583)
at 
org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:447)
at java.lang.Thread.run(Thread.java:619)


HTTP/1.1 500 Internal Server Error
Server: Apache-Coyote/1.1
Content-Type: text/html;charset=utf-8
Content-Length: 2509
Date: Fri, 09 May 2008 10:03:46 GMT
Connection: close



Apache Tomcat/6.0.16 - Error report
HTTP Status 500 - type: Exception report
description: The server encountered an internal error () that prevented it from
fulfilling this request.
exception: java.lang.RuntimeException:
org.xmlpull.v1.XmlPullParserException: only whitespace content allowed
before start tag and not t (position: START_DOCUMENT seen t... @1:1)

com.wiley.wps.search.common.impl.servlet.SolrServletConfig.process(SolrServletConfig.java:106)

com.wiley.wps.search.common.impl.servlet.SolrServletConfig.doPost(SolrServletConfig.java:63)
javax.servlet.http.HttpServlet.service(HttpServlet.java:710)
javax.servlet.http.HttpServlet.service(HttpServlet.java:803)

org.apache.solr.servlet.SolrDispatchFilter.doFilter(Sol

Multilingual Search

2008-05-09 Thread Sachit P. Menon
Can we have a multilingual search using Solr?

 

Thanks and Regards

Sachit P. Menon | Programmer Analyst | MindTree Ltd. | West Campus, Phase-1,
Global Village, RVCE Post, Mysore Road, Bangalore-560 059, INDIA | Voice +91
80 26264000 | Extn 65377 | Fax +91 80 26264100 | Mob: +91 9986747356 | www.mindtree.com

 





Re: Multilingual Search

2008-05-09 Thread Grant Ingersoll
Yes.  Solr handles UTF-8 and has many analyzers for non-English  
languages.


-Grant

On May 9, 2008, at 7:23 AM, Sachit P. Menon wrote:


Can we have a multilingual search using Solr



Thanks and Regards

Sachit P. Menon| Programmer Analyst| MindTree Ltd. |West Campus,  
Phase-1,
Global Village, RVCE Post, Mysore Road, Bangalore-560 059, INDIA | 
Voice +91

80 26264000 |Extn  65377|Fax +91 80 26264100 | Mob : +91
9986747356|www.mindtree.com
  |







--
Grant Ingersoll

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ








Loading performance slowdown at ~ 400K documents

2008-05-09 Thread Tracy Flynn

Hi,

I'm starting to see a significant slowdown in loading performance after
I have loaded about 400K documents. I go from a load rate of nearly 40
docs/sec to 20-25 docs a second.


Am I correct in assuming that, during indexing operations, Lucene/Solr
tries to hold as much of the indexes in memory as possible? If so,
does the slowdown indicate a need to increase JVM heap space?


Any ideas / help would be appreciated

Regards,

Tracy

-

Details

Documents loaded as XML via POST command in batches of 1000, commit  
after each batch


Total current documents ~ 450,000
Avg document size: 4KB
One indexed text field contains 3KB or so. (body field below -  
standard type 'text')


Dual XEON 3 GHZ 4 GB memory

SOLR JVM Startup options

java -Xms256m -Xmx1000m  -jar start.jar


Relevant portion of the schema follows


   stored="true" required="true"/>
   required="false"/>
   required="false"/>
   
   stored="true" required="false" default="0"/>
   stored="true" required="true"/>
   required="false"/>
   required="false" compressed="true"/>
   required="false"/>
   stored="true" required="false" default="0"/>
   required="false"/>
   required="false" default="0"/>
   stored="true" required="false" default="0"/>
   required="false" default="0"/>
   required="false"/>
   required="false"/>
   stored="true" required="false" multiValued="true"/>
   required="false" default="0"/>
   stored="true" required="false" default="0"/>
   stored="true" required="false"/>
   stored="true" required="false" multiValued="true"/>
   stored="true" required="false"/>
   required="false" default="0"/>
   stored="true" required="false"/>
   stored="true" required="false" default="0"/>
   stored="true" required="false"/>
   indexed="true" stored="true" required="false"/>
   indexed="true" stored="true" required="false"/>


   stored="true" required="false" />





Re: Loading performance slowdown at ~ 400K documents

2008-05-09 Thread Nick Jenkin
Hi Tracy
Do you have autocommit enabled (or are you manually committing every
few thousand docs)?
If not, try that.
-Nick
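If you go the manual route, the idea is just an explicit <commit/> POSTed to the update handler every few thousand documents rather than after every batch; a sketch, with the URL and threshold as placeholders:

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

// Sketch: send an explicit <commit/> to Solr's update handler every few
// thousand documents instead of after every batch.
public class PeriodicCommitter {
    private static final int DOCS_PER_COMMIT = 5000;  // example threshold
    private int docsSinceCommit = 0;

    public void docAdded() throws Exception {
        if (++docsSinceCommit >= DOCS_PER_COMMIT) {
            post("http://localhost:8983/solr/update", "<commit/>");
            docsSinceCommit = 0;
        }
    }

    private void post(String url, String body) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setDoOutput(true);
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "text/xml; charset=utf-8");
        OutputStream out = conn.getOutputStream();
        out.write(body.getBytes("UTF-8"));
        out.close();
        conn.getResponseCode();   // read the status so the request completes
        conn.disconnect();
    }
}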
On 5/10/08, Tracy Flynn <[EMAIL PROTECTED]> wrote:
> Hi,
>
>  I'm starting to see significant slowdown in loading performance after I
> have loaded about 400K documents.  I go from a load rate of near 40 docs/sec
> to 20- 25 docs a second.
>
>  Am I correct in assuming that, during indexing operations, Lucene/SOLR
> tries to hold as much of the indexex in memory as possible? If so, does the
> slowdown indicate need to increase JVM heap space?
>
>  Any ideas / help would be appreciated
>
>  Regards,
>
>  Tracy
>
> -
>
>  Details
>
>  Documents loaded as XML via POST command in batches of 1000, commit after
> each batch
>
>  Total current documents ~ 450,000
>  Avg document size: 4KB
>  One indexed text field contains 3KB or so. (body field below - standard
> type 'text')
>
>  Dual XEON 3 GHZ 4 GB memory
>
>  SOLR JVM Startup options
>
>  java -Xms256m -Xmx1000m  -jar start.jar
>
>
>  Relevant portion of the schema follows
>
>
> required="true"/>
> required="false"/>
> required="false"/>
>
> required="false" default="0"/>
> required="true"/>
> required="false"/>
> required="false" compressed="true"/>
> required="false"/>
> stored="true" required="false" default="0"/>
> required="false"/>
> required="false" default="0"/>
> required="false" default="0"/>
> required="false" default="0"/>
> required="false"/>
> required="false"/>
> required="false" multiValued="true"/>
> required="false" default="0"/>
> required="false" default="0"/>
> required="false"/>
> required="false" multiValued="true"/>
> required="false"/>
> required="false" default="0"/>
> required="false"/>
> required="false" default="0"/>
> stored="true" required="false"/>
> stored="true" required="false"/>
> type="string" indexed="true" stored="true" required="false"/>
> 
> required="false" />
>
>
>


Solr hardware specs

2008-05-09 Thread dudes dudes

Hello, 

Can someone kindly advise me on hardware specs (CPU/HDD/RAM) to install Solr on
a production server? We are planning to have it
on Debian. Also, what network connectivity does it require (incoming and
outgoing)?

Thanks for your time.
ak 
_
Great deals on almost anything at eBay.co.uk. Search, bid, find and win on eBay 
today!
http://clk.atdmt.com/UKM/go/msnnkmgl001004ukm/direct/01/

Re: Solr hardware specs

2008-05-09 Thread Nick Jenkin
Hi
It all depends on the load your server is under, how many documents
you have etc. -- I am not sure what you mean by network connectivity
-- solr really should not be run on a publicly accessible IP address.

Can you provide some more info on the setup?
-Nick
On 5/10/08, dudes dudes <[EMAIL PROTECTED]> wrote:
>
>  Hello,
>
>  Can someone kindly advice me on hardware specs (CPU/HHD/RAM) to install solr 
> on a production server ? We are planning to have it
>  on Debian. Also what network connectivities does it require (incoming and 
> outgoing)?
>
>  Thanks fr your time.
>  ak
>  _
>  Great deals on almost anything at eBay.co.uk. Search, bid, find and win on 
> eBay today!
>  http://clk.atdmt.com/UKM/go/msnnkmgl001004ukm/direct/01/


RE: Solr hardware specs

2008-05-09 Thread dudes dudes

Hi Nick,

I'm quite new to Solr, so excuse my ignorance for any Solr-related settings :).
We think we would have up to 400K docs in a heavily loaded environment. We surely
don't want Solr to be publicly
accessible (it's just for internal use). We are not sure whether having 2
network interfaces would help with the speed issues we have
due to the site location of the company.

thanks, 
ak

> Date: Sat, 10 May 2008 00:13:49 +1200
> From: [EMAIL PROTECTED]
> To: solr-user@lucene.apache.org
> Subject: Re: Solr hardware specs
> 
> Hi
> It all depends on the load your server is under, how many documents
> you have etc. -- I am not sure what you mean by network connectivity
> -- solr really should not be run on a publicly accessible IP address.
> 
> Can you provide some more info on the setup?
> -Nick
> On 5/10/08, dudes dudes  wrote:
>>
>>  Hello,
>>
>>  Can someone kindly advice me on hardware specs (CPU/HHD/RAM) to install 
>> solr on a production server ? We are planning to have it
>>  on Debian. Also what network connectivities does it require (incoming and 
>> outgoing)?
>>
>>  Thanks fr your time.
>>  ak
>>  _
>>  Great deals on almost anything at eBay.co.uk. Search, bid, find and win on 
>> eBay today!
>>  http://clk.atdmt.com/UKM/go/msnnkmgl001004ukm/direct/01/

_
Be a Hero and Win with Iron Man
http://clk.atdmt.com/UKM/go/msnnkmgl001009ukm/direct/01/

Re: Unlimited number of return documents?

2008-05-09 Thread Francisco Sanmartin
Yeah, I understand the possible problems of changing this value. It's 
just a very particular case and there won't be a lot of documents to 
return. I guess I'll have to use a very high int number, I just wanted 
to know if there was any "proper" configuration for this situation.


Thanks for the answer!

Pako


Otis Gospodnetic wrote:

Will something a la rows= work? ;) But are you sure you want to 
do that?  It could be sloow.


Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


- Original Message 
  

From: Francisco Sanmartin <[EMAIL PROTECTED]>
To: solr-user@lucene.apache.org
Sent: Thursday, May 8, 2008 4:18:46 PM
Subject: Unlimited number of return documents?

What is the value to set for "rows" in solrconfig.xml in order not to have any
limitation on the number of returned documents? I've tried with "-1" and "0"
but no luck...


solr 0 
name="rows">*10* 


I want solr to return all available documents by default.

Thanks!

Pako





  




Re: Unlimited number of return documents?

2008-05-09 Thread Erik Hatcher
Or make two requests...  one with rows=0 to see how many documents  
match without retrieving any, then another with that amount specified.


Erik
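A sketch of that two-step approach against the select handler; the query and the regex-based parsing of numFound are simplified for illustration:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// First ask for rows=0 just to learn numFound, then repeat the query asking
// for exactly that many rows.
public class FetchAll {
    public static void main(String[] args) throws Exception {
        String base = "http://localhost:8983/solr/select?q=solr";

        String countResponse = get(base + "&rows=0");
        Matcher m = Pattern.compile("numFound=\"(\\d+)\"").matcher(countResponse);
        if (!m.find()) throw new IllegalStateException("numFound not in response");
        long numFound = Long.parseLong(m.group(1));

        String everything = get(base + "&rows=" + numFound);
        System.out.println("fetched " + numFound + " documents");
    }

    private static String get(String url) throws Exception {
        BufferedReader in = new BufferedReader(
                new InputStreamReader(new URL(url).openStream(), "UTF-8"));
        StringBuilder sb = new StringBuilder();
        for (String line; (line = in.readLine()) != null; ) sb.append(line).append('\n');
        in.close();
        return sb.toString();
    }
}

The caveat Otis gave still applies: asking for everything in one response can be very slow for large result sets.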


On May 9, 2008, at 8:54 AM, Francisco Sanmartin wrote:
Yeah, I understand the possible problems of changing this value.  
It's just a very particular case and there won't be a lot of  
documents to return. I guess I'll have to use a very high int  
number, I just wanted to know if there was any "proper"  
configuration for this situation.


Thanks for the answer!

Pako


Otis Gospodnetic wrote:
Will something a la rows= work? ;) But are you sure  
you want to do that?  It could be sloow.



Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


- Original Message 


From: Francisco Sanmartin <[EMAIL PROTECTED]>
To: solr-user@lucene.apache.org
Sent: Thursday, May 8, 2008 4:18:46 PM
Subject: Unlimited number of return documents?

What is the value to set to "rows" in solrconfig.xml in order not  
to have any limitation about the number of returned documents?  
I've tried with "-1" and "0" but not luck...


solr 0 name="rows">*10*
I want solr to return all available documents by default.

Thanks!

Pako










Re: Solr hardware specs

2008-05-09 Thread Erick Erickson
This still isn't very helpful. How big are the docs? How many fields do you
expect to index? What is your expected query rate?

You can get away with an old laptop if your docs are, say, 5K each and you
only
expect to query it once a day and have one text field.

If each doc is 10M, you're indexing 250 fields and your expected
query rate is 100/sec, you need some serious hardware.

But 400K docs isn't very big by SOLR standards in terms of the number
of docs.

What I'd really recommend is that you just take an existing machine, create
an
index on it and measure. Be aware that the first few queries will be much
slower
than subsequent queries, so throw out the first few queries from your
timings.

Best
Erick
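A crude way to do that measurement: replay a handful of queries over HTTP, throw away the first few as cache warm-up, and average the rest. The URL and query strings below are placeholders:

import java.io.InputStream;
import java.net.URL;

// Crude latency probe: discard the first few (cold-cache) queries, average the rest.
public class QueryTimer {
    public static void main(String[] args) throws Exception {
        String[] queries = { "title:solr", "body:lucene", "body:index" };  // placeholders
        int warmup = 5, measured = 0;
        long totalMs = 0;

        for (int i = 0; i < 50; i++) {
            String q = queries[i % queries.length];
            long start = System.currentTimeMillis();
            drain(new URL("http://localhost:8983/solr/select?q=" + q).openStream());
            long elapsed = System.currentTimeMillis() - start;
            if (i >= warmup) {            // throw out the first few queries
                totalMs += elapsed;
                measured++;
            }
        }
        System.out.println("average after warm-up: " + (totalMs / measured) + " ms");
    }

    private static void drain(InputStream in) throws Exception {
        byte[] buf = new byte[8192];
        while (in.read(buf) != -1) { /* discard */ }
        in.close();
    }
}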

On Fri, May 9, 2008 at 8:27 AM, dudes dudes <[EMAIL PROTECTED]> wrote:

>
> HI Nick,
>
> I'm quite new to solr, so excuse my ignorance for any solr related settings
> :).
> We think that would have up to 400K docs in a loady environment. We surely
> don't want to have solr to be publicly
> accessible ( Just for the internal use). We are not sure if we could have 2
> network interfaces will help the speed issues that we have
> due to side location of the company.
>
> thanks,
> ak
> 
> > Date: Sat, 10 May 2008 00:13:49 +1200
> > From: [EMAIL PROTECTED]
> > To: solr-user@lucene.apache.org
> > Subject: Re: Solr hardware specs
> >
> > Hi
> > It all depends on the load your server is under, how many documents
> > you have etc. -- I am not sure what you mean by network connectivity
> > -- solr really should not be run on a publicly accessible IP address.
> >
> > Can you provide some more info on the setup?
> > -Nick
> > On 5/10/08, dudes dudes  wrote:
> >>
> >>  Hello,
> >>
> >>  Can someone kindly advice me on hardware specs (CPU/HHD/RAM) to install
> solr on a production server ? We are planning to have it
> >>  on Debian. Also what network connectivities does it require (incoming
> and outgoing)?
> >>
> >>  Thanks fr your time.
> >>  ak
> >>  _
> >>  Great deals on almost anything at eBay.co.uk. Search, bid, find and
> win on eBay today!
> >>  http://clk.atdmt.com/UKM/go/msnnkmgl001004ukm/direct/01/
>
> _
> Be a Hero and Win with Iron Man
> http://clk.atdmt.com/UKM/go/msnnkmgl001009ukm/direct/01/


Re: How Special Character '&' used in indexing

2008-05-09 Thread Ricky
I have tried sending '&amp' instead of '&', like the following:
A &amp K Inc.

But I still get the same error, "entity reference name can not contain
character ...", for A &amp ..

Please kindly reply ASAP.


Thanks,
Ricky

On Fri, May 9, 2008 at 3:48 AM, Shalin Shekhar Mangar <
[EMAIL PROTECTED]> wrote:

> You need to XML encode special characters. Use & instead of &.
>
> On Fri, May 9, 2008 at 12:07 PM, Ricky Martin <[EMAIL PROTECTED]>
> wrote:
>
> > Hello,
> >
> > I have a field name A & K Inc, which i
> > cannot
> > parse using XML to POST data to solr. When i search using A & K, i should
> > be
> > getting the exactly this field name.
> >
> > Please someone help me with this ASAP.
> >
> > Thanks,
> > Ricky.
> >
>
>
>
> --
> Regards,
> Shalin Shekhar Mangar.
>


Re: How Special Character '&' used in indexing

2008-05-09 Thread Alan Rykhus
&amp;

you're missing the ;



On Fri, 2008-05-09 at 08:26 -0500, Ricky wrote:
> I have tried sending the '&' instead of '&' like the following,
> A & K Inc.
> 
> But i still get the same error ""entity reference name can not contain
> character  ' A & ..
> 
> Please kindly reply ASAP.
> 
> 
> Thanks,
> Ricky
> 
> On Fri, May 9, 2008 at 3:48 AM, Shalin Shekhar Mangar <
> [EMAIL PROTECTED]> wrote:
> 
> > You need to XML encode special characters. Use & instead of &.
> >
> > On Fri, May 9, 2008 at 12:07 PM, Ricky Martin <[EMAIL PROTECTED]>
> > wrote:
> >
> > > Hello,
> > >
> > > I have a field name A & K Inc, which i
> > > cannot
> > > parse using XML to POST data to solr. When i search using A & K, i should
> > > be
> > > getting the exactly this field name.
> > >
> > > Please someone help me with this ASAP.
> > >
> > > Thanks,
> > > Ricky.
> > >
> >
> >
> >
> > --
> > Regards,
> > Shalin Shekhar Mangar.
> >
-- 
Alan Rykhus
PALS, A Program of the Minnesota State Colleges and Universities 
(507)389-1975
[EMAIL PROTECTED]



Re: How Special Character '&' used in indexing

2008-05-09 Thread Erick Erickson
I don't see a semi-colon at the end of your entity reference, is that a
typo?
i.e. &amp;


On Fri, May 9, 2008 at 9:26 AM, Ricky <[EMAIL PROTECTED]> wrote:

> I have tried sending the '&' instead of '&' like the following,
> A & K Inc.
>
> But i still get the same error ""entity reference name can not contain
> character  ' A & ..
>
> Please kindly reply ASAP.
>
>
> Thanks,
> Ricky
>
> On Fri, May 9, 2008 at 3:48 AM, Shalin Shekhar Mangar <
> [EMAIL PROTECTED]> wrote:
>
> > You need to XML encode special characters. Use & instead of &.
> >
> > On Fri, May 9, 2008 at 12:07 PM, Ricky Martin <[EMAIL PROTECTED]
> >
> > wrote:
> >
> > > Hello,
> > >
> > > I have a field name A & K Inc, which i
> > > cannot
> > > parse using XML to POST data to solr. When i search using A & K, i
> should
> > > be
> > > getting the exactly this field name.
> > >
> > > Please someone help me with this ASAP.
> > >
> > > Thanks,
> > > Ricky.
> > >
> >
> >
> >
> > --
> > Regards,
> > Shalin Shekhar Mangar.
> >
>


Re: How Special Character '&' used in indexing

2008-05-09 Thread Ricky
Thanks all,

I got it, it's &amp;

/Ricky

On Fri, May 9, 2008 at 9:38 AM, Erick Erickson <[EMAIL PROTECTED]>
wrote:

> I don't see a semi-colon at the end of your entity reference, is that a
> typo?
> i.e. &
>
>
> On Fri, May 9, 2008 at 9:26 AM, Ricky <[EMAIL PROTECTED]> wrote:
>
> > I have tried sending the '&' instead of '&' like the following,
> > A & K Inc.
> >
> > But i still get the same error ""entity reference name can not contain
> > character  ' A &
> ..
> >
> > Please kindly reply ASAP.
> >
> >
> > Thanks,
> > Ricky
> >
> > On Fri, May 9, 2008 at 3:48 AM, Shalin Shekhar Mangar <
> > [EMAIL PROTECTED]> wrote:
> >
> > > You need to XML encode special characters. Use & instead of &.
> > >
> > > On Fri, May 9, 2008 at 12:07 PM, Ricky Martin <
> [EMAIL PROTECTED]
> > >
> > > wrote:
> > >
> > > > Hello,
> > > >
> > > > I have a field name A & K Inc, which i
> > > > cannot
> > > > parse using XML to POST data to solr. When i search using A & K, i
> > should
> > > > be
> > > > getting the exactly this field name.
> > > >
> > > > Please someone help me with this ASAP.
> > > >
> > > > Thanks,
> > > > Ricky.
> > > >
> > >
> > >
> > >
> > > --
> > > Regards,
> > > Shalin Shekhar Mangar.
> > >
> >
>


Missing content Stream

2008-05-09 Thread Ricky
Hello,

I am a newbie to Solr and am trying to learn it now. I have downloaded the
apache-solr 1.2.0.zip file and have tried the examples in the exampledocs of
Solr 1.2. The XML file examples are working fine; I am able to index them as well.
But I could not get a result for the CSV file, i.e. books.csv. I am getting the
error

org.apache.solr.core.SolrException: missing content stream
at
org.apache.solr.handler.CSVRequestHandler.handleRequestBody(CSVRequestHandler.java:50)

.. ...
...
and I get the contents of the CSV file on the console (where I run the
solr start.jar file; I am using the Windows XP/DOS prompt)

What i have done is,

Changed the following in solrconfig.xml,



Used the following to send CSV data to solr,
c:/solr1.2/example/exampledocs>curl
http://localhost:8983/solr/update/csv--data-binary @books.csv -H
'Content-type:text/plain; charset=utf-8'


Please give me a quick reply as I am planning to use solr in my present
project.

Thanks in Advance,
Ricky


Re: Missing content Stream

2008-05-09 Thread Ryan McKinley

make sure you are following all the directions on:
http://wiki.apache.org/solr/UpdateCSV

in particular check "Methods of uploading CSV records"
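For comparison, the curl --data-binary form used in the question is just a raw POST of the file body; a Java equivalent of that same request might look like the sketch below, with the URL and file name taken from the example above:

import java.io.FileInputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

// Send books.csv as the raw POST body to the CSV update handler.
public class CsvUpload {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://localhost:8983/solr/update/csv");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setDoOutput(true);
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "text/plain; charset=utf-8");

        FileInputStream file = new FileInputStream("books.csv");
        OutputStream out = conn.getOutputStream();
        byte[] buf = new byte[8192];
        for (int n; (n = file.read(buf)) != -1; ) out.write(buf, 0, n);
        file.close();
        out.close();

        System.out.println("HTTP " + conn.getResponseCode());  // 200 means the stream was received
        conn.disconnect();
    }
}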


On May 9, 2008, at 9:58 AM, Ricky wrote:

Hello,

Am a newbie to SOLR. I am trying to learn it now. i have downloaded
apache-solr 1.2.0.zip file. I have tried the examples in the  
exampledocs of
solr 1.2. The xml file examples are working fine. Able to index them  
also.
But i could not get the result for csv file i.e books.csv. I am  
getting the

error  *
org.apache.solr.core.SolrException: missing content stream
at
org 
.apache 
.solr 
.handler.CSVRequestHandler.handleRequestBody(CSVRequesthandler.java: 
50)

*
.. ...
...
and getting the the contents of the csv file on the console (where i  
run

solr start.jar file(Am using Windows XP/DOS prompt))

What i have done is,

Changed the following in solrconfig.xml,
multipartUploadLimitInKB="2048"

/>


Used the following to send CSV data to solr,
c:/solr1.2/example/exampledocs>*curl
http://localhost:8983/solr/update/csv--data-binary @books.csv -H
'Content-type:text/plain; charset=utf-8'


*Please give me a quick reply as am planning to use solr in my present
project.

Thanks in Advance,
Ricky




Solr Multicore, are there any way to retrieve all the cores registered?

2008-05-09 Thread Walter Ferrara
In Solr, latest trunk version in svn, is it possible to access the "core
registry", or what used to be the static MultiCore object? My goal is to
retrieve all the cores registered in a given (multicore) environment.
It used to be MultiCore.getRegistry() initially, at the first stages of
SOLR-350; but now MultiCore is no longer static, and I can't find any
reference to where to pick up the multicore object initialized from
multicore.xml.


any tips on how to retrieve such information now?
Walter



Re: Missing content Stream

2008-05-09 Thread Ricky
Yes, I have followed the directions on http://wiki.apache.org/solr/UpdateCSV.

I am learning Solr from the mentioned webpage.

Could it be a problem with curl?

/Rickey

On Fri, May 9, 2008 at 10:15 AM, Ryan McKinley <[EMAIL PROTECTED]> wrote:

> make sure you are following all the directions on:
> http://wiki.apache.org/solr/UpdateCSV
>
> in particular check "Methods of uploading CSV records"
>
>
>
> On May 9, 2008, at 9:58 AM, Ricky wrote:.
>
>> Hello,
>>
>> Am a newbie to SOLR. I am trying to learn it now. i have downloaded
>> apache-solr 1.2.0.zip file. I have tried the examples in the exampledocs
>> of
>> solr 1.2. The xml file examples are working fine. Able to index them also.
>> But i could not get the result for csv file i.e books.csv. I am getting
>> the
>> error  *
>> org.apache.solr.core.SolrException: missing content stream
>> at
>>
>> org.apache.solr.handler.CSVRequestHandler.handleRequestBody(CSVRequesthandler.java:50)
>> *
>> .. ...
>> ...
>> and getting the the contents of the csv file on the console (where i run
>> solr start.jar file(Am using Windows XP/DOS prompt))
>>
>> What i have done is,
>>
>> Changed the following in solrconfig.xml,
>> > multipartUploadLimitInKB="2048"
>> />
>> > startup="lazy" />
>>
>> Used the following to send CSV data to solr,
>> c:/solr1.2/example/exampledocs>*curl
>> http://localhost:8983/solr/update/csv--data-binary @books.csv -H
>> 'Content-type:text/plain; charset=utf-8'
>>
>>
>> *Please give me a quick reply as am planning to use solr in my present
>> project.
>>
>> Thanks in Advance,
>> Ricky
>>
>
>


Re: Solr Multicore, are there any way to retrieve all the cores registered?

2008-05-09 Thread Ryan McKinley

check the "status" action

also, check the index.jsp page

(i don't have the code in front of me)


On May 9, 2008, at 10:16 AM, Walter Ferrara wrote:
In solr, last trunk version in svn, is it possible to access the  
"core registry", or what used to be the static MultiCore object? My  
goal is to retrieve all the cores registered in a given (multicore)  
enviroment.
It used to be MultiCore.getRegistry() initially, at first stages of  
solr-350; but now MultiCore is static no more, and I don't find any  
reference to where pick up the multicore object initialized from  
multicore.xml.


any tips on how to retrieve such information now?
Walter





JSON updates?

2008-05-09 Thread kirk beers
Hi folks,

I was wondering: is XML the only format for updating Solr documents,
or can JSON or Ruby be used as well?

K


Re: Solr Multicore, are there any way to retrieve all the cores registered?

2008-05-09 Thread Walter Ferrara

Ryan McKinley wrote:

check the "status" action

also, check the index.jsp page

index.jsp does:
org.apache.solr.core.MultiCore multicore =
(org.apache.solr.core.MultiCore) request.getAttribute("org.apache.solr.MultiCore");


which is OK in a servlet, but how should I do the same inside a
handler, i.e. having just SolrQueryRequest and SolrQueryResponse? Is it
something that can be extracted from SolrQueryRequest.getContext? And,
looking ahead to Solr 1.3, will this functionality be maintained?


thanks,
Walter



Re: JSON updates?

2008-05-09 Thread Otis Gospodnetic
Hi,

Input is XML only, I believe.  It's the output that can be XML or JSON or...


Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
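On the output side, the format is just a request parameter; for example, if the JSON response writer is registered in solrconfig.xml (it is in the example configuration), a query can ask for wt=json:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

// Ask Solr for JSON instead of XML by adding wt=json to the query.
public class JsonQuery {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://localhost:8983/solr/select?q=solr&wt=json");
        BufferedReader in = new BufferedReader(
                new InputStreamReader(url.openStream(), "UTF-8"));
        for (String line; (line = in.readLine()) != null; ) {
            System.out.println(line);   // a JSON object with responseHeader and response
        }
        in.close();
    }
}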


- Original Message 
> From: kirk beers <[EMAIL PROTECTED]>
> To: solr-user@lucene.apache.org
> Sent: Friday, May 9, 2008 10:59:22 AM
> Subject: JSON updates?
> 
> Hi folks,
> 
> I was wondering if xml is the only format used for updating Solr documents
> or can JSON or Ruby be used as well ?
> 
> K



Re: Weird problems with document size

2008-05-09 Thread Otis Gospodnetic
Andrew,

I don't understand what that lock and unlock is for...
Just do this:
add
add
add
add
...
...
optionally commit or optimize


Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


- Original Message 
> From: Andrew Savory <[EMAIL PROTECTED]>
> To: solr-user@lucene.apache.org
> Sent: Friday, May 9, 2008 6:28:09 AM
> Subject: Weird problems with document size
> 
> Hi,
> 
> I'm trying to debug a misbehaving solr search setup. Here's the scenario:
> 
> - custom index client that posts insert/delete events to solr via http;
> - custom content handlers in solr;
> - tcpmon in the middle to see what's going on
> 
> When I post an add event to solr of less than about 5k, everything works:
> 
> 
> 
> 
> 
> When I post a larger event, it goes wrong. The response from solr is a
> 500 server error (text of which is below).
> 
> The content should be good - it's lorem ipsum.
> The tomcat server has maxPostSize disabled
> The solr config has field size set to a large number (and we've tested
> with several big fields less than the limit, as well as one big field
> - anything over 5k trips it regardless of how the data is stored in
> the fields)
> 
> I've also tried pushing the same content using the command line and
> curl - with the same result.
> 
> At this point I'm baffled - any suggestions?
> 
> 
> Those pesky errors:
> 
> java.io.EOFException: no more data available - expected end tags
>  to close start tag
>  from line 1 and start tag  from line 1 and
> start tag  from line 1, parser stopped on START_TAG seen
> ...ipsum\tDolor sit amet\tlorem ipsum\tfoo\tbar... @1:8192
> at org.xmlpull.mxp1.MXParser.fillBuf(MXParser.java:3015)
> at org.xmlpull.mxp1.MXParser.more(MXParser.java:3026)
> at org.xmlpull.mxp1.MXParser.nextImpl(MXParser.java:1384)
> at org.xmlpull.mxp1.MXParser.next(MXParser.java:1093)
> at org.xmlpull.mxp1.MXParser.nextText(MXParser.java:1058)
> at 
> org.apache.solr.handler.XmlUpdateRequestHandler.readDoc(XmlUpdateRequestHandler.java:332)
> at 
> org.apache.solr.handler.XmlUpdateRequestHandler.update(XmlUpdateRequestHandler.java:162)
> at 
> org.apache.solr.handler.XmlUpdateRequestHandler.doLegacyUpdate(XmlUpdateRequestHandler.java:355)
> at 
> org.apache.solr.servlet.SolrUpdateServlet.doPost(SolrUpdateServlet.java:58)
> at javax.servlet.http.HttpServlet.service(HttpServlet.java:710)
> at javax.servlet.http.HttpServlet.service(HttpServlet.java:803)
> at 
> org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:290)
> at 
> org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
> at 
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:185)
> at 
> org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
> at 
> org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
> at 
> org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
> at 
> org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:175)
> at 
> org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128)
> at 
> org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
> at 
> org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
> at 
> org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:286)
> at 
> org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:844)
> at 
> org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:583)
> at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:447)
> at java.lang.Thread.run(Thread.java:619)
> 
> 
> HTTP/1.1 500 Internal Server Error
> Server: Apache-Coyote/1.1
> Content-Type: text/html;charset=utf-8
> Content-Length: 2509
> Date: Fri, 09 May 2008 10:03:46 GMT
> Connection: close
> 
> 
> 
> Apache Tomcat/6.0.16 - Error report
> HTTP Status 500 - type: Exception report
> description: The server encountered an internal error () that prevented it from
> fulfilling this request.
> exception: java.lang.RuntimeException:
> org.xmlpull.v1.XmlPullParserException: only whitespace content allowed
> before start tag and not t (position: START_DOCUMENT seen t... @1:1)
> 
> com.wiley.wps.search.common.impl.servlet.SolrServletConfig.process(SolrServletConfig.java:106)
> 
> com.wiley.wps.search.common.impl.servlet.SolrServletConfig.doPost(SolrServletConfig.java:63)
> javax.servlet.http.HttpServlet.service(HttpServlet.java:710)
> javax.servlet.http.HttpServlet.service(HttpServlet.java:803)
> 
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:185)
> 
root cause
> 
org.xmlpull.v1.XmlPullParserException:

Re: Solr feasibility with terabyte-scale data

2008-05-09 Thread Otis Gospodnetic
Marcus,

You are headed in the right direction.

We've built a system like this at Technorati (Lucene, not Solr) and had 
components like the "namenode" or "controller" that you mention.  If you look 
at Hadoop project, you will see something similar in concept (NameNode), though 
it deals with raw data blocks, their placement in the cluster, etc.  As a 
matter of fact, I am currently running its "re-balancer" in order to move some 
of the blocks around in the cluster.  That matches what you are describing for 
moving documents from one shard to the other.  Of course, you can simplify 
things and just have this central piece be aware of any new servers and simply 
get it to place any new docs on the new servers and create a new shard there.  
Or you can get fancy and take into consideration the hardware resources - the 
CPU, the disk space, the memory, and use that to figure out how much each 
machine in your cluster can handle and maximize its use based on this 
knowledge. :)

I think Solr and Nutch are in desperate need of this central component (it must
not be a SPOF!) for shard management.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



- Original Message 
> From: marcusherou <[EMAIL PROTECTED]>
> To: solr-user@lucene.apache.org
> Sent: Friday, May 9, 2008 2:37:19 AM
> Subject: Re: Solr feasibility with terabyte-scale data
> 
> 
> Hi.
> 
> I will as well head into a path like yours within some months from now.
> Currently I have an index of ~10M docs and only store id's in the index for
> performance and distribution reasons. When we enter a new market I'm
> assuming we will soon hit 100M and quite soon after that 1G documents. Each
> document have in average about 3-5k data.
> 
> We will use a GlusterFS installation with RAID1 (or RAID10) SATA enclosures
> as shared storage (think of it as a SAN or shared storage at least, one
> mount point). Hope this will be the right choice, only future can tell.
> 
> Since we are developing a search engine I frankly don't think even having
> 100's of SOLR instances serving the index will cut it performance wise if we
> have one big index. I totally agree with the others claiming that you most
> definitely will go OOE or hit some other constraints of SOLR if you must
> have the whole result in memory sort it and create a xml response. I did hit
> such constraints when I couldn't afford the instances to have enough memory
> and I had only 1M of docs back then. And think of it... Optimizing a TB
> index will take a long long time and you really want to have an optimized
> index if you want to reduce search time.
> 
> I am thinking of a sharding solution where I fragment the index over the
> disk(s) and let each SOLR instance only have little piece of the total
> index. This will require a master database or namenode (or simpler just a
> properties file in each index dir) of some sort to know what docs is located
> on which machine or at least how many docs each shard have. This is to
> ensure that whenever you introduce a new SOLR instance with a new shard the
> master indexer will know what shard to prioritize. This is probably not
> enough either since all new docs will go to the new shard until it is filled
> (have the same size as the others) only then will all shards receive docs in
> a loadbalanced fashion. So whenever you want to add a new indexer you
> probably need to initiate a "stealing" process where it steals docs from the
> others until it reaches some sort of threshold (10 servers = each shard
> should have 1/10 of the docs or such).
> 
> I think this will cut it and enabling us to grow with the data. I think
> doing a distributed reindexing will as well be a good thing when it comes to
> cutting both indexing and optimizing speed. Probably each indexer should
> buffer it's shard locally on RAID1 SCSI disks, optimize it and then just
> copy it to the main index to minimize the burden of the shared storage.
> 
> Let's say the indexing part will be all fancy and working i TB scale now we
> come to searching. I personally believe after talking to other guys which
> have built big search engines that you need to introduce a controller like
> searcher on the client side which itself searches in all of the shards and
> merges the response. Perhaps Distributed Solr solves this and will love to
> test it whenever my new installation of servers and enclosures is finished.
> 
> Currently my idea is something like this.
> public Page search(SearchDocumentCommand sdc)
> {
> Set ids = documentIndexers.keySet();
> int nrOfSearchers = ids.size();
> int totalItems = 0;
> Page docs = new Page(sdc.getPage(), sdc.getPageSize());
> for (Iterator iterator = ids.iterator();
> iterator.hasNext();)
> {
> Integer id = iterator.next();
> List indexers = documentIndexers.get(id);
> DocumentIndexer indexer =
> indexers.get(random.nextInt(indexers.size()));
> 

Re: Extending XmlRequestHandler

2008-05-09 Thread Alexander Ramos Jardim
Ok,

Thanks for the advice!

I got the XmlRequestHandler code. I see it uses StAX right on the XML it
gets. There isn't anything to plug in or out to get an easy way to change
the XML format.

So, I am thinking about creating my own RequestHandler, as said already.

Would it be too slow to use an XQuery approach, or would it be OK?
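If the transform-then-post route wins out instead, the JDK's built-in XSLT engine is a cheap thing to benchmark first; a minimal sketch, with the stylesheet and file names as placeholders:

import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import java.io.File;

// Transform a custom XML document into Solr's <add><doc>...</doc></add> format
// with an XSLT stylesheet; the result can then be posted to /solr/update.
public class TransformToSolrXml {
    public static void main(String[] args) throws Exception {
        Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new File("custom-to-solr.xsl")));
        t.transform(new StreamSource(new File("mydocs.xml")),
                    new StreamResult(new File("solr-add.xml")));
        // post solr-add.xml to http://localhost:8983/solr/update afterwards
    }
}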

2008/5/8 Tricia Williams <[EMAIL PROTECTED]>:

> I frequently use the Solr API:
> http://lucene.apache.org/solr/api/index.html
>
> Tricia
>
>
> Alexander Ramos Jardim wrote:
>
>> Sorry for the stupid question, but I could not find Solr API code. Could
>> anyone point me where do I find it?
>>
>> 2008/5/8 Alexander Ramos Jardim <[EMAIL PROTECTED]>:
>>
>>
>>
>>> Nice,
>>> Thank you. I will try this out.
>>>
>>> 2008/5/8 Ryan McKinley <[EMAIL PROTECTED]>:
>>>
>>>
>>>
>>>
 The XML format is fixed, and there is not a good way to change it.  If
 you
 can transform your custom docs via XSLT, down the line this may be
 possible
  (it currently is not)

 If you really need to index your custom XML format, write your own
 RequestHandler modeled on XmlRequestHandler, but (most likely) not
 extending
 it.




 On May 8, 2008, at 5:29 PM, Alexander Ramos Jardim wrote:



> Hello,
>
> I want to know how do I set the xml file format that XmlRequestHandler
> understands. Should I extend it, or it can be done via some
> configuration,
> maybe some xml file describing the template it should understand?
>
> I understand the easiest way to do that is getting the original xml
> file
> and
> converting it the expected format via XQuery or XSLT. After that I
> would
> post the file. I could extend XmlRequestHandler, call the apropriate
> method
> and  run a  the correct method from the original XmlRequestHandler
> right?
>
> --
> Alexander Ramos Jardim
>
>
>


>>> --
>>> Alexander Ramos Jardim
>>>
>>>
>>
>>
>>
>>
>>
>>
>
>


-- 
Alexander Ramos Jardim


Re: Weird problems with document size

2008-05-09 Thread Andrew Savory
Hi,

On 09/05/2008, Otis Gospodnetic <[EMAIL PROTECTED]> wrote:

>  I don't understand what that lock and unlock is for...
>  Just do this:
>  add
>  add
>  add
>  add
>  ...
>  ...
>  optionally commit or optimize

Yeah, I didn't understand what the lock/unlock was for either - but on
further reviewing the code, we have a wrapper around the solr servlet
which does a crude type of locking to ensure only one index updater
process can run at a time. Not sure it's needed, as I'd guess that
solr would handle things gracefully anyway, but it at least stops
multiple index clients firing up.

Meanwhile it seems that these documents can successfully be added to
solr when it is running in jetty, so I'm now trying to find out what
Tomcat is doing to break things.

Thanks for the reply,


Andrew.


Re: Solr feasibility with terabyte-scale data

2008-05-09 Thread James Brady
So our problem is made easier by having complete index
partitionability by a user_id field. That means at one end of the
spectrum we could have one monolithic index for everyone, while at
the other end of the spectrum we could have individual cores for each
user_id.


At the moment, we've gone for a halfway house somewhere in the middle:
I've got several large EC2 instances (currently 3), each running a
single master/slave pair of Solr servers. The servers have several
cores (currently 10 - a guesstimated good number). As new users
register, I automatically distribute them across cores. I would like
to do something with clustering users based on geo-location, so that
cores get 'time off' for maintenance and optimization during that
user cluster's nighttime. I'd also like to move in the one-core-per-user
direction as dynamic core creation becomes available.
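A toy version of that "distribute new users across cores" step, with the assignment remembered so a user always maps back to the same core; the core URLs and the in-memory map are stand-ins for whatever real storage is used:

import java.util.HashMap;
import java.util.Map;

// Round-robin assignment of new users to a fixed set of cores, remembered in a
// map (a real system would persist this in a database).
public class UserCoreAssigner {
    private final String[] coreUrls;                 // e.g. ".../solr/core0" .. ".../solr/core9"
    private final Map<String, String> userToCore = new HashMap<String, String>();
    private int next = 0;

    public UserCoreAssigner(String[] coreUrls) {
        this.coreUrls = coreUrls;
    }

    public synchronized String coreFor(String userId) {
        String core = userToCore.get(userId);
        if (core == null) {                          // new user: hand out the next core
            core = coreUrls[next++ % coreUrls.length];
            userToCore.put(userId, core);
        }
        return core;                                 // both indexing and queries for this user go here
    }
}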


It seems a lot of what you're describing is really similar to  
MapReduce, so I think Otis' suggestion to look at Hadoop is a good  
one: it might prevent a lot of headaches and they've already solved a  
lot of the tricky problems. There a number of ridiculously sized  
projects using it to solve their scale problems, not least Yahoo...


James

On 9 May 2008, at 01:17, Marcus Herou wrote:


Cool.

Since you must certainly already have a good partitioning scheme,  
could you

elaborate on high level how you set this up ?

I'm certain that I will shoot myself in the foot both once and twice  
before

getting it right but this is what I'm good at; to never stop trying :)
However it is nice to start playing at least on the right side of the
football field so a little push in the back would be really helpful.

Kindly

//Marcus



On Fri, May 9, 2008 at 9:36 AM, James Brady <[EMAIL PROTECTED] 
>

wrote:

Hi, we have an index of ~300GB, which is at least approaching the  
ballpark

you're in.

Lucky for us, to coin a phrase we have an 'embarassingly  
partitionable'
index so we can just scale out horizontally across commodity  
hardware with
no problems at all. We're also using the multicore features  
available in
development Solr version to reduce granularity of core size by an  
order of
magnitude: this makes for lots of small commits, rather than few  
long ones.


There was mention somewhere in the thread of document collections: if
you're going to be filtering by collection, I'd strongly recommend
partitioning too. It makes scaling so much less painful!

James


On 8 May 2008, at 23:37, marcusherou wrote:



Hi.

I will as well head into a path like yours within some months from  
now.
Currently I have an index of ~10M docs and only store id's in the  
index

for
performance and distribution reasons. When we enter a new market I'm
assuming we will soon hit 100M and quite soon after that 1G  
documents.

Each
document have in average about 3-5k data.

We will use a GlusterFS installation with RAID1 (or RAID10) SATA
enclosures
as shared storage (think of it as a SAN or shared storage at  
least, one
mount point). Hope this will be the right choice, only future can  
tell.


Since we are developing a search engine I frankly don't think even  
having
100's of SOLR instances serving the index will cut it performance  
wise if

we
have one big index. I totally agree with the others claiming that  
you most
definitely will go OOE or hit some other constraints of SOLR if  
you must
have the whole result in memory sort it and create a xml response.  
I did

hit
such constraints when I couldn't afford the instances to have enough
memory
and I had only 1M of docs back then. And think of it... Optimizing  
a TB
index will take a long long time and you really want to have an  
optimized

index if you want to reduce search time.

I am thinking of a sharding solution where I fragment the index  
over the
disk(s) and let each SOLR instance only have little piece of the  
total
index. This will require a master database or namenode (or simpler  
just a

properties file in each index dir) of some sort to know what docs is
located
on which machine or at least how many docs each shard have. This  
is to
ensure that whenever you introduce a new SOLR instance with a new  
shard

the
master indexer will know what shard to prioritize. This is  
probably not
enough either since all new docs will go to the new shard until it  
is

filled
(have the same size as the others) only then will all shards  
receive docs

in
a loadbalanced fashion. So whenever you want to add a new indexer  
you
probably need to initiate a "stealing" process where it steals  
docs from

the
others until it reaches some sort of threshold (10 servers = each  
shard

should have 1/10 of the docs or such).

I think this will cut it and enabling us to grow with the data. I  
think
doing a distributed reindexing will as well be a good thing when  
it comes

to
cutting both indexing and optimizing speed. Probably each indexer  
should
buffer it's shard locally on RAID1 SCSI disks, optimize 

Re: Extending XmlRequestHandler

2008-05-09 Thread Daniel Papasian

Alexander Ramos Jardim wrote:

Ok,

Thanks for the advice!

I got the XmlRequestHandler code. I see it uses StAX directly on the XML it
receives. There isn't anything to plug in or out that would give an easy way
to change the XML format.


To maybe save you from reinventing the wheel, when I asked a similar 
question a couple weeks back, hossman pointed me towards SOLR-285 and 
SOLR-370.  285 does XSLT, 370 does STX.


Daniel


Re: Weird problems with document size

2008-05-09 Thread Otis Gospodnetic
Right, there is no need for that locking, you can safely have multiple 
indexing/update requests hitting Solr in parallel.
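(For illustration only, not from the thread: a minimal Java sketch of that pattern -- several client threads POSTing adds to one Solr instance in parallel, with a single commit at the end. The URL, thread count, and document XML are assumptions.)

import java.io.OutputStreamWriter;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ParallelIndexer {

    // POST one XML message (an <add> or a <commit/>) to the Solr update handler.
    static void post(String xml) throws Exception {
        HttpURLConnection conn =
            (HttpURLConnection) new URL("http://localhost:8983/solr/update").openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-type", "text/xml; charset=UTF-8");
        OutputStreamWriter out = new OutputStreamWriter(conn.getOutputStream(), "UTF-8");
        out.write(xml);
        out.close();
        conn.getResponseCode();   // read the status so the request is actually sent
        conn.disconnect();
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4);   // 4 concurrent updaters
        for (int i = 0; i < 1000; i++) {
            final int id = i;
            pool.execute(new Runnable() {
                public void run() {
                    try {
                        post("<add><doc><field name=\"id\">" + id + "</field></doc></add>");
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
        post("<commit/>");   // one commit (or optimize) once everything is in
    }
}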

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


- Original Message 
> From: Andrew Savory <[EMAIL PROTECTED]>
> To: solr-user@lucene.apache.org
> Sent: Friday, May 9, 2008 1:24:06 PM
> Subject: Re: Weird problems with document size
> 
> Hi,
> 
> On 09/05/2008, Otis Gospodnetic wrote:
> 
> >  I don't understand what that lock and unlock is for...
> >  Just do this:
> >  add
> >  add
> >  add
> >  add
> >  ...
> >  ...
> >  optionally commit or optimize
> 
> Yeah, I didn't understand what the lock/unlock was for either - but on
> further reviewing the code, we have a wrapper around the solr servlet
> which does a crude type of locking to ensure only one index updater
> process can run at a time. Not sure it's needed, as I'd guess that
> solr would handle things gracefully anyway, but it at least stops
> multiple index clients firing up.
> 
> Meanwhile it seems that these documents can successfully be added to
> solr when it is running in jetty, so I'm now trying to find out what
> Tomcat is doing to break things.
> 
> Thanks for the reply,
> 
> 
> Andrew.



Re: Loading performance slowdown at ~ 400K documents

2008-05-09 Thread Mike Klaas

Hi Tracy,

What is your Solr/Lucene version?  Is the slowdown sustained or  
temporary (it is not strange to see a slowdown for a few minutes if a  
large segment merge is happening)?


I disagree with Nick's advice of enabling autocommit.

-Mike

On 9-May-08, at 5:02 AM, Tracy Flynn wrote:


Hi,

I'm starting to see a significant slowdown in loading performance  
after I have loaded about 400K documents.  I go from a load rate of  
near 40 docs/sec to 20-25 docs a second.


Am I correct in assuming that, during indexing operations, Lucene/ 
SOLR tries to hold as much of the index in memory as possible? If  
so, does the slowdown indicate a need to increase JVM heap space?


Any ideas / help would be appreciated

Regards,

Tracy

-

Details

Documents loaded as XML via POST command in batches of 1000, commit  
after each batch


Total current documents ~ 450,000
Avg document size: 4KB
One indexed text field contains 3KB or so. (body field below -  
standard type 'text')


Dual Xeon 3 GHz, 4 GB memory

SOLR JVM Startup options

java -Xms256m -Xmx1000m  -jar start.jar
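(If the heap does turn out to be the limit, raising it is a one-line change; a sketch only, assuming the 4 GB box can spare the extra heap while leaving room for the OS disk cache:)

java -Xms1g -Xmx2g -jar start.jar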


Relevant portion of the schema follows


  stored="true" required="true"/>
  required="false"/>
  required="false"/>
  
  stored="true" required="false" default="0"/>
  stored="true" required="true"/>
  required="false"/>
  required="false" compressed="true"/>
  required="false"/>
  stored="true" required="false" default="0"/>
  required="false"/>
  required="false" default="0"/>
  stored="true" required="false" default="0"/>
  required="false" default="0"/>
  required="false"/>
  required="false"/>
  stored="true" required="false" multiValued="true"/>
  required="false" default="0"/>
  stored="true" required="false" default="0"/>
  stored="true" required="false"/>
  stored="true" required="false" multiValued="true"/>
  stored="true" required="false"/>
  required="false" default="0"/>
  stored="true" required="false"/>
  stored="true" required="false" default="0"/>
  stored="true" required="false"/>
  indexed="true" stored="true" required="false"/>
  indexed="true" stored="true" required="false"/>

   
  stored="true" required="false" />







Re: Extending XmlRequestHandler

2008-05-09 Thread Alexander Ramos Jardim
Thanks,

> To maybe save you from reinventing the wheel, when I asked a similar
> question a couple weeks back, hossman pointed me towards SOLR-285 and
> SOLR-370.  285 does XSLT, 370 does STX.

But sorry, can you point me to the version? I am not accustomed to version
control.


-- 
Alexander Ramos Jardim


Re: Function Query result

2008-05-09 Thread Mike Klaas
No problem.  You can return the favour by clarifying the wiki example,  
since it is publicly editable :).


(It is hard for developers who are very familiar with a system to  
write good documentation for beginners, alas.)


-Mike

On 8-May-08, at 11:44 PM, Umar Shah wrote:


thanks mike,

some how it was not evident from the wiki example, or i was too  
presumptious

;-).

-umar


On Fri, May 9, 2008 at 2:53 AM, Mike Klaas <[EMAIL PROTECTED]>  
wrote:



On 7-May-08, at 11:40 PM, Umar Shah wrote:




That  would be sufficient for my requirements,
I'm using the following query params
q=*:*
&_val_:function(blah, blah)
&fl=*,score

I'm not getting the value, value of score =1,
am I missing something?



Is that really what you are sending to Solr?  The _val_ clause must  
be part

of q, not on its own.

-Mike
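(For concreteness, a corrected request would look something like the sketch below; 'popularity' is just a placeholder for whatever field the function reads.)

q=*:* _val_:"ord(popularity)"
fl=*,score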





Re: How Special Character '&' used in indexing

2008-05-09 Thread Mike Klaas


On 9-May-08, at 6:26 AM, Ricky wrote:


I have tried sending '&amp;' instead of '&', like the following:
A &amp; K Inc.

But I still get the same error "entity reference name can not contain
character ..." on the part containing 'A &'.


Please use a library for doing xml encoding--there is absolutely no  
reason to do this yourself.
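(As one way to follow that advice, a minimal sketch using the StAX XMLStreamWriter bundled with the JDK; the field name 'companyName' is an assumption:)

import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamWriter;

public class SolrAddXml {
    public static void main(String[] args) throws Exception {
        StringWriter out = new StringWriter();
        XMLStreamWriter xml = XMLOutputFactory.newInstance().createXMLStreamWriter(out);

        xml.writeStartElement("add");
        xml.writeStartElement("doc");
        xml.writeStartElement("field");
        xml.writeAttribute("name", "companyName");   // hypothetical field name
        xml.writeCharacters("A & K Inc.");           // '&' is escaped to &amp; for you
        xml.writeeEndElement();                      // </field>
        xml.writeEndElement();                       // </doc>
        xml.writeEndElement();                       // </add>
        xml.flush();

        // Prints: <add><doc><field name="companyName">A &amp; K Inc.</field></doc></add>
        System.out.println(out.toString());
    }
}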



Please kindly reply ASAP.


Please also realize that people responding here are donating their  
time and that it is inappropriate to ask for an expedited response.


-Mike



Re: Function Query result

2008-05-09 Thread Umar Shah
Mike,

as asked, I have added an example , hope it will be helpful to future users
.

thanks again.


On Sat, May 10, 2008 at 12:11 AM, Mike Klaas <[EMAIL PROTECTED]> wrote:

> No problem.  You can return the favour by clarifying the wiki example,
> since it is publicly editable :).
>
> (It is hard for developers who are very familiar with a system to write
> good documentation for beginners, alas.)
>
> -Mike
>
>
> On 8-May-08, at 11:44 PM, Umar Shah wrote:
>
>  thanks mike,
>>
>> some how it was not evident from the wiki example, or i was too
>> presumptious
>> ;-).
>>
>> -umar
>>
>>
>> On Fri, May 9, 2008 at 2:53 AM, Mike Klaas <[EMAIL PROTECTED]> wrote:
>>
>>  On 7-May-08, at 11:40 PM, Umar Shah wrote:
>>>
>>>

 That  would be sufficient for my requirements,
 I'm using the following query params
 q=*:*
 &_val_:function(blah, blah)
 &fl=*,score

 I'm not getting the value, value of score =1,
 am I missing something?


>>> Is that really what you are sending to Solr?  The _val_ clause must be
>>> part
>>> of q, not on its own.
>>>
>>> -Mike
>>>
>>>
>


Re: Solr feasibility with terabyte-scale data

2008-05-09 Thread Ken Krugler

Hi Marcus,

It seems a lot of what you're describing is really similar to 
MapReduce, so I think Otis' suggestion to look at Hadoop is a good 
one: it might prevent a lot of headaches and they've already solved 
a lot of the tricky problems. There a number of ridiculously sized 
projects using it to solve their scale problems, not least Yahoo...


You should also look at a new project called Katta:

http://katta.wiki.sourceforge.net/

First code check-in should be happening this weekend, so I'd wait 
until Monday to take a look :)


-- Ken


On 9 May 2008, at 01:17, Marcus Herou wrote:


Cool.

Since you must certainly already have a good partitioning scheme, could you
elaborate on high level how you set this up ?

I'm certain that I will shoot myself in the foot both once and twice before
getting it right but this is what I'm good at; to never stop trying :)
However it is nice to start playing at least on the right side of the
football field so a little push in the back would be really helpful.

Kindly

//Marcus



On Fri, May 9, 2008 at 9:36 AM, James Brady <[EMAIL PROTECTED]>
wrote:


Hi, we have an index of ~300GB, which is at least approaching the ballpark
you're in.

Lucky for us, to coin a phrase we have an 'embarassingly partitionable'
index so we can just scale out horizontally across commodity hardware with
no problems at all. We're also using the multicore features available in
development Solr version to reduce granularity of core size by an order of
magnitude: this makes for lots of small commits, rather than few long ones.

There was mention somewhere in the thread of document collections: if
you're going to be filtering by collection, I'd strongly recommend
partitioning too. It makes scaling so much less painful!

James


On 8 May 2008, at 23:37, marcusherou wrote:


Hi.

I will as well head into a path like yours within some months from now.
Currently I have an index of ~10M docs and only store id's in the index
for
performance and distribution reasons. When we enter a new market I'm
assuming we will soon hit 100M and quite soon after that 1G documents.
Each
document have in average about 3-5k data.

We will use a GlusterFS installation with RAID1 (or RAID10) SATA
enclosures
as shared storage (think of it as a SAN or shared storage at least, one
mount point). Hope this will be the right choice, only future can tell.

Since we are developing a search engine I frankly don't think even having
100's of SOLR instances serving the index will cut it performance wise if
we
have one big index. I totally agree with the others claiming that you most
definitely will go OOE or hit some other constraints of SOLR if you must
have the whole result in memory sort it and create a xml response. I did
hit
such constraints when I couldn't afford the instances to have enough
memory
and I had only 1M of docs back then. And think of it... Optimizing a TB
index will take a long long time and you really want to have an optimized
index if you want to reduce search time.

I am thinking of a sharding solution where I fragment the index over the
disk(s) and let each SOLR instance only have little piece of the total
index. This will require a master database or namenode (or simpler just a
properties file in each index dir) of some sort to know what docs is
located
on which machine or at least how many docs each shard have. This is to
ensure that whenever you introduce a new SOLR instance with a new shard
the
master indexer will know what shard to prioritize. This is probably not
enough either since all new docs will go to the new shard until it is
filled
(have the same size as the others) only then will all shards receive docs
in
a loadbalanced fashion. So whenever you want to add a new indexer you
probably need to initiate a "stealing" process where it steals docs from
the
others until it reaches some sort of threshold (10 servers = each shard
should have 1/10 of the docs or such).

I think this will cut it and enabling us to grow with the data. I think
doing a distributed reindexing will as well be a good thing when it comes
to
cutting both indexing and optimizing speed. Probably each indexer should
buffer it's shard locally on RAID1 SCSI disks, optimize it and then just
copy it to the main index to minimize the burden of the shared storage.

Let's say the indexing part will be all fancy and working i TB scale now
we
come to searching. I personally believe after talking to other guys which
have built big search engines that you need to introduce a controller like
searcher on the client side which itself searches in all of the shards and
merges the response. Perhaps Distributed Solr solves this and will love to
test it whenever my new installation of servers and enclosures is
finished.

Currently my idea is something like this.
public Page search(SearchDocumentCommand sdc)
 {
 Set ids = documentIndexers.keySet();
 int nrOfSearchers = ids.size();
 int totalItems = 0;
 Page docs = new Page(sdc.getPage(), 
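(Regarding "Perhaps Distributed Solr solves this": the distributed search work in development Solr (SOLR-303) does exactly this fan-out-and-merge via a shards parameter on the query side. A sketch of the request form, with hypothetical host names:)

http://searcher1:8983/solr/select?shards=shard1:8983/solr,shard2:8983/solr&q=foo&start=0&rows=10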

Re: Solr hardware specs

2008-05-09 Thread Walter Underwood
And use a log of real queries, captured from your website or one
like it. Query statistics are not uniform.

wunder

On 5/9/08 6:20 AM, "Erick Erickson" <[EMAIL PROTECTED]> wrote:

> This still isn't very helpful. How big are the docs? How many fields do you
> expect to index? What is your expected query rate?
> 
> You can get away with an old laptop if your docs are, say, 5K each and you
> only
> expect to query it once a day and have one text field.
> 
> If each doc is 10M, you're indexing 250 fields and your expected
> query rate is 100/sec, you need some serious hardware.
> 
> But 400K docs isn't very big by SOLR standards in terms of the number
> of docs.
> 
> What I'd really recommend is that you just take an existing machine, create
> an
> index on it and measure. Be aware that the first few queries will be much
> slower
> than subsequent queries, so throw out the first few queries from your
> timings.
> 
> Best
> Erick
> 
> On Fri, May 9, 2008 at 8:27 AM, dudes dudes <[EMAIL PROTECTED]> wrote:
> 
>> 
>> HI Nick,
>> 
>> I'm quite new to solr, so excuse my ignorance about any solr related settings
>> :).
>> We think we would have up to 400K docs in a heavily loaded environment. We surely
>> don't want solr to be publicly
>> accessible (just for internal use). We are not sure whether having 2
>> network interfaces would help the speed issues that we have
>> due to the site location of the company.
>> 
>> thanks,
>> ak
>> 
>>> Date: Sat, 10 May 2008 00:13:49 +1200
>>> From: [EMAIL PROTECTED]
>>> To: solr-user@lucene.apache.org
>>> Subject: Re: Solr hardware specs
>>> 
>>> Hi
>>> It all depends on the load your server is under, how many documents
>>> you have etc. -- I am not sure what you mean by network connectivity
>>> -- solr really should not be run on a publicly accessible IP address.
>>> 
>>> Can you provide some more info on the setup?
>>> -Nick
>>> On 5/10/08, dudes dudes  wrote:
 
  Hello,
 
  Can someone kindly advise me on hardware specs (CPU/HDD/RAM) to install
>> solr on a production server ? We are planning to have it
  on Debian. Also what network connectivity does it require (incoming
>> and outgoing)?
 
  Thanks for your time.
  ak



RE: Solr feasibility with terabyte-scale data

2008-05-09 Thread Lance Norskog
A useful schema trick: MD5 or SHA-1 ids. We generate our unique ID with the
MD5 cryptographic checksumming algorithm. This takes any number of bytes of data
and creates a 128-bit "random" number, or 128 "random" bits. At this point
there are no reports of two different real-world datasets accidentally giving the same checksum.

This gives some handy things: 
a) a fixed-size unique ID field, giving fixed space requirements,
The standard representation of this is 32 hex characters (16 bytes), i.e.
'deadbeefdeadbeefdeadbeefdeadbeef'. You could make a special 128-bit Lucene
data type for this.

b) the ability to change your mind about the uniqueness formula for your
data,

c) a handy primary key for cross-correlating in other databases,
Think external DBs which supply data for some records. The primary
key is the MD5 signature.

d) the ability to randomly pick subsets of your data.
The record 'id:deadbeefdeadbeefdeadbeefdeadbeef' will match the
wildcard string 'deadbeef*', and also 'd*'.
'd*' selects an essentially random subset of your data, 1/16 of the
total size; a two-character prefix such as 'de*' gives 1/256 of your data.
This is perfectly random because MD5 gives such a "perfectly" random
hashcode.

This should go on a wiki page 'SchemaDesignTips'.
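(A minimal Java sketch of the ID generation, assuming the uniqueness formula is simply the document's source URL:)

import java.security.MessageDigest;

public class Md5Id {

    // Returns the 32-character hex MD5 of the input, usable as a fixed-size Solr uniqueKey.
    public static String md5Hex(String s) throws Exception {
        byte[] digest = MessageDigest.getInstance("MD5").digest(s.getBytes("UTF-8"));
        StringBuilder hex = new StringBuilder(32);
        for (byte b : digest) {
            hex.append(Character.forDigit((b >> 4) & 0xf, 16));
            hex.append(Character.forDigit(b & 0xf, 16));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        // The choice of input *is* the uniqueness formula, and can be changed later.
        System.out.println(md5Hex("http://example.com/some/document"));
    }
}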

Cheers,

Lance Norskog





Re: Solr feasibility with terabyte-scale data

2008-05-09 Thread Otis Gospodnetic
You can't believe how much it pains me to see such a nice piece of work live so 
separately.  But I also think I know why it happened :(.  Do you know if Stefan 
& Co. have the intention to bring it under some contrib/ around here?  Would 
that not make sense?


Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


- Original Message 
> From: Ken Krugler <[EMAIL PROTECTED]>
> To: solr-user@lucene.apache.org
> Sent: Friday, May 9, 2008 4:26:19 PM
> Subject: Re: Solr feasibility with terabyte-scale data
> 
> Hi Marcus,
> 
> >It seems a lot of what you're describing is really similar to 
> >MapReduce, so I think Otis' suggestion to look at Hadoop is a good 
> >one: it might prevent a lot of headaches and they've already solved 
> >a lot of the tricky problems. There a number of ridiculously sized 
> >projects using it to solve their scale problems, not least Yahoo...
> 
> You should also look at a new project called Katta:
> 
> http://katta.wiki.sourceforge.net/
> 
> First code check-in should be happening this weekend, so I'd wait 
> until Monday to take a look :)
> 
> -- Ken
> 
> >On 9 May 2008, at 01:17, Marcus Herou wrote:
> >
> >>Cool.
> >>
> >>Since you must certainly already have a good partitioning scheme, could you
> >>elaborate on high level how you set this up ?
> >>
> >>I'm certain that I will shoot myself in the foot both once and twice before
> >>getting it right but this is what I'm good at; to never stop trying :)
> >>However it is nice to start playing at least on the right side of the
> >>football field so a little push in the back would be really helpful.
> >>
> >>Kindly
> >>
> >>//Marcus
> >>
> >>
> >>
> >>On Fri, May 9, 2008 at 9:36 AM, James Brady 
> >>wrote:
> >>
> >>>Hi, we have an index of ~300GB, which is at least approaching the ballpark
> >>>you're in.
> >>>
> >>>Lucky for us, to coin a phrase we have an 'embarassingly partitionable'
> >>>index so we can just scale out horizontally across commodity hardware with
> >>>no problems at all. We're also using the multicore features available in
> >>>development Solr version to reduce granularity of core size by an order of
> >>>magnitude: this makes for lots of small commits, rather than few long ones.
> >>>
> >>>There was mention somewhere in the thread of document collections: if
> >>>you're going to be filtering by collection, I'd strongly recommend
> >>>partitioning too. It makes scaling so much less painful!
> >>>
> >>>James
> >>>
> >>>
> >>>On 8 May 2008, at 23:37, marcusherou wrote:
> >>>
> Hi.
> 
> I will as well head into a path like yours within some months from now.
> Currently I have an index of ~10M docs and only store id's in the index
> for
> performance and distribution reasons. When we enter a new market I'm
> assuming we will soon hit 100M and quite soon after that 1G documents.
> Each
> document have in average about 3-5k data.
> 
> We will use a GlusterFS installation with RAID1 (or RAID10) SATA
> enclosures
> as shared storage (think of it as a SAN or shared storage at least, one
> mount point). Hope this will be the right choice, only future can tell.
> 
> Since we are developing a search engine I frankly don't think even having
> 100's of SOLR instances serving the index will cut it performance wise if
> we
> have one big index. I totally agree with the others claiming that you most
> definitely will go OOE or hit some other constraints of SOLR if you must
> have the whole result in memory sort it and create a xml response. I did
> hit
> such constraints when I couldn't afford the instances to have enough
> memory
> and I had only 1M of docs back then. And think of it... Optimizing a TB
> index will take a long long time and you really want to have an optimized
> index if you want to reduce search time.
> 
> I am thinking of a sharding solution where I fragment the index over the
> disk(s) and let each SOLR instance only have little piece of the total
> index. This will require a master database or namenode (or simpler just a
> properties file in each index dir) of some sort to know what docs is
> located
> on which machine or at least how many docs each shard have. This is to
> ensure that whenever you introduce a new SOLR instance with a new shard
> the
> master indexer will know what shard to prioritize. This is probably not
> enough either since all new docs will go to the new shard until it is
> filled
> (have the same size as the others) only then will all shards receive docs
> in
> a loadbalanced fashion. So whenever you want to add a new indexer you
> probably need to initiate a "stealing" process where it steals docs from
> the
> others until it reaches some sort of threshold (10 servers = each shard
> should have 1/10 of the docs or such).
> 
> I think this will cut it and enabling us to grow with

Re: Solr feasibility with terabyte-scale data

2008-05-09 Thread Ken Krugler

Hi Otis,

You can't believe how much it pains me to see such nice piece of 
work live so separately.  But I also think I know why it happened 
:(.  Do you know if Stefan & Co. have the intention to bring it 
under some contrib/ around here?  Would that not make sense?


I'm not working on the project, so I can't speak for Stefan & 
friends...but my guess is that it's going to live separately as 
something independent of Solr/Nutch. If you view it as search 
plumbing that's usable in multiple environments, then that makes 
sense. If you view it as replicating core Solr (or Nutch) 
functionality, then it sucks. Not sure what the outcome will be.


-- Ken




- Original Message 

 From: Ken Krugler <[EMAIL PROTECTED]>
 To: solr-user@lucene.apache.org
 Sent: Friday, May 9, 2008 4:26:19 PM
 Subject: Re: Solr feasibility with terabyte-scale data

 Hi Marcus,

 >It seems a lot of what you're describing is really similar to
 >MapReduce, so I think Otis' suggestion to look at Hadoop is a good
 >one: it might prevent a lot of headaches and they've already solved
 >a lot of the tricky problems. There a number of ridiculously sized
 >projects using it to solve their scale problems, not least Yahoo...

 You should also look at a new project called Katta:

 http://katta.wiki.sourceforge.net/

 First code check-in should be happening this weekend, so I'd wait
 until Monday to take a look :)

 -- Ken

 >On 9 May 2008, at 01:17, Marcus Herou wrote:
 >
 >>Cool.
 >>
 >>Since you must certainly already have a good partitioning 
scheme, could you

 >>elaborate on high level how you set this up ?
 >>
 >>I'm certain that I will shoot myself in the foot both once and 
twice before

 >>getting it right but this is what I'm good at; to never stop trying :)
 >>However it is nice to start playing at least on the right side of the
 >>football field so a little push in the back would be really helpful.
 >>
 >>Kindly
 >>
 >>//Marcus
 >>
 >>
 >>
 >>On Fri, May 9, 2008 at 9:36 AM, James Brady
 >>wrote:
 >>
 >>>Hi, we have an index of ~300GB, which is at least approaching 
the ballpark

 >>>you're in.
 >>>
 >>>Lucky for us, to coin a phrase we have an 'embarassingly partitionable'
 >>>index so we can just scale out horizontally across commodity 
hardware with

 >>>no problems at all. We're also using the multicore features available in
 >>>development Solr version to reduce granularity of core size by 
an order of
 >>>magnitude: this makes for lots of small commits, rather than 
few long ones.

 >>>
 >>>There was mention somewhere in the thread of document collections: if
 >>>you're going to be filtering by collection, I'd strongly recommend
 >>>partitioning too. It makes scaling so much less painful!
 >>>
 >>>James
 >>>
 >>>
 >>>On 8 May 2008, at 23:37, marcusherou wrote:
 >>>
 Hi.
 
 I will as well head into a path like yours within some months from now.
 Currently I have an index of ~10M docs and only store id's in the index
 for
 performance and distribution reasons. When we enter a new market I'm
 assuming we will soon hit 100M and quite soon after that 1G documents.
 Each
 document have in average about 3-5k data.
 
 We will use a GlusterFS installation with RAID1 (or RAID10) SATA

 > enclosures

 as shared storage (think of it as a SAN or shared storage at least, one
 mount point). Hope this will be the right choice, only future can tell.
 
 Since we are developing a search engine I frankly don't think 
even having
 100's of SOLR instances serving the index will cut it 
performance wise if

 we
 have one big index. I totally agree with the others claiming 
that you most

 definitely will go OOE or hit some other constraints of SOLR if you must
 have the whole result in memory sort it and create a xml response. I did
 hit
 such constraints when I couldn't afford the instances to have enough

 > memory

 and I had only 1M of docs back then. And think of it... Optimizing a TB
 index will take a long long time and you really want to have 
an optimized

 index if you want to reduce search time.
 
 I am thinking of a sharding solution where I fragment the index over the
 disk(s) and let each SOLR instance only have little piece of the total
 index. This will require a master database or namenode (or 
simpler just a

 properties file in each index dir) of some sort to know what docs is
 located
 on which machine or at least how many docs each shard have. This is to
 ensure that whenever you introduce a new SOLR instance with a new shard
 the
 master indexer will know what shard to prioritize. This is probably not
 enough either since all new docs will go to the new shard until it is
 filled
 (have the same size as the others) only then will all shards 
receive docs

 in
 a loadbalanced fashion. So whenever you want to add a new indexer you
 probably need to init

Re: Function Query result

2008-05-09 Thread Mike Klaas

Thanks so much Umar!

-Mike

On 9-May-08, at 1:22 PM, Umar Shah wrote:


Mike,

as asked, I have added an example , hope it will be helpful to  
future users

.

thanks again.


On Sat, May 10, 2008 at 12:11 AM, Mike Klaas <[EMAIL PROTECTED]>  
wrote:


No problem.  You can return the favour by clarifying the wiki  
example,

since it is publicly editable :).

(It is hard for developers who are very familiar with a system to  
write

good documentation for beginners, alas.)

-Mike


On 8-May-08, at 11:44 PM, Umar Shah wrote:

thanks mike,


some how it was not evident from the wiki example, or i was too
presumptious
;-).

-umar


On Fri, May 9, 2008 at 2:53 AM, Mike Klaas <[EMAIL PROTECTED]>  
wrote:


On 7-May-08, at 11:40 PM, Umar Shah wrote:





That  would be sufficient for my requirements,
I'm using the following query params
q=*:*
&_val_:function(blah, blah)
&fl=*,score

I'm not getting the value, value of score =1,
am I missing something?


Is that really what you are sending to Solr?  The _val_ clause  
must be

part
of q, not on its own.

-Mike








exceeded limit of maxWarmingSearchers

2008-05-09 Thread Sasha Voynow
Hi:

I'm getting flurries of these error messages:



WARNING: Error opening new searcher. exceeded limit of
maxWarmingSearchers=4, try again later.

SEVERE: org.apache.solr.common.SolrException: Error opening new searcher.
exceeded limit of maxWarmingSearchers=4, try again later.


On a solr instance where I am in the process of indexing moderately large
number of documents (300K+). There is no querying of the index taking place
at all.
I don't understand what operations are causing new searchers to warm, or how
to stop them from doing so.  I'd be happy to provide more details of my
configuration if necessary, I've made very few changes to the solrconfig.xml
that comes with the sample application.


Thanks.


SV


Re: exceeded limit of maxWarmingSearchers

2008-05-09 Thread Otis Gospodnetic
Sasha,

Do you have postCommit or postOptimize hooks enabled?  Are you sending commits 
or have autoCommit on?

My suggestions:
- comment out post* hooks
- do not send a commit until you are done (or you can just optimize at the end)
- disable autoCommit


If there is anything else that could trigger searcher warming, I can't think of 
it at the moment.  Let us know if the above eliminates the problem.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


- Original Message 
> From: Sasha Voynow <[EMAIL PROTECTED]>
> To: solr-user@lucene.apache.org
> Sent: Friday, May 9, 2008 6:59:00 PM
> Subject: exceeded limit of maxWarmingSearchers
> 
> Hi:
> 
> I'm getting flurries of these error messages:
> 
> 
> 
> WARNING: Error opening new searcher. exceeded limit of
> maxWarmingSearchers=4, try again later.
> 
> SEVERE: org.apache.solr.common.SolrException: Error opening new searcher.
> exceeded limit of maxWarmingSearchers=4, try again later.
> 
> 
> On a solr instance where I am in the process of indexing moderately large
> number of documents (300K+). There is no querying of the index taking place
> at all.
> I don't understand what operations are causing new searchers to warm, or how
> to stop them from doing so.  I'd be happy to provide more details of my
> configuration if necessary, I've made very few changes to the solrconfig.xml
> that comes with the sample application.
> 
> 
> Thanks.
> 
> 
> SV



Re: Solr feasibility with terabyte-scale data

2008-05-09 Thread Otis Gospodnetic
From what I can tell from the overview on http://katta.wiki.sourceforge.net/, 
it's a partial replication of Solr/Nutch functionality, plus some goodies.  It 
might have been better to work those goodies into some friendly contrib/, be it 
Solr, Nutch, Hadoop, or Lucene.  Anyhow, let's see what happens there! :)


Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


- Original Message 
> From: Ken Krugler <[EMAIL PROTECTED]>
> To: solr-user@lucene.apache.org
> Sent: Friday, May 9, 2008 5:37:19 PM
> Subject: Re: Solr feasibility with terabyte-scale data
> 
> Hi Otis,
> 
> >You can't believe how much it pains me to see such nice piece of 
> >work live so separately.  But I also think I know why it happened 
> >:(.  Do you know if Stefan & Co. have the intention to bring it 
> >under some contrib/ around here?  Would that not make sense?
> 
> I'm not working on the project, so I can't speak for Stefan & 
> friends...but my guess is that it's going to live separately as 
> something independent of Solr/Nutch. If you view it as search 
> plumbing that's usable in multiple environments, then that makes 
> sense. If you view it as replicating core Solr (or Nutch) 
> functionality, then it sucks. Not sure what the outcome will be.
> 
> -- Ken
> 
> 
> 
> >- Original Message 
> >>  From: Ken Krugler 
> >>  To: solr-user@lucene.apache.org
> >>  Sent: Friday, May 9, 2008 4:26:19 PM
> >>  Subject: Re: Solr feasibility with terabyte-scale data
> >>
> >>  Hi Marcus,
> >>
> >>  >It seems a lot of what you're describing is really similar to
> >>  >MapReduce, so I think Otis' suggestion to look at Hadoop is a good
> >>  >one: it might prevent a lot of headaches and they've already solved
> >>  >a lot of the tricky problems. There a number of ridiculously sized
> >>  >projects using it to solve their scale problems, not least Yahoo...
> >>
> >>  You should also look at a new project called Katta:
> >>
> >>  http://katta.wiki.sourceforge.net/
> >>
> >>  First code check-in should be happening this weekend, so I'd wait
> >>  until Monday to take a look :)
> >>
> >>  -- Ken
> >>
> >>  >On 9 May 2008, at 01:17, Marcus Herou wrote:
> >>  >
> >>  >>Cool.
> >>  >>
> >>  >>Since you must certainly already have a good partitioning 
> >>scheme, could you
> >>  >>elaborate on high level how you set this up ?
> >>  >>
> >>  >>I'm certain that I will shoot myself in the foot both once and 
> >>twice before
> >>  >>getting it right but this is what I'm good at; to never stop trying :)
> >>  >>However it is nice to start playing at least on the right side of the
> >>  >>football field so a little push in the back would be really helpful.
> >>  >>
> >>  >>Kindly
> >>  >>
> >>  >>//Marcus
> >>  >>
> >>  >>
> >>  >>
> >>  >>On Fri, May 9, 2008 at 9:36 AM, James Brady
> >>  >>wrote:
> >>  >>
> >>  >>>Hi, we have an index of ~300GB, which is at least approaching 
> >>the ballpark
> >>  >>>you're in.
> >>  >>>
> >>  >>>Lucky for us, to coin a phrase we have an 'embarassingly partitionable'
> >>  >>>index so we can just scale out horizontally across commodity 
> >>hardware with
> >>  >>>no problems at all. We're also using the multicore features available 
> >> in
> >>  >>>development Solr version to reduce granularity of core size by 
> >>an order of
> >>  >>>magnitude: this makes for lots of small commits, rather than 
> >>few long ones.
> >>  >>>
> >>  >>>There was mention somewhere in the thread of document collections: if
> >>  >>>you're going to be filtering by collection, I'd strongly recommend
> >>  >>>partitioning too. It makes scaling so much less painful!
> >>  >>>
> >>  >>>James
> >>  >>>
> >>  >>>
> >>  >>>On 8 May 2008, at 23:37, marcusherou wrote:
> >>  >>>
> >>  Hi.
> >>  
> >>  I will as well head into a path like yours within some months from 
> >> now.
> >>  Currently I have an index of ~10M docs and only store id's in the 
> >> index
> >>  for
> >>  performance and distribution reasons. When we enter a new market I'm
> >>  assuming we will soon hit 100M and quite soon after that 1G documents.
> >>  Each
> >>  document have in average about 3-5k data.
> >>  
> >>  We will use a GlusterFS installation with RAID1 (or RAID10) SATA
> >  > enclosures
> >>  as shared storage (think of it as a SAN or shared storage at least, 
> >> one
> >>  mount point). Hope this will be the right choice, only future can 
> >> tell.
> >>  
> >>  Since we are developing a search engine I frankly don't think 
> >>even having
> >>  100's of SOLR instances serving the index will cut it 
> >>performance wise if
> >>  we
> >>  have one big index. I totally agree with the others claiming 
> >>that you most
> >>  definitely will go OOE or hit some other constraints of SOLR if you 
> >> must
> >>  have the whole result in memory sort it and create a xml response. I 
> >> did
> >>  hit
> >>  such constraints when I couldn't afford the i

Re: exceeded limit of maxWarmingSearchers

2008-05-09 Thread Otis Gospodnetic
Bah, ignore 30% of what I said below - 30% of my mind was following Sesame 
Street, another 30% was looking at some Hadoop jobs, and the last 30% was 
writing the response.  The missing 10% is missing.

Leave the post* hook(s) in, they are fine -- you have to trigger the 
snapshooter somehow, I presume.  My guess is you or autoCommit is committing too 
frequently.  Just don't commit until the end, or at least that's what I try to 
do.


Back to washing dishes for a change.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


- Original Message 
> From: Otis Gospodnetic <[EMAIL PROTECTED]>
> To: solr-user@lucene.apache.org
> Sent: Friday, May 9, 2008 7:18:03 PM
> Subject: Re: exceeded limit of maxWarmingSearchers
> 
> Sasha,
> 
> Do you have postCommit or postOptimize hooks enabled?  Are you sending 
> commits 
> or have autoCommit on?
> 
> My suggestions:
> - comment out post* hooks
> - do not send a commit until you are done (or you can just optimize at the 
> end)
> - disable autoCommit
> 
> 
> If there is anything else that could trigger searcher warming, I can't think 
> of 
> it at the moment.  Let us know if the above eliminates the problem.
> 
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> 
> 
> - Original Message 
> > From: Sasha Voynow 
> > To: solr-user@lucene.apache.org
> > Sent: Friday, May 9, 2008 6:59:00 PM
> > Subject: exceeded limit of maxWarmingSearchers
> > 
> > Hi:
> > 
> > I'm getting flurries of these error messages:
> > 
> > 
> > 
> > WARNING: Error opening new searcher. exceeded limit of
> > maxWarmingSearchers=4, try again later.
> > 
> > SEVERE: org.apache.solr.common.SolrException: Error opening new searcher.
> > exceeded limit of maxWarmingSearchers=4, try again later.
> > 
> > 
> > On a solr instance where I am in the process of indexing moderately large
> > number of documents (300K+). There is no querying of the index taking place
> > at all.
> > I don't understand what operations are causing new searchers to warm, or how
> > to stop them from doing so.  I'd be happy to provide more details of my
> > configuration if necessary, I've made very few changes to the solrconfig.xml
> > that comes with the sample application.
> > 
> > 
> > Thanks.
> > 
> > 
> > SV



Re: exceeded limit of maxWarmingSearchers

2008-05-09 Thread Sasha Voynow
It happened without auto-commit, although I would like to be able to use a
reasonably infrequent autocommit setting. Is it generally better to handle
batching your commits programmatically on the "client" side rather than
relying on auto-commit? As far as post* hooks go, I will comment out a post
optimize hook that I don't need. But I can't imagine that's causing a
problem, if only because I hadn't optimized (or is optimization run for you
automatically in some cases?)


I'll let you know the results.


Thanks for the prompt reply.

On 5/9/08, Otis Gospodnetic <[EMAIL PROTECTED]> wrote:
>
> Sasha,
>
> Do you have postCommit or postOptimize hooks enabled?  Are you sending
> commits or have autoCommit on?
>
> My suggestions:
> - comment out post* hooks
> - do not send a commit until you are done (or you can just optimize at the
> end)
> - disable autoCommit
>
>
> If there is anything else that could trigger searcher warming, I can't
> think of it at the moment.  Let us know if the above eliminates the problem.
>
> Otis
>
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
>
>
> - Original Message 
> > From: Sasha Voynow <[EMAIL PROTECTED]>
> > To: solr-user@lucene.apache.org
> > Sent: Friday, May 9, 2008 6:59:00 PM
> > Subject: exceeded limit of maxWarmingSearchers
> >
> > Hi:
> >
> > I'm getting flurries of these error messages:
> >
> >
> >
> > WARNING: Error opening new searcher. exceeded limit of
> > maxWarmingSearchers=4, try again later.
> >
> > SEVERE: org.apache.solr.common.SolrException: Error opening new searcher.
> > exceeded limit of maxWarmingSearchers=4, try again later.
> >
> >
> > On a solr instance where I am in the process of indexing moderately large
> > number of documents (300K+). There is no querying of the index taking
> place
> > at all.
> > I don't understand what operations are causing new searchers to warm, or
> how
> > to stop them from doing so.  I'd be happy to provide more details of my
> > configuration if necessary, I've made very few changes to the
> solrconfig.xml
> > that comes with the sample application.
> >
> >
> > Thanks.
> >
> >
> > SV
>
>


Re: exceeded limit of maxWarmingSearchers

2008-05-09 Thread Ryan McKinley


On May 9, 2008, at 7:33 PM, Sasha Voynow wrote:

Is it generally better to handle
batching your commits programmatically on the "client" side rather  
than

relying on auto-commit?


the time based auto-commit is useful if you are indexing from multiple  
clients to a single server.  Rather than risk having clients send the  
<commit/> command too often, you can just handle it on the server.


If you are doing all your indexing from a single client, you can wait  
till it is all done, then send a single <commit/>.


ryan
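(For reference, a sketch of where the time-based setting lives in solrconfig.xml; the thresholds are examples only:)

<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <maxDocs>10000</maxDocs>  <!-- commit after this many pending docs -->
    <maxTime>60000</maxTime>  <!-- ...or after this many milliseconds -->
  </autoCommit>
</updateHandler>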


Simple Solr POST using java

2008-05-09 Thread Marshall Gunter
Can someone please tell me why this code snippet would not add a 
document to the Solr index after a "<commit/>" was issued, or
please post a snippet of Java code to add a document to the Solr index 
that includes the URL reference as a String?


Code example:

// NOTE: the XML tags in this string were stripped by the mail archive; the
// element structure and field names below are reconstructed and assumed.
String strToAdd =
    "<add>" +
    "  <doc>" +
    "    <field name=\"id\">foo</field>" +
    "    <field name=\"name\">bar</field>" +
    "  </doc>" +
    "</add>";

URL url = new URL("http://localhost:8080//update/");

HttpURLConnection urlc = (HttpURLConnection)url.openConnection();
urlc.setRequestMethod("POST");
urlc.setDoOutput(true);

urlc.setRequestProperty("Content-type", "text/xml; charset=UTF-8");

OutputStreamWriter out = new OutputStreamWriter(urlc.getOutputStream());


out.write(strToAdd);
out.flush();

--
Marshall Gunter