Re: How can I omit the illegal characters when indexing the docs?

2009-01-04 Thread Peter Wolanin
For documents we are indexing via the PHP client, we are currently
using the following regex to strip control characters from each field
that might contain them:

function apachesolr_strip_ctl_chars($text) {
  // See:  http://w3.org/International/questions/qa-forms-utf-8.html
  // Printable UTF-8 does not include any of these control chars below \x7F
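  // Tab (\x09), LF (\x0A) and CR (\x0D) are deliberately left untouched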
  return preg_replace('@[\x00-\x08\x0B\x0C\x0E-\x1F]@', ' ', $text);
}
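
For example, here is how we apply it (the $document array below is just a
made-up illustration, not our real schema):

$document = array(
  'id'    => 'doc-1',
  'title' => "Quarterly report\x08",         // contains CTRL-CHAR, code 8
  'body'  => "Text\x0B extracted from Word",
);
foreach ($document as $field => $value) {
  $document[$field] = apachesolr_strip_ctl_chars($value);
}
// $document is now safe to serialize into Solr's <add><doc> update XML.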

-Peter

On Fri, Jan 2, 2009 at 3:41 AM, RaghavPrabhu  wrote:
>
> Hi all,
>
>  I am extracting text from a Word document using Apache POI, then generating
> the XML doc that I want to index in Solr. The problem I faced is that it
> throws the error shown below in the browser:
>
> HTTP Status 500 - Illegal character ((CTRL-CHAR, code 8)) at [row,col
> {unknown-source}]: [1,1592]
> com.ctc.wstx.exc.WstxUnexpectedCharException: Illegal character ((CTRL-CHAR,
> code 8)) at [row,col {unknown-source}]: [1,1592] at
> com.ctc.wstx.sr.StreamScanner.throwInvalidSpace(StreamScanner.java:675) at
> com.ctc.wstx.sr.StreamScanner.throwInvalidSpace(StreamScanner.java:660) at
> com.ctc.wstx.sr.BasicStreamReader.readCDataPrimary(BasicStreamReader.java:4240)
> at
> com.ctc.wstx.sr.BasicStreamReader.nextFromTreeCommentOrCData(BasicStreamReader.java:3280)
> at
> com.ctc.wstx.sr.BasicStreamReader.nextFromTree(BasicStreamReader.java:2824)
> at com.ctc.wstx.sr.BasicStreamReader.next(BasicStreamReader.java:1019) at
> org.apache.solr.handler.XmlUpdateRequestHandler.readDoc(XmlUpdateRequestHandler.java:321)
> at
> org.apache.solr.handler.XmlUpdateRequestHandler.processUpdate(XmlUpdateRequestHandler.java:195)
> at
> org.apache.solr.handler.XmlUpdateRequestHandler.handleRequestBody(XmlUpdateRequestHandler.java:123)
> at
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
> at org.apache.solr.core.SolrCore.execute(SolrCore.java:1204) at
> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:303)
> at
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:232)
> at
> org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
> at
> org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
> at
> org.jboss.web.tomcat.filters.ReplyHeaderFilter.doFilter(ReplyHeaderFilter.java:96)
> at
> org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
> at
> org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
> at
> org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:230)
> at
> org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:175)
> at
> org.jboss.web.tomcat.security.SecurityAssociationValve.invoke(SecurityAssociationValve.java:179)
> at
> org.jboss.web.tomcat.security.JaccContextValve.invoke(JaccContextValve.java:84)
> at
> org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
> at
> org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
> at
> org.jboss.web.tomcat.service.jca.CachedConnectionValve.invoke(CachedConnectionValve.java:157)
> at
> org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
> at
> org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:262)
> at
> org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:844)
> at
> org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:583)
> at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:446)
> at java.lang.Thread.run(Thread.java:619)
>
> The extracted Word document contains a special character (it renders like a
> square box). How can I omit those characters when I submit the document to
> Solr?
>
>
> Thanks in advance,
> Regards
> Prabhu.K
>
>
> --
> View this message in context: 
> http://www.nabble.com/How-can-i-omit-the-illegal-characters%2Cwhen-indexing-the-docs--tp21249084p21249084.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
>



-- 
--
Peter M. Wolanin, Ph.D.
Momentum Specialist, Acquia, Inc.
peter.wola...@acquia.com


Re: Pagination in Solr

2009-01-04 Thread Chris Hostetter
: solr supports params start and rows
: append &start=X&rows=Y to the url (assuming you are using standard
: request handler)
: 
: where X = page number
: and Y = results per page.

not quite ... "start" is the result number you want to start at (zero
indexed), so if you want 10 results per page your "rows" value would
always be "10" and your start values would be 0, 10, 20, 30, 40, etc...
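
For instance (a sketch only - the URL assumes the stock example server on
localhost:8983 and the standard request handler):

$rowsPerPage = 10;
$page = 3;                                  // 1-based page number
$start = ($page - 1) * $rowsPerPage;        // => 20
$url = 'http://localhost:8983/solr/select?q=solr'
     . '&start=' . $start . '&rows=' . $rowsPerPage;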



-Hoss



Re: cannot allocate memory for snapshooter

2009-01-04 Thread Mark Miller
You're right, it's nasty, but it's how fork works. I would say it's
something that should be fixed, it's so nasty, but with the new all-Java
replication, it's probably a moot point.


Forking for a small script on something that can have such a large 
memory footprint is just a huge waste of resources. Ideally you might 
have a tiny program running, listening on a socket or something, and it 
can be alerted and do the actual fork (being small itself). Or some 
other such workaround, other than copying a few gig into RAM or swap :)
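
Something like this, say (purely a sketch of the idea in PHP - the port, the
one-word protocol and the reliance on the pcntl extension are all made up):

// Tiny helper daemon: its own footprint is small, so forking it is cheap.
$server = stream_socket_server('tcp://127.0.0.1:8985', $errno, $errstr);
while ($conn = stream_socket_accept($server, -1)) {
    if (trim(fgets($conn)) === 'snapshoot') {
        $pid = pcntl_fork();          // forks *this* process, not the JVM
        if ($pid === 0) {
            pcntl_exec('/vol/solr/bin/snapshooter');
            exit(1);                  // only reached if exec fails
        }
    }
    fclose($conn);
    pcntl_waitpid(-1, $status, WNOHANG);  // reap any finished children
}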


The new all-Java replication looks a little nicer in the face of this
(someone was asking about the differences earlier).


- Mark

Brian Whitman wrote:

Thanks for the pointer. (It seems really weird to alloc 5GB of swap just
because the JVM needs to run a shell script... but I get Hoss's explanation
in the following post.)

On Fri, Jan 2, 2009 at 2:37 PM, Bill Au  wrote:

add more swap space:
http://www.nabble.com/Not-enough-space-to11423199.html#a11424938

Bill

On Fri, Jan 2, 2009 at 10:52 AM, Brian Whitman  wrote:

I have an indexing machine on a test server (a mid-level EC2 instance, 8GB
of RAM) and I run jetty like:

java -server -Xms5g -Xmx5g -XX:MaxPermSize=128m
-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/heap
-Dsolr.solr.home=/vol/solr -Djava.awt.headless=true -jar start.jar

The indexing master is set to snapshoot on commit. Sometimes (not always)
the snapshot fails with

SEVERE: java.io.IOException: Cannot run program
"/vol/solr/bin/snapshooter":
java.io.IOException: error=12, Cannot allocate memory
at java.lang.ProcessBuilder.start(Unknown Source)

Why would snapshooter need more than 2GB of RAM? /proc/meminfo says (with
solr running & nothing else):

MemTotal:  7872040 kB
MemFree:   2018404 kB
Buffers: 67704 kB
Cached:2161880 kB
SwapCached:  0 kB
Active:3446348 kB
Inactive:  2186964 kB
SwapTotal:   0 kB
SwapFree:0 kB
Dirty:   8 kB
Writeback:   0 kB
AnonPages: 3403728 kB
Mapped:  12016 kB
Slab:37804 kB
SReclaimable:20048 kB
SUnreclaim:  17756 kB
PageTables:   7476 kB
NFS_Unstable:0 kB
Bounce:  0 kB
CommitLimit:   3936020 kB
Committed_AS:  5383624 kB
VmallocTotal: 34359738367 kB
VmallocUsed:   340 kB
VmallocChunk: 34359738027 kB


Re: cannot allocate memory for snapshooter

2009-01-04 Thread Yonik Seeley
On Sun, Jan 4, 2009 at 8:07 PM, Mark Miller  wrote:
> Forking for a small script on something that can have such a large memory
> footprint is just a huge waste of resources. Ideally you might have a tiny
> program running, listening on a socket or something, and it can be alerted
> and do the actual fork (being small itself). Or some other such workaround,
> other than copying a few gig into RAM or swap :)

Well, fork doesn't actually copy anymore (for a long time now) - it's
really only the page tables that get copied and set to copy-on-write
so the fork is actually pretty lightweight.
The issue is that the OS is being conservative and checking that there
would be enough RAM+SWAP available if all of the process address space
did have to be copied/allocated (older versions of linux didn't do
this check and allowed memory overcommit).  The OS doesn't know that
the fork will be followed by an exec.
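
In PHP terms the pattern looks roughly like this (just a sketch; the pcntl
extension and this usage are my illustration, not how Solr invokes it):

$pid = pcntl_fork();            // child starts with a copy-on-write view
if ($pid === 0) {
    // The child immediately swaps in a new program, so the address space
    // the overcommit check worried about is never actually written to.
    pcntl_exec('/vol/solr/bin/snapshooter');
    exit(1);                    // only reached if exec fails
} elseif ($pid > 0) {
    pcntl_waitpid($pid, $status);   // parent just waits for the script
}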

So the workaround of creating more swap is just so that this OS memory
overcommit check passes.  The swap won't actually be used by the fork
+ exec.

The real fix would be for the JVM to use something like vfork when available.

-Yonik


Re: cannot allocate memory for snapshooter

2009-01-04 Thread Mark Miller

Yonik Seeley wrote:

> On Sun, Jan 4, 2009 at 8:07 PM, Mark Miller  wrote:
>> Forking for a small script on something that can have such a large memory
>> footprint is just a huge waste of resources. Ideally you might have a tiny
>> program running, listening on a socket or something, and it can be alerted
>> and do the actual fork (being small itself). Or some other such workaround,
>> other than copying a few gig into RAM or swap :)
>
> Well, fork doesn't actually copy anymore (for a long time now) - it's
> really only the page tables that get copied and set to copy-on-write
> so the fork is actually pretty lightweight.
Right, copying was the wrong word. It depends: depending on your Unix
variant, it will actually use vfork, or sometimes..., or sometimes you
have no option to share. (Because you can screw with the parent, I've seen
warnings in the docs that vfork is not recommended even for use - but this
could be old now, and was for a particular version of UNIX that I don't
remember.)

> The issue is that the OS is being conservative and checking that there
> would be enough RAM+SWAP available if all of the process address space
> did have to be copied/allocated (older versions of linux didn't do
> this check and allowed memory overcommit).  The OS doesn't know that
> the fork will be followed by an exec.

I don't think you can just count on that in a Unix environment. Maybe
Linux took care of it, but is that common on all versions of Unix? And
what if you have an older version of Linux?

> So the workaround of creating more swap is just so that this OS memory
> overcommit check passes.  The swap won't actually be used by the fork
> + exec.

Again, only if you're lucky. It depends on the many implementations of
fork. A lot of times fork is actually vfork or something, but Solr can't
count on that for everybody, I wouldn't think.

> The real fix would be for the JVM to use something like vfork when available.

Which kind of happens behind the scenes if you're lucky already. Some Unix
guys don't like it, and I assume that's why it's not the standard (overly
concerned with the child process mucking up the parent process).


I shouldn't have said copy - the issue is that we are looking for way too
much RAM. A JVM using 5 gig will look for another 5 - that's terrible. I
don't think we can solve it in a universal way for Unix by relying on
forking the JVM, though. It's hit or miss. The real fix can't depend on
your OS variant and its version, I wouldn't think.


Re: cannot allocate memory for snapshooter

2009-01-04 Thread Mark Miller
Hey Brian, I didn't catch what OS you are using on EC2, by the way. I
thought most UNIX OSes were using memory overcommit - a quick search
brings up Linux, AIX, and HP-UX, and maybe even OS X?


What are you running over there? EC2, so Linux I assume?

Yonik: I take it, now that Linux uses copy on write, they stopped with 
the memory overcommit? Or perhaps Brian is not on Linux...


I love Unix stuff :) So much Java and Windows in my past, I still find 
this info really cool. Windows command line was never so interesting.



- Mark


Re: collectionDistribution vs SolrReplication

2009-01-04 Thread Noble Paul നോബിള്‍ नोब्ळ्
* SolrReplication does not create snapshots, so you have less cleanup
to do. The script-based replication results in more disk space
consumption (especially if you do frequent commits).
* Performance is roughly the same, unless you are replicating across
different LANs, where SolrReplication can zip and transfer.




On Sun, Jan 4, 2009 at 11:57 AM, Shalin Shekhar Mangar
 wrote:
> I think the main reason is ease of use. Warming is done the same way by
> adding a newSearcher listener in solrconfig.xml
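>
> For example, the stock solrconfig.xml carries a listener along these lines
> (the warming queries here are placeholders you'd replace with your own):
>
>   <listener event="newSearcher" class="solr.QuerySenderListener">
>     <arr name="queries">
>       <lst> <str name="q">solr</str> <str name="start">0</str> <str name="rows">10</str> </lst>
>     </arr>
>   </listener>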
>
> On Sun, Jan 4, 2009 at 2:10 AM, Marc Sturlese wrote:
>
>>
>> Hey there,
>>
>> I would like to know the advantages of moving from:
>> a master-slave system using CollectionDistribution with all its .sh
>> scripts
>> http://wiki.apache.org/solr/CollectionDistribution
>>
>> to:
>> using SolrReplication and its solrconfig.xml configuration.
>> http://wiki.apache.org/solr/SolrReplication
>>
>>
>> Is it technically much better, or mainly easier to use?
>> Does SolrReplication do warming as well?
>>
>> Checking the performance numbers in the SolrReplication wiki page, things
>> seem to be similar except for the RAM; is that where the advantages are?
>>
>> Thanks in advance!!
>> --
>> View this message in context:
>> http://www.nabble.com/collectionDistribution-vs-SolrReplication-tp21269112p21269112.html
>> Sent from the Solr - User mailing list archive at Nabble.com.
>>
>>
>
>
> --
> Regards,
> Shalin Shekhar Mangar.
>



-- 
--Noble Paul