Re: How can I omit the illegal characters when indexing the docs?
For documents we are indexing via the PHP client, we are currently using the
following regex to strip control characters from each field that might
contain them:

function apachesolr_strip_ctl_chars($text) {
  // See: http://w3.org/International/questions/qa-forms-utf-8.html
  // Printable UTF-8 does not include any of these chars below \x7F.
  return preg_replace('@[\x00-\x08\x0B\x0C\x0E-\x1F]@', ' ', $text);
}

-Peter

On Fri, Jan 2, 2009 at 3:41 AM, RaghavPrabhu wrote:
>
> Hi all,
>
> I am extracting a Word document using Apache POI, then generating the XML
> doc that I want to index in Solr. The problem I face is that it throws the
> error shown below in the browser:
>
> HTTP Status 500 - Illegal character ((CTRL-CHAR, code 8)) at [row,col
> {unknown-source}]: [1,1592]
> com.ctc.wstx.exc.WstxUnexpectedCharException: Illegal character
> ((CTRL-CHAR, code 8)) at [row,col {unknown-source}]: [1,1592]
>   at com.ctc.wstx.sr.StreamScanner.throwInvalidSpace(StreamScanner.java:675)
>   at com.ctc.wstx.sr.StreamScanner.throwInvalidSpace(StreamScanner.java:660)
>   at com.ctc.wstx.sr.BasicStreamReader.readCDataPrimary(BasicStreamReader.java:4240)
>   at com.ctc.wstx.sr.BasicStreamReader.nextFromTreeCommentOrCData(BasicStreamReader.java:3280)
>   at com.ctc.wstx.sr.BasicStreamReader.nextFromTree(BasicStreamReader.java:2824)
>   at com.ctc.wstx.sr.BasicStreamReader.next(BasicStreamReader.java:1019)
>   at org.apache.solr.handler.XmlUpdateRequestHandler.readDoc(XmlUpdateRequestHandler.java:321)
>   at org.apache.solr.handler.XmlUpdateRequestHandler.processUpdate(XmlUpdateRequestHandler.java:195)
>   at org.apache.solr.handler.XmlUpdateRequestHandler.handleRequestBody(XmlUpdateRequestHandler.java:123)
>   at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
>   at org.apache.solr.core.SolrCore.execute(SolrCore.java:1204)
>   at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:303)
>   at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:232)
>   [... remaining Tomcat/JBoss servlet container frames omitted ...]
>   at java.lang.Thread.run(Thread.java:619)
>
> The extracted Word document contains a special character (it looks like a
> square box). How can I omit those characters when I submit the document to
> Solr?
>
> Thanks in advance,
> Regards
> Prabhu.K
>
> --
> View this message in context:
> http://www.nabble.com/How-can-i-omit-the-illegal-characters%2Cwhen-indexing-the-docs--tp21249084p21249084.html
> Sent from the Solr - User mailing list archive at Nabble.com.

--
Peter M. Wolanin, Ph.D.
Momentum Specialist, Acquia. Inc.
peter.wola...@acquia.com
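[Editor's note: for anyone doing the same cleanup on the Java/POI side before
posting to Solr, here is a minimal sketch mirroring the character class of
Peter's PHP function above; the class and method names are purely
illustrative.]

import java.util.regex.Pattern;

public class CtlCharStripper {
    // Same character class as the PHP regex above: C0 control characters
    // that are illegal in XML 1.0 (tab, LF, and CR are deliberately kept).
    private static final Pattern CTL_CHARS =
            Pattern.compile("[\\x00-\\x08\\x0B\\x0C\\x0E-\\x1F]");

    public static String strip(String text) {
        return CTL_CHARS.matcher(text).replaceAll(" ");
    }
}

Running each field value through something like this before building the
update XML should keep Woodstox from rejecting the document.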
Re: Pagination in Solr
: solr supports params start and rows
: append &start=X&rows=Y to the url (assuming you are using standard
: request handler)
:
: where X = page number
: and Y = results per page.

Not quite ... "start" is the result number you want to start at (zero
indexed), so if you want 10 results per page your "rows" value would always
be "10" and your start values would be 0, 10, 20, 30, 40, etc...

-Hoss
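[Editor's note: to make the arithmetic concrete, here is an illustrative
sketch of how a client computes "start" from a 1-based page number; the
host, port, and query string are hypothetical.]

public class SolrPaging {
    static final int ROWS = 10; // results per page; "rows" stays constant

    // Page 1 -> start=0, page 2 -> start=10, page 3 -> start=20, ...
    static String pageUrl(String query, int page) {
        int start = (page - 1) * ROWS;
        return "http://localhost:8983/solr/select?q=" + query
                + "&start=" + start + "&rows=" + ROWS;
    }
}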
Re: cannot allocate memory for snapshooter
You're right, it's nasty, but it's how fork works. I would say it's
something that should be fixed, it's so nasty, but with the new all-Java
replication, it's probably a moot point.

Forking for a small script from something that can have such a large memory
footprint is just a huge waste of resources. Ideally you might have a tiny
program running, listening on a socket or something, and it can be alerted
and do the actual fork (being small itself). Or some other such workaround,
other than copying a few gigs into RAM or swap :) The new all-Java
replication looks a little nicer in the face of this (someone was asking
about the differences earlier).

- Mark

Brian Whitman wrote:
> Thanks for the pointer. (It seems really weird to alloc 5GB of swap just
> because the JVM needs to run a shell script, but I get Hoss's explanation
> in the following post.)
>
> On Fri, Jan 2, 2009 at 2:37 PM, Bill Au wrote:
>> add more swap space:
>> http://www.nabble.com/Not-enough-space-to11423199.html#a11424938
>> Bill
>>
>> On Fri, Jan 2, 2009 at 10:52 AM, Brian Whitman wrote:
>>> I have an indexing machine on a test server (a mid-level EC2 instance,
>>> 8GB of RAM) and I run jetty like:
>>>
>>> java -server -Xms5g -Xmx5g -XX:MaxPermSize=128m
>>>   -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/heap
>>>   -Dsolr.solr.home=/vol/solr -Djava.awt.headless=true -jar start.jar
>>>
>>> The indexing master is set to snapshoot on commit. Sometimes (not
>>> always) the snapshot fails with:
>>>
>>> SEVERE: java.io.IOException: Cannot run program
>>> "/vol/solr/bin/snapshooter": java.io.IOException: error=12, Cannot
>>> allocate memory
>>>   at java.lang.ProcessBuilder.start(Unknown Source)
>>>
>>> Why would snapshooter need more than 2GB of RAM? /proc/meminfo says
>>> (with Solr running and nothing else):
>>>
>>> MemTotal:      7872040 kB
>>> MemFree:       2018404 kB
>>> Buffers:         67704 kB
>>> Cached:        2161880 kB
>>> SwapCached:          0 kB
>>> Active:        3446348 kB
>>> Inactive:      2186964 kB
>>> SwapTotal:           0 kB
>>> SwapFree:            0 kB
>>> Dirty:               8 kB
>>> Writeback:           0 kB
>>> AnonPages:     3403728 kB
>>> Mapped:          12016 kB
>>> Slab:            37804 kB
>>> SReclaimable:    20048 kB
>>> SUnreclaim:      17756 kB
>>> PageTables:       7476 kB
>>> NFS_Unstable:        0 kB
>>> Bounce:              0 kB
>>> CommitLimit:   3936020 kB
>>> Committed_AS:  5383624 kB
>>> VmallocTotal: 34359738367 kB
>>> VmallocUsed:       340 kB
>>> VmallocChunk: 34359738027 kB
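[Editor's note: a rough illustration of Mark's "tiny program listening on a
socket" idea above. This is only a sketch - the port number is made up and
the script path is hard-coded - but it shows the shape: the helper runs in
its own small JVM, so the expensive fork+exec happens from a process with a
tiny footprint rather than the 5 GB Solr one.]

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.ServerSocket;
import java.net.Socket;

public class TinyForker {
    public static void main(String[] args) throws Exception {
        // Port and script path are arbitrary choices for this sketch.
        ServerSocket server = new ServerSocket(18983);
        while (true) {
            Socket client = server.accept();
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(client.getInputStream()));
            if ("snapshoot".equals(in.readLine())) {
                // The fork+exec happens from this small process, so the
                // OS memory check sees a tiny address space, not 5 GB.
                new ProcessBuilder("/vol/solr/bin/snapshooter").start();
            }
            client.close();
        }
    }
}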
Re: cannot allocate memory for snapshooter
On Sun, Jan 4, 2009 at 8:07 PM, Mark Miller wrote:
> Forking for a small script on something that can have such a large memory
> footprint is just a huge waste of resources. Ideally you might have a tiny
> program running, listening on a socket or something, and it can be alerted
> and do the actual fork (being small itself). Or some other such workaround,
> other than copying a few gig into RAM or swap :)

Well, fork doesn't actually copy anymore (for a long time now) - it's
really only the page tables that get copied and set to copy-on-write, so
the fork is actually pretty lightweight. The issue is that the OS is being
conservative and checking that there would be enough RAM+swap available if
all of the process address space did have to be copied/allocated (older
versions of Linux didn't do this check and allowed memory overcommit). The
OS doesn't know that the fork will be followed by an exec.

So the workaround of creating more swap is just so that this OS memory
overcommit check passes. The swap won't actually be used by the fork +
exec. The real fix would be for the JVM to use something like vfork when
available.

-Yonik
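[Editor's note: on a typical Linux box, Bill's "add more swap" workaround
can be as simple as the commands below; the path and size are illustrative,
and as Yonik explains, the swap exists only so the overcommit check passes
and should stay essentially unused.]

# Create and enable a 6 GB swap file (run as root; path/size are examples).
dd if=/dev/zero of=/mnt/swapfile bs=1M count=6144
mkswap /mnt/swapfile
swapon /mnt/swapfile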
Re: cannot allocate memory for snapshooter
Yonik Seeley wrote:
> Well, fork doesn't actually copy anymore (for a long time now) - it's
> really only the page tables that get copied and set to copy-on-write, so
> the fork is actually pretty lightweight.

Right, copying was the wrong word. It depends on your Unix variant: it may
actually use vfork under the covers, or sometimes you have no option to
share at all. (Because the child can screw with the parent, I've seen
warnings in the docs that vfork is not recommended even for use - but this
could be old now, and was for a particular version of Unix that I don't
remember.)

> The issue is that the OS is being conservative and checking that there
> would be enough RAM+swap available if all of the process address space
> did have to be copied/allocated (older versions of Linux didn't do this
> check and allowed memory overcommit). The OS doesn't know that the fork
> will be followed by an exec.

I don't think you can just count on that in a Unix environment. Maybe Linux
took care of it, but is that common on all versions of Unix? And what if
you have an older version of Linux?

> So the workaround of creating more swap is just so that this OS memory
> overcommit check passes. The swap won't actually be used by the fork +
> exec.

Again, only if you're lucky. It depends on the many implementations of
fork. A lot of times fork is actually vfork or something similar, but Solr
can't count on that for everybody, I wouldn't think.

> The real fix would be for the JVM to use something like vfork when
> available.

Which kind of happens under the covers already, if you're lucky. Some Unix
guys don't like it, and I assume that's why it's not the standard (they are
overly concerned with the child process mucking up the parent process).

I shouldn't have said copy - the issue is that we are asking for way too
much RAM. A JVM using 5 gig will look for another 5 - that's terrible. I
don't think we can solve it in a universal way for Unix by relying on how
the JVM forks, though. It's hit or miss. The real fix can't depend on your
OS variant and its version, I wouldn't think.
Re: cannot allocate memory for snapshooter
Hey Brian, I didn't catch what OS you are using on EC2, by the way. I
thought most Unix OSes were using memory overcommit - a quick search brings
up Linux, AIX, and HP-UX, and maybe even OS X? What are you running over
there? EC2, so Linux I assume?

Yonik: I take it that now that Linux uses copy-on-write, they stopped with
the memory overcommit? Or perhaps Brian is not on Linux...

I love Unix stuff :) With so much Java and Windows in my past, I still find
this info really cool. The Windows command line was never so interesting.

- Mark
Re: collectionDistribution vs SolrReplication
* SolrReplication does not create snapshots, so you have less cleanup to
  do. The script-based replication results in more disk space consumption
  (especially if you do frequent commits).
* Performance is roughly the same, unless you are replicating across a
  different LAN, where SolrReplication can zip the files before transfer.

On Sun, Jan 4, 2009 at 11:57 AM, Shalin Shekhar Mangar wrote:
> I think the main reason is ease of use. Warming is done the same way, by
> adding a newSearcher listener in solrconfig.xml.
>
> On Sun, Jan 4, 2009 at 2:10 AM, Marc Sturlese wrote:
>>
>> Hey there,
>>
>> I would like to know the advantages of moving from a master-slave system
>> using CollectionDistribution with all its .sh scripts
>> http://wiki.apache.org/solr/CollectionDistribution
>>
>> to using SolrReplication and its solrconfig.xml configuration:
>> http://wiki.apache.org/solr/SolrReplication
>>
>> Is it technically much better, or mainly easier to use?
>> Does SolrReplication do warming as well?
>>
>> Checking the performance numbers in the SolrReplication wiki page, things
>> seem to be similar except for the RAM; are the advantages about that?
>>
>> Thanks in advance!!
>> --
>> View this message in context:
>> http://www.nabble.com/collectionDistribution-vs-SolrReplication-tp21269112p21269112.html
>> Sent from the Solr - User mailing list archive at Nabble.com.
>
> --
> Regards,
> Shalin Shekhar Mangar.

--
--Noble Paul
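[Editor's note: for reference, the SolrReplication wiki page Marc links
describes solrconfig.xml setup along these lines; the master host, poll
interval, and confFiles list below are illustrative, so check the wiki for
the authoritative syntax. On the master:]

<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="master">
    <str name="replicateAfter">commit</str>
    <str name="confFiles">schema.xml,stopwords.txt</str>
  </lst>
</requestHandler>

[and on each slave:]

<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <str name="masterUrl">http://master_host:8983/solr/replication</str>
    <str name="pollInterval">00:00:60</str>
  </lst>
</requestHandler>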