solr / lucene engineering positions in Boston, MA USA @ the Echo Nest
Hi all, a brief message to let you know that we're in heavy hire mode at the Echo Nest. As many of you know we are very heavy solr/lucene users (~1bn documents across many, many servers) and a lot of our staff have been working with and contributing to the projects over the years. We are a "music intelligence" company -- we crawl the web and do a lot of fancy math on music audio and text, and then provide things like recommendation, feeds, remix capabilities and playlisting to a lot of music labels, social networks and small developers via a very popular API. We are especially interested in people with Lucene & Solr experience who aren't afraid to get into the guts and push it to its limits. If any of these positions fit you please let me know. We are hiring full time in the Boston area (Davis Square, Somerville) for senior and junior engineers as well as data architects. http://the.echonest.com/company/jobs/ http://developer.echonest.com/docs/v4/ http://the.echonest.com/
autocommit commented out -- what is the default?
Hi, if you comment out the autoCommit block in solrconfig.xml, does this mean that (a) commits never happen automatically, or (b) some default autocommit is applied?
"document commit" possible?
Could the commit operation be adapted to just make the searchers aware of new stored content in a particular document? e.g. with the understanding that queries for newly indexed fields in this document will not return this newly added document, but a query for the document by its id will return any new stored fields. When the "real" commit (read: the commit that takes 10 minutes to complete) returns, the newly indexed fields will be query-able.
Re: diversity in results
On Aug 4, 2008, at 12:50 PM, Jason Rennie wrote: Is there any option in solr to encourage diversity in the results? Our solr index has millions of products, many of which are quite similar to each other. Even something simple like max 50% text overlap in successive results would be valuable. Does something like this exist in solr or are there any plans to add it? Not out of the box, but I would use the mlt handler on the first result and remove all the ones that appear in both the MLT and query response. B
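Roughly what I mean, as an (untested) solrj sketch -- the /mlt handler registration, the "id" uniqueKey and the "text" field are assumptions, adjust for your schema and solrconfig:

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrDocumentList;

public class MltDedupe {
  public static void main(String[] args) throws Exception {
    SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

    // the normal query
    SolrQuery q = new SolrQuery("digital camera");
    q.setRows(20);
    SolrDocumentList docs = server.query(q).getResults();
    if (docs.isEmpty()) return;

    // ask the MLT handler what looks like the top hit
    SolrQuery mlt = new SolrQuery("id:" + docs.get(0).getFieldValue("id"));
    mlt.set("qt", "/mlt");
    mlt.set("mlt.fl", "text");
    mlt.setRows(20);
    Set<Object> similar = new HashSet<Object>();
    for (SolrDocument d : server.query(mlt).getResults()) {
      similar.add(d.getFieldValue("id"));
    }

    // keep the top hit, drop later hits that the MLT handler also returned
    List<SolrDocument> diverse = new ArrayList<SolrDocument>();
    diverse.add(docs.get(0));
    for (int i = 1; i < docs.size(); i++) {
      if (!similar.contains(docs.get(i).getFieldValue("id"))) {
        diverse.add(docs.get(i));
      }
    }
    System.out.println(diverse.size() + " results after de-duping against the top hit");
  }
}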
partialResults, distributed search & SOLR-502
I was going to file a ticket like this: "A SOLR-303 query with &shards=host1,host2,host3 when host3 is down returns an error. One of the advantages of a shard implementation is that data can be stored redundantly across different shards, either as direct copies (e.g. when host1 and host3 are snapshooter'd copies of each other) or where there is some "data RAID" that stripes indexes for redundancy." But then I saw SOLR-502, which appears to be committed. If I have the above scenario (host1,host2,host3 where host3 is not up) and set a timeAllowed, will I still get a 400 or will it come back with "partial" results? If not, can we think of a way to get this to work? It's my understanding already that duplicate docIDs are merged in the SOLR-303 response, so other than building in some "this host isn't working, just move on and report it" and of course the work to index redundantly, we wouldn't need anything to achieve a good redundant shard implementation. B
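For concreteness, the request I'm describing looks something like this from solrj (the host names and the one-second timeAllowed are made up):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class ShardQuery {
  public static void main(String[] args) throws Exception {
    SolrServer server = new CommonsHttpSolrServer("http://host1:8983/solr");

    SolrQuery q = new SolrQuery("*:*");
    q.set("shards", "host1:8983/solr,host2:8983/solr,host3:8983/solr");
    q.set("timeAllowed", 1000); // SOLR-502's time budget, in milliseconds
    // the question above: if host3 refuses the connection, does this throw
    // or come back with partial results?
    QueryResponse rsp = server.query(q);
    System.out.println(rsp.getResults().getNumFound() + " docs found");
  }
}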
Re: partialResults, distributed search & SOLR-502
On Aug 18, 2008, at 11:51 AM, Ian Connor wrote: On Mon, Aug 18, 2008 at 9:31 AM, Ian Connor <[EMAIL PROTECTED]> wrote: I don't think this patch is working yet. If I take a shard out of rotation (even just one out of four), I get an error: org.apache.solr.client.solrj.SolrServerException: java.net.ConnectException: Connection refused It's my understanding that SOLR-502 is really only concerned with queries timing out (i.e. they connect but take over N seconds to return) If the connection gets refused then a non-solr java connection exception is thrown. Something would have to get put in that (optionally) catches connection errors and still builds the response from the shards that did respond. On Fri, Aug 15, 2008 at 1:23 PM, Brian Whitman <[EMAIL PROTECTED] > wrote: I was going to file a ticket like this: "A SOLR-303 query with &shards=host1,host2,host3 when host3 is down returns an error. One of the advantages of a shard implementation is that data can be stored redundantly across different shards, either as direct copies (e.g. when host1 and host3 are snapshooter'd copies of each other) or where there is some "data RAID" that stripes indexes for redundancy." But then I saw SOLR-502, which appears to be committed. If I have the above scenario (host1,host2,host3 where host3 is not up) and set a timeAllowed, will I still get a 400 or will it come back with "partial" results? If not, can we think of a way to get this to work? It's my understanding already that duplicate docIDs are merged in the SOLR-303 response, so other than building in some "this host isn't working, just move on and report it" and of course the work to index redundantly, we wouldn't need anything to achieve a good redundant shard implementation. B -- Regards, Ian Connor -- Regards, Ian Connor -- http://variogr.am/
Re: partialResults, distributed search & SOLR-50
On Aug 18, 2008, at 12:31 PM, Yonik Seeley wrote: On Mon, Aug 18, 2008 at 12:16 PM, Otis Gospodnetic <[EMAIL PROTECTED]> wrote: Yes, as far as I know, what Brian said is correct. Also, as far as I know, there is nothing that gracefully handles problematic Solr instances during distributed search. Right... we punted that issue to a load balancer (which assumes that you have more than one copy of each shard). Can you explain how you have a LB handling shards? Do you put a separate LB in front of each group of replica shards?
Re: which shard is a result coming from
On Aug 19, 2008, at 8:49 AM, Ian Connor wrote: What is the current "special requestHandler" that you can set currently? If you're referring to my issue post, that's just something we have internally (not in trunk solr) that we use instead of /update -- it just inserts a hostname:port/solr into the incoming XML doc add stream. Not very clean but it works. Use lars's patch.
in a RequestHandler's init, how to get solr data dir?
I want to be able to store non-solr data in solr's data directory (like solr/solr/data/stored alongside solr/solr/data/index). The java class that sets up this data is instantiated from a RequestHandlerBase subclass like:

public class StoreDataHandler extends RequestHandlerBase {
  StoredData storedData;

  @Override
  public void init(NamedList args) {
    super.init(args);
    String dataDirectory = // ??? -- this is the part I'm missing
    storedData = new StoredData(dataDirectory);
  }

  @Override
  public void handleRequestBody(SolrQueryRequest req, SolrQueryResponse rsp) throws Exception ...
}

req.getCore() etc. will eventually get me solr's data directory location, but how do I get it in the init method? I want to init the data store once on solr launch, not on every call. What do I replace those above with?
Re: in a RequestHandler's init, how to get solr data dir?
On Aug 26, 2008, at 12:24 PM, Shalin Shekhar Mangar wrote: Hi Brian, You can implement the SolrCoreAware interface which will give you access to the SolrCore object through the SolrCoreAware#inform method you will need to implement. It is called after the init method. Shalin, that worked. Thanks a ton!
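For the archives, a minimal sketch of what that ends up looking like -- a plain File stands in for the StoredData class from my question, and the "stored" subdirectory name is just an example:

import java.io.File;
import org.apache.solr.common.util.NamedList;
import org.apache.solr.core.SolrCore;
import org.apache.solr.handler.RequestHandlerBase;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.request.SolrQueryResponse;
import org.apache.solr.util.plugin.SolrCoreAware;

public class StoreDataHandler extends RequestHandlerBase implements SolrCoreAware {
  private File storedDir; // stands in for the StoredData class from the original question

  @Override
  public void init(NamedList args) {
    super.init(args);
    // can't touch the data dir yet -- the core isn't available here
  }

  // called exactly once, after init(), when the core is ready
  public void inform(SolrCore core) {
    storedDir = new File(core.getDataDir(), "stored");
    storedDir.mkdirs();
  }

  @Override
  public void handleRequestBody(SolrQueryRequest req, SolrQueryResponse rsp) throws Exception {
    // storedDir is already set up by the time requests arrive
    rsp.add("storedDir", storedDir.getAbsolutePath());
  }

  // SolrInfoMBean boilerplate
  public String getDescription() { return "keeps side data under solr's data dir"; }
  public String getSource() { return "$URL$"; }
  public String getSourceId() { return "$Id$"; }
  public String getVersion() { return "1.0"; }
}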
Re: Adding a field?
On Aug 26, 2008, at 3:09 PM, Jon Drukman wrote: Is there a way to add a field to an existing index without stopping the server, deleting the index, and reloading every document from scratch? You can add a field to the schema at any time without adversely affecting the rest of the index. You have to restart the server, but you don't have to re-index existing documents. Of course, only documents indexed with that field will match queries against it. You can also define dynamic fields like x_* which would let you add any field name you want without restarting the server.
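A tiny solrj sketch of the dynamic-field route -- this assumes an x_* dynamicField is already declared in your schema and the stock example server URL:

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class AddDynamicField {
  public static void main(String[] args) throws Exception {
    SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", "doc-with-a-new-field");
    // never declared explicitly in schema.xml -- matched by the x_* dynamicField
    doc.addField("x_mood", "banging");
    server.add(doc);
    server.commit();
  }
}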
UpdateRequestProcessorFactory / Chain etc
Trying to build a simple UpdateRequestProcessor that keeps a field (the time of original index) when overwriting a document. 1) Can I make an updateRequestProcessor chain apply only to a certain handler, or does declaring the chain in my solrconfig.xml make it handle all document updates? 2) Does an UpdateRequestProcessor support inform?
Re: UpdateRequestProcessorFactory / Chain etc
Answered my own questions, I think: Trying to build a simple UpdateRequestProcessor that keeps a field (the time of original index) when overwriting a document. 1) Can I make an updateRequestProcessor chain apply only to a certain handler, or does declaring it in my solrconfig.xml make it handle all document updates? What you have to do is register a second update handler (e.g. /update2, class="solr.XmlUpdateRequestHandler") whose defaults point at the KeepIndexed chain. Then calls to /update2 will go through the chain; calls to /update will not. 2) Does an UpdateRequestProcessor support inform? No, not that I can tell. And the factory won't get instantiated until the first time you use it.
Re: UpdateRequestProcessorFactory / Chain etc
Hm... I seem to be having trouble getting either the Factory or the Processor to do an init() for me. The end result I'd like to see is a function that gets called only once, either on solr init or the first time the handler is called. I can't seem to do that. I have these two classes: public class KeepIndexedDateFactory extends UpdateRequestProcessorFactory with a getInstance method, and then class KeepIndexedDateProcessor extends UpdateRequestProcessor with a processAdd method. The init() on both classes is never called, ever. The getInstance() method of the first class is called every time I add a doc, so I can't init stuff there. inform() of the first class is called if I add "implements SolrCoreAware" -- but the class I need to instantiate once is only needed in the second class. I hope this makes sense -- java is not my first language.
Re: UpdateRequestProcessorFactory / Chain etc
On Sep 7, 2008, at 2:04 PM, Brian Whitman wrote: Hm... I seem to be having trouble getting either the Factory or the Processor to do an init() for me. The end result I'd like to see is a function that gets called only once, either on solr init or the first time the handler is called. I can't seem to do that. Here's my code, and a solution I think works -- is there a better way to do this:

public class KeepIndexedDateFactory extends UpdateRequestProcessorFactory implements SolrCoreAware {
  DataClassIWantToInstantiateOnce data;

  // called once, when the core is ready
  public void inform(SolrCore core) {
    data = new DataClassIWantToInstantiateOnce(null);
  }

  // called on every add -- so only hand the already-built data to the new processor here
  public UpdateRequestProcessor getInstance(SolrQueryRequest req, SolrQueryResponse rsp, UpdateRequestProcessor next) {
    KeepIndexedDateProcessor p = new KeepIndexedDateProcessor(next);
    p.associateData(data);
    return p;
  }
}

class KeepIndexedDateProcessor extends UpdateRequestProcessor {
  DataClassIWantToInstantiateOnce data;

  public KeepIndexedDateProcessor(UpdateRequestProcessor next) {
    super(next);
  }

  public void associateData(DataClassIWantToInstantiateOnce d) {
    data = d;
  }

  @Override
  public void processAdd(AddUpdateCommand cmd) throws IOException {
    SolrInputDocument doc = cmd.getSolrInputDocument();
    Object id = doc.getFieldValue("id");
    if (id != null) {
      // look up the existing doc and carry its "indexed" date over to the new one
      SolrQuery getIndexedDatesOfId = new SolrQuery();
      getIndexedDatesOfId.setQuery("id:" + id.toString());
      getIndexedDatesOfId.setFields("indexed");
      getIndexedDatesOfId.setRows(1);
      QueryResponse qr = data.query(getIndexedDatesOfId);
      if (qr != null && qr.getResults() != null && qr.getResults().size() > 0) {
        Date thisIndexed = (Date) qr.getResults().get(0).getFieldValue("indexed");
        doc.setField("indexed", thisIndexed);
      }
    }
    // pass it up the chain
    super.processAdd(cmd);
  }
}
RequestHandler that passes along the query
Not sure if this is possible or easy: I want to make a requestHandler that acts just like select but does stuff with the output before returning it to the client. e.g. http://url/solr/myhandler?q=type:dog&sort=legsdesc&shards=dogserver1;dogserver2 When myhandler gets it, I'd like to take the results of that query as if I sent it to select, then do stuff with the output before returning it. For example, it would add a field to each returned document from an external data store. This is sort of like an UpdateRequestProcessor chain thing, but for the select side. Is this possible? Alternately, I could have my custom RequestHandler do the query. But all I have in the RequestHandler is a SolrQueryRequest. Can I pass that along to something and get a SolrDocumentList back?
Re: RequestHandler that passes along the query
Thanks grant and ryan, so far so good. But I am confused about one thing - when I set this up like: public void process(ResponseBuilder rb) throws IOException { And put it as the last-component on a distributed search (a defaults shard is defined in the solrconfig for the handler), the component never does its thing. I looked at the TermVectorComponent implementation and it instead defines public int distributedProcess(ResponseBuilder rb) throws IOException { And when I implemented that method it works. Is there a way to define just one method that will work with both distributed and normal searches? On Fri, Oct 3, 2008 at 4:41 PM, Grant Ingersoll <[EMAIL PROTECTED]> wrote: > No need to even write a new ReqHandler if you're using 1.3: > http://wiki.apache.org/solr/SearchComponent >
Re: RequestHandler that passes along the query
Sorry for the extended question, but I am having trouble making SearchComponent that can actually get at the returned response in a distributed setup. In my distributedProcess: public int distributedProcess(ResponseBuilder rb) throws IOException { How can I get at the returned results from all shards? I want to get at really the rendered response right before it goes back to the client so I can add some information based on what came back. The TermVector example seems to get at rb.resultIds (which is not public and I can't use in my plugin) and then sends a request back to the shards to get the stored fields (using ShardDoc.id, another field I don't have access to.) Instead of doing all of that I'd like to just "peek" into the response that is about to be written to the client. I tried getting at rb.rsp but the data is not filled in during the last stage (GET_FIELDS) that distributedProcess gets called for. On Sat, Oct 4, 2008 at 10:12 AM, Brian Whitman <[EMAIL PROTECTED]> wrote: > Thanks grant and ryan, so far so good. But I am confused about one thing - > when I set this up like: > > public void process(ResponseBuilder rb) throws IOException { > > And put it as the last-component on a distributed search (a defaults shard > is defined in the solrconfig for the handler), the component never does its > thing. I looked at the TermVectorComponent implementation and it instead > defines > > public int distributedProcess(ResponseBuilder rb) throws IOException { > > And when I implemented that method it works. Is there a way to define just > one method that will work with both distributed and normal searches? > > > > On Fri, Oct 3, 2008 at 4:41 PM, Grant Ingersoll <[EMAIL PROTECTED]>wrote: > >> No need to even write a new ReqHandler if you're using 1.3: >> http://wiki.apache.org/solr/SearchComponent >> >
Re: RequestHandler that passes along the query
The issue I think is that process() is never called in my component, just distributedProcess. The server that hosts the component is a separate solr instance from the shards, so my guess is process() is only called when that particular solr instance has something to do with the index. distributedProcess() is called for each of those stages, but the last stage it is called for is GET_FIELDS. But the WritingDistributedSearchComponents page did tip me off to a new function, finishStage, that is called *after* each stage is done and does exactly what I want: @Override public void finishStage(ResponseBuilder rb) { if(rb.stage == ResponseBuilder.STAGE_GET_FIELDS) { SolrDocumentList sd = (SolrDocumentList) rb.rsp.getValues().get( "response"); for (SolrDocument d : sd) { rb.rsp.add("second-id-list", d.getFieldValue("id").toString()); } } } On Sat, Oct 4, 2008 at 1:37 PM, Ryan McKinley <[EMAIL PROTECTED]> wrote: > I'm not totally on top of how distributed components work, but check: > http://wiki.apache.org/solr/WritingDistributedSearchComponents > > and: > https://issues.apache.org/jira/browse/SOLR-680 > > Do you want each of the shards to append values? or just the final result? > If appending the values is not a big resource hog, it may make sense to > only do that in the main "process" block. If that is the case, I *think* > you just implement: process(ResponseBuilder rb) > > ryan > > > > On Oct 4, 2008, at 1:06 PM, Brian Whitman wrote: > > Sorry for the extended question, but I am having trouble making >> SearchComponent that can actually get at the returned response in a >> distributed setup. >> In my distributedProcess: >> >> public int distributedProcess(ResponseBuilder rb) throws IOException { >> >> How can I get at the returned results from all shards? I want to get at >> really the rendered response right before it goes back to the client so I >> can add some information based on what came back. >> >> The TermVector example seems to get at rb.resultIds (which is not public >> and >> I can't use in my plugin) and then sends a request back to the shards to >> get >> the stored fields (using ShardDoc.id, another field I don't have access >> to.) >> Instead of doing all of that I'd like to just "peek" into the response >> that >> is about to be written to the client. >> >> I tried getting at rb.rsp but the data is not filled in during the last >> stage (GET_FIELDS) that distributedProcess gets called for. >> >> >> >> On Sat, Oct 4, 2008 at 10:12 AM, Brian Whitman <[EMAIL PROTECTED]> >> wrote: >> >> Thanks grant and ryan, so far so good. But I am confused about one thing >>> - >>> when I set this up like: >>> >>> public void process(ResponseBuilder rb) throws IOException { >>> >>> And put it as the last-component on a distributed search (a defaults >>> shard >>> is defined in the solrconfig for the handler), the component never does >>> its >>> thing. I looked at the TermVectorComponent implementation and it instead >>> defines >>> >>> public int distributedProcess(ResponseBuilder rb) throws IOException { >>> >>> And when I implemented that method it works. Is there a way to define >>> just >>> one method that will work with both distributed and normal searches? >>> >>> >>> >>> On Fri, Oct 3, 2008 at 4:41 PM, Grant Ingersoll <[EMAIL PROTECTED] >>> >wrote: >>> >>> No need to even write a new ReqHandler if you're using 1.3: >>>> http://wiki.apache.org/solr/SearchComponent >>>> >>>> >>> >
maxCodeLen in the doublemetaphone solr analyzer
I want to change the maxCodeLen param that is in Solr 1.3's doublemetaphone plugin. Doc is here: http://commons.apache.org/codec/apidocs/org/apache/commons/codec/language/DoubleMetaphone.html Is this something I can do in solrconfig or do I need to change it and recompile?
Re: maxCodeLen in the doublemetaphone solr analyzer
oh, thanks! I didn't see that patch. On Thu, Nov 13, 2008 at 3:40 PM, Feak, Todd <[EMAIL PROTECTED]> wrote: > There's a patch in to do that as a separate filter. See > https://issues.apache.org/jira/browse/SOLR-813 >
matching exact terms
This is probably severe user error, but I am curious about how to index docs so that a query for "happy birthday" returns the doc with n_name:"Happy Birthday" before the doc with n_name:"Happy Birthday, Happy Birthday". As it is now, the latter appears first for a query of n_name:"happy birthday" and the former second. It would be great to do this at query time instead of having to re-index, but I will if I have to! The n_* type is defined as:
cannot allocate memory for snapshooter
I have an indexing machine on a test server (a mid-level EC2 instance, 8GB of RAM) and I run jetty like: java -server -Xms5g -Xmx5g -XX:MaxPermSize=128m -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/heap -Dsolr.solr.home=/vol/solr -Djava.awt.headless=true -jar start.jar The indexing master is set to snapshoot on commit. Sometimes (not always) the snapshot fails with SEVERE: java.io.IOException: Cannot run program "/vol/solr/bin/snapshooter": java.io.IOException: error=12, Cannot allocate memory at java.lang.ProcessBuilder.start(Unknown Source) Why would snapshooter need more than 2GB ram? /proc/meminfo says (with solr running & nothing else):
MemTotal: 7872040 kB
MemFree: 2018404 kB
Buffers: 67704 kB
Cached: 2161880 kB
SwapCached: 0 kB
Active: 3446348 kB
Inactive: 2186964 kB
SwapTotal: 0 kB
SwapFree: 0 kB
Dirty: 8 kB
Writeback: 0 kB
AnonPages: 3403728 kB
Mapped: 12016 kB
Slab: 37804 kB
SReclaimable: 20048 kB
SUnreclaim: 17756 kB
PageTables: 7476 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
CommitLimit: 3936020 kB
Committed_AS: 5383624 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 340 kB
VmallocChunk: 34359738027 kB
debugging long commits
We have a distributed setup that has been experiencing glacially slow commit times on only some of the shards. (10s on a good shard, 263s on a slow shard.) Each shard for this index has about 10GB of lucene index data and the documents are segregated by an md5 hash, so the distribution of document/data types should be equal across all shards. I've turned off our postcommit hooks to isolate the problem, so it's not a snapshot run amok or anything. I also moved the indexes over to new machines and the same indexes that were slow in production are also slow on the test machines. During the slow commit, the jetty process is 100% CPU / 50% RAM on a 8GB quad core machine. The slow commit happens every time after I add at least one document. (If I don't add any documents the commit is immediate.) What can I do to look into this problem?
Re: debugging long commits
ng on condition [0x..0x409303e0] java.lang.Thread.State: RUNNABLE "Signal Dispatcher" daemon prio=10 tid=0x2aabf9337400 nid=0x5da9 waiting on condition [0x..0x408306b0] java.lang.Thread.State: RUNNABLE "Finalizer" daemon prio=10 tid=0x2aabf9314400 nid=0x5da8 in Object.wait() [0x4072f000..0x4072faa0] java.lang.Thread.State: WAITING (on object monitor) at java.lang.Object.wait(Native Method) - waiting on <0x2aaabeb86f50> (a java.lang.ref.ReferenceQueue$Lock) at java.lang.ref.ReferenceQueue.remove(Unknown Source) - locked <0x2aaabeb86f50> (a java.lang.ref.ReferenceQueue$Lock) at java.lang.ref.ReferenceQueue.remove(Unknown Source) at java.lang.ref.Finalizer$FinalizerThread.run(Unknown Source) "Reference Handler" daemon prio=10 tid=0x2aabf9312800 nid=0x5da7 in Object.wait() [0x4062e000..0x4062ed20] java.lang.Thread.State: WAITING (on object monitor) at java.lang.Object.wait(Native Method) - waiting on <0x2aaabeb86ec8> (a java.lang.ref.Reference$Lock) at java.lang.Object.wait(Object.java:485) at java.lang.ref.Reference$ReferenceHandler.run(Unknown Source) - locked <0x2aaabeb86ec8> (a java.lang.ref.Reference$Lock) "VM Thread" prio=10 tid=0x2aabf917a000 nid=0x5da6 runnable "GC task thread#0 (ParallelGC)" prio=10 tid=0x4011c800 nid=0x5da4 runnable "GC task thread#1 (ParallelGC)" prio=10 tid=0x4011e000 nid=0x5da5 runnable "VM Periodic Task Thread" prio=10 tid=0x2aabf9342000 nid=0x5dad waiting on condition JNI global references: 971 Heap PSYoungGen total 1395264K, used 965841K [0x2aab8d4b, 0x2aabf7f5, 0x2aabf7f5) eden space 1030400K, 93% used [0x2aab8d4b,0x2aabc83d4788,0x2aabcc2f) from space 364864K, 0% used [0x2aabe1b0,0x2aabe1b1,0x2aabf7f5) to space 352320K, 0% used [0x2aabcc2f,0x2aabcc2f,0x2aabe1b0) PSOldGentotal 3495296K, used 642758K [0x2aaab7f5, 0x2aab8d4b, 0x2aab8d4b) object space 3495296K, 18% used [0x2aaab7f5,0x2aaadf301a78,0x2aab8d4b) PSPermGen total 21248K, used 19258K [0x2ff5, 0x2aaab141, 0x2aaab7f5) object space 21248K, 90% used [0x2ff5,0x2aaab121e8d8,0x2aaab141) num #instances #bytes class name -- 1: 6459678 491568792 [C 2: 6456059 258242360 java.lang.String 3: 6282264 251290560 org.apache.lucene.index.TermInfo 4: 6282189 201030048 org.apache.lucene.index.Term 5: 70220 39109632 [I 6: 6082 25264288 [B 7: 149 20355504 [J 8: 133 20354208 [Lorg.apache.lucene.index.Term; 9: 133 20354208 [Lorg.apache.lucene.index.TermInfo; 10:1602308972880 java.nio.HeapByteBuffer 11:1602188972208 java.nio.HeapCharBuffer 12:1602108971760 org.apache.lucene.index.FieldsReader$FieldForMerge 13: 304404095480 14: 304403660128 15: 26053026184 16: 220653025120 [Ljava.lang.Object; 17: 12972411792 [Ljava.util.HashMap$Entry; 18: 486912309696 19: 26041981728 20: 21941889888 21: 274441317312 java.util.HashMap$Entry 22: 24954 998160 java.util.AbstractList$Itr 23: 18834 753360 org.apache.lucene.index.FieldInfo 24: 2846 523664 java.lang.Class 25: 13021 520840 java.util.ArrayList 26: 12471 399072 org.apache.lucene.document.Document 27: 3895 372216 [[I 28: 3904 309592 [S 29: 534 249632 30: 3451 220864 org.apache.lucene.index.SegmentReader$Norm 31: 1547 136136 org.apache.lucene.store.FSDirectory$FSIndexInput 32: 213 120984 33: 737 112024 java.lang.reflect.Method 34: 1575 100800 java.lang.ref.Finalizer 35: 1345 86080 org.apache.lucene.store.FSDirectory$FSIndexInput$Descriptor 36: 1188 76032 java.util.HashMap ... On Fri, Jan 2, 2009 at 11:20 AM, Brian Whitman wrote: > We have a distributed setup that has been experiencing glacially slow > commit times on only some of the shards. (10s on a good shard, 263s on a > slow shard.) 
Each shard for this index has about 10GB of lucene index data > and the documents are segregated by an md5 hash, so the distribution of > document/data types should be equal across all shards. I've turned off our > postcommit hooks to isolate the problem, so it's not a snapshot run amok or > anything. I also moved the indexes over to new machines and the same indexes > that were slow in production are also slow on the test machines. > During the slow commit, the jetty process is 100% CPU / 50% RAM on a 8GB > quad core machine. The slow commit happens every time after I add at least > one document. (If I don't add any documents the commit is immediate.) > > What can I do to look into this problem? > > > >
Re: debugging long commits
I think I'm getting close with this (sorry for the self-replies) I tried an optimize (which we never do) and it took 30m and said this a lot: Exception in thread "Lucene Merge Thread #4" org.apache.lucene.index.MergePolicy$MergeException: java.lang.ArrayIndexOutOfBoundsException: Array index out of range: 34950 at org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:314) at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:291) Caused by: java.lang.ArrayIndexOutOfBoundsException: Array index out of range: 34950 at org.apache.lucene.util.BitVector.get(BitVector.java:91) at org.apache.lucene.index.SegmentTermDocs.next(SegmentTermDocs.java:125) at org.apache.lucene.index.SegmentTermPositions.next(SegmentTermPositions.java:98) at org.apache.lucene.index.SegmentMerger.appendPostings(SegmentMerger.java:633) at org.apache.lucene.index.SegmentMerger.mergeTermInfo(SegmentMerger.java:585) at org.apache.lucene.index.SegmentMerger.mergeTermInfos(SegmentMerger.java:546) at org.apache.lucene.index.SegmentMerger.mergeTerms(SegmentMerger.java:499) at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:139) at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4291) at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3932) at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:205) at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:260) Jan 2, 2009 6:05:49 PM org.apache.solr.common.SolrException log SEVERE: java.io.IOException: background merge hit exception: _ks4:C2504982 _oaw:C514635 _tll:C827949 _tdx:C18372 _te8:C19929 _tej:C22201 _1agw:C1717926 into _1agy [optimize] at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:2346) at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:2280) at org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:355) at org.apache.solr.update.processor.RunUpdateProcessor.processCommit(RunUpdateProcesso ... But then it finished. And now commits are OK again. Anyone know what the merge hit exception means and if i lost anything?
Re: cannot allocate memory for snapshooter
Thanks for the pointer. (It seems really weird to alloc 5GB of swap just because the JVM needs to run a shell script.. but I get hoss's explanation in the following post) On Fri, Jan 2, 2009 at 2:37 PM, Bill Au wrote: > add more swap space: > http://www.nabble.com/Not-enough-space-to11423199.html#a11424938 > > Bill > > On Fri, Jan 2, 2009 at 10:52 AM, Brian Whitman wrote: > > > I have an indexing machine on a test server (a mid-level EC2 instance, > 8GB > > of RAM) and I run jetty like: > > > > java -server -Xms5g -Xmx5g -XX:MaxPermSize=128m > > -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/heap > > -Dsolr.solr.home=/vol/solr -Djava.awt.headless=true -jar start.jar > > > > The indexing master is set to snapshoot on commit. Sometimes (not always) > > the snapshot fails with > > > > SEVERE: java.io.IOException: Cannot run program > > "/vol/solr/bin/snapshooter": > > java.io.IOException: error=12, Cannot allocate memory > > at java.lang.ProcessBuilder.start(Unknown Source) > > > > Why would snapshooter need more than 2GB ram? /proc/meminfo says (with > > solr > > running & nothing else) > > > > MemTotal: 7872040 kB > > MemFree: 2018404 kB > > Buffers: 67704 kB > > Cached:2161880 kB > > SwapCached: 0 kB > > Active:3446348 kB > > Inactive: 2186964 kB > > SwapTotal: 0 kB > > SwapFree:0 kB > > Dirty: 8 kB > > Writeback: 0 kB > > AnonPages: 3403728 kB > > Mapped: 12016 kB > > Slab:37804 kB > > SReclaimable:20048 kB > > SUnreclaim: 17756 kB > > PageTables: 7476 kB > > NFS_Unstable:0 kB > > Bounce: 0 kB > > CommitLimit: 3936020 kB > > Committed_AS: 5383624 kB > > VmallocTotal: 34359738367 kB > > VmallocUsed: 340 kB > > VmallocChunk: 34359738027 kB > > >
Re: cannot allocate memory for snapshooter
On Sun, Jan 4, 2009 at 9:47 PM, Mark Miller wrote: > Hey Brian, I didn't catch what OS you are using on EC2 by the way. I > thought most UNIX OS's were using memory overcommit - A quick search brings > up Linux, AIX, and HP-UX, and maybe even OSX? > > What are you running over there? EC2, so Linux I assume? > This is on debian, a 2.6.21 x86_64 kernel.
lazily loading search components?
We have a standard solr install that we use across a lot of different uses. In that install is a custom search component that loads a lot of data in its inform() method. This means the data is initialized on solr boot. Only about half of our installs actually ever call this search component, so the data sits around eating up heap. I could start splitting up our conf/ folders per solr install "type" but that seems wrong. I'd like to instead configure my search component to not have its inform() called until the first time it is actually called. Is this possible?
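If there's no built-in flag for this, the workaround I'm considering is keeping inform() cheap and deferring the expensive load until the first request, guarded so two concurrent first requests don't load it twice. A sketch -- the map and its load method are placeholders for whatever is actually eating the heap:

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.solr.handler.component.ResponseBuilder;
import org.apache.solr.handler.component.SearchComponent;

public class LazyDataComponent extends SearchComponent {
  private volatile Map<String, float[]> data; // stands in for the big in-memory structure

  private Map<String, float[]> data() {
    Map<String, float[]> d = data;
    if (d == null) {
      synchronized (this) {
        if (data == null) {
          data = loadTheBigThing(); // the expensive load that currently lives in inform()
        }
        d = data;
      }
    }
    return d;
  }

  private Map<String, float[]> loadTheBigThing() {
    // placeholder: read files, hit a db, whatever takes the time and heap today
    return new HashMap<String, float[]>();
  }

  @Override
  public void prepare(ResponseBuilder rb) throws IOException {
  }

  @Override
  public void process(ResponseBuilder rb) throws IOException {
    rb.rsp.add("lazy-data-size", data().size());
  }

  // SolrInfoMBean boilerplate
  public String getDescription() { return "search component with lazily loaded data"; }
  public String getSource() { return "$URL$"; }
  public String getSourceId() { return "$Id$"; }
  public String getVersion() { return "1.0"; }
}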
general survey of master/replica setups
Say you have a bunch of solr servers that index new data, and then some replica/"slave" setup that snappulls from the master on a cron or some schedule. Live internet-facing queries hit the replica, not the master, as indexes/commits on the master slow down queries. But even the query-only solr installs need to "snap-install" every so often, triggering a commit, and there is a slowdown in queries when this happens. Measured avg QTimes during normal times are 400ms; during commit/snapinstall times they climb into the seconds. Say in the 5m between snappulls 1000 documents have been updated/deleted/added. How do people mitigate the effect of the commit on replica query instances?
arcane queryParser parseException
server:/solr/select?q=field:"''anything can go here;" --> Lexical error, encountered after : "\"\'\'anything can go here"
server:/solr/select?q=field:"'anything' anything can go here;" --> Same problem
server:/solr/select?q=field:"'anything' anything can go here\;" --> No problem (but ClientUtils's escape does not escape semicolons.)
server:/solr/select?q=field:"anything can go here;" --> no problem
server:/solr/select?q=field:"''anything can go here" --> no problem
As far as I can tell, two apostrophes, then a semicolon causes the lexical error. There can be text within the apostrophes. If you leave out the semicolon it's ok. But you can keep the semicolon if you remove the two apostrophes. This is on trunk solr.
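One workaround in the meantime: a small client-side escaper (a sketch of my own, not anything shipped with solrj) that backslash-escapes the Lucene query metacharacters plus the legacy ';' before the value goes into the query string:

public class QueryEscaper {
  // Lucene query syntax specials, plus ';' which old-style Solr treats specially
  private static final String SPECIALS = "\\+-!():^[]\"{}~*?|&;";

  public static String escape(String s) {
    StringBuilder sb = new StringBuilder(s.length());
    for (int i = 0; i < s.length(); i++) {
      char c = s.charAt(i);
      if (SPECIALS.indexOf(c) >= 0) {
        sb.append('\\');
      }
      sb.append(c);
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    // the problem case from above comes out with the semicolon escaped
    System.out.println("field:\"" + escape("''anything can go here;") + "\"");
  }
}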
Re: arcane queryParser parseException
> > : I went ahead and added it since it does not hurt anything to escape more > : things -- it just makes the final string ugly. > > : In 1.3 the escape method covered everything: > > H good call, i didn't realize the escape method had been so > blanket in 1.3. this way we protect people who were using it in 1.3 and > relied on it to protect them from the legacy ";" behavior. Thanks hoss and ryan. That explains why the error was new to us-- we upgraded to trunk from 1.3 release and this exception came from a solrj processed query that used to work.
java.lang.NoSuchMethodError: org.apache.solr.common.util.ConcurrentLRUCache.getLatestAccessedItems(J)Ljava/util/Map;
Seeing this in the logs of an otherwise working solr instance. Commits are done automatically I believe every 10m or 1 docs. This is solr trunk (last updated last night) Any ideas? INFO: [] webapp=/solr path=/select params={fl=thingID,n_thingname,score&q=n_thingname:"Cornell+Dupree"^5+net_thingname:"Cornell+Dupree"^4+ne_thingname:"Cornell+Dupree"^2&wt=standard&fq=s_type:artist&rows=10&version=2.2} hits=2 status=0 QTime=37 Feb 24, 2009 5:05:53 PM org.apache.solr.core.SolrDeletionPolicy onCommit INFO: SolrDeletionPolicy.onCommit: commits:num=2 commit{dir=/vol/solr/data/index,segFN=segments_2cy,version=1224560226691,generation=3058,filenames=[_2yp.tvf, _2ys.prx, _2yo.fnm, _2yo.frq, _2yp.tvx, _2yn.fnm, segments_2cy, _2yp.tii, _2yr.tis, _2ys.tvf, _2yp.frq, _2yn.tvf, _2yr.tii, _2yr.nrm, _2ys.tvd, _2yn.fdx, _2yp.tis, _2yn.prx, _2yn.tvd, _2yn.tii, _2yn.fdt, _2yq.prx, _2yo.tvd, _2yp.fnm, _2yo.tvf, _2yr.fdt, _2ys.frq, _2yn.nrm, _2yr.fdx, _2yp.prx, _2yn.tis, _2yq.nrm, _2ys.tvx, _2ys.fnm, _2yo.fdt, _2yn.tvx, _2yq.tis, _2yo.fdx, _2yo.tii, _2yp.nrm, _2yq.tii, _2yr.frq, _2yr.prx, _2yo.tis, _2yp.fdt, _2yq.frq, _2yp.fdx, _2yq.fnm, _2yo.tvx, _2ys.tii, _2yo.prx, _2yr.tvx, _2yn.frq, _2ys.nrm, _2yo.nrm, _2ys.tis, _2yr.tvd, _2yn_9.del, _2yr.tvf, _2yr.fnm, _2ys.fdx, _2ys.fdt, _2yp.tvd] commit{dir=/vol/solr/data/index,segFN=segments_2cz,version=1224560226692,generation=3059,filenames=[_2yp.tvf, _2ys.prx, _2yo.fnm, _2yo.frq, _2yp.tvx, _2yn.fnm, _2yn_a.del, segments_2cz, _2yt.tvf, _2yp.tii, _2yt.tvd, _2yr.tis, _2ys.tvf, _2yp.frq, _2yn.tvf, _2yr.tii, _2yr.nrm, _2ys.tvd, _2yn.fdx, _2yp.tis, _2yn.prx, _2yn.tvd, _2yn.tii, _2yn.fdt, _2yq.prx, _2yo.tvd, _2yt.tvx, _2yp.fnm, _2yo.tvf, _2yr.fdt, _2ys.frq, _2yn.nrm, _2yr.fdx, _2yp.prx, _2yn.tis, _2yq.nrm, _2ys.tvx, _2ys.fnm, _2yo.fdt, _2yn.tvx, _2yq.tis, _2yo.fdx, _2yo.tii, _2yp.nrm, _2yq.tii, _2yr.frq, _2yt.nrm, _2yr.prx, _2yo.tis, _2yp.fdt, _2yq.frq, _2yt.fdx, _2yp.fdx, _2yt.fdt, _2yt.prx, _2yq.fnm, _2yo.tvx, _2ys.tii, _2yt.fnm, _2yo.prx, _2yr.tvx, _2yn.frq, _2ys.nrm, _2yo.nrm, _2ys.tis, _2yt.tis, _2yr.tvd, _2yr.tvf, _2yr.fnm, _2ys.fdx, _2ys.fdt, _2yt.tii, _2yt.frq, _2yp.tvd] Feb 24, 2009 5:05:53 PM org.apache.solr.core.SolrDeletionPolicy updateCommits INFO: last commit = 1224560226692 Feb 24, 2009 5:05:53 PM org.apache.solr.search.SolrIndexSearcher INFO: Opening searc...@25ddfb6a main Feb 24, 2009 5:05:53 PM org.apache.solr.update.DirectUpdateHandler2 commit INFO: end_commit_flush Feb 24, 2009 5:05:53 PM org.apache.solr.core.SolrDeletionPolicy onInit INFO: SolrDeletionPolicy.onInit: commits:num=1 commit{dir=/vol/solr/data/index,segFN=segments_2cz,version=1224560226692,generation=3059,filenames=[_2yp.tvf, _2ys.prx, _2yo.fnm, _2yo.frq, _2yp.tvx, _2yn.fnm, _2yn_a.del, segments_2cz, _2yt.tvf, _2yp.tii, _2yt.tvd, _2yr.tis, _2ys.tvf, _2yp.frq, _2yn.tvf, _2yr.tii, _2yr.nrm, _2ys.tvd, _2yn.fdx, _2yp.tis, _2yn.prx, _2yn.tvd, _2yn.tii, _2yn.fdt, _2yq.prx, _2yo.tvd, _2yt.tvx, _2yp.fnm, _2yo.tvf, _2yr.fdt, _2ys.frq, _2yn.nrm, _2yr.fdx, _2yp.prx, _2yn.tis, _2yq.nrm, _2ys.tvx, _2ys.fnm, _2yo.fdt, _2yn.tvx, _2yq.tis, _2yo.fdx, _2yo.tii, _2yp.nrm, _2yq.tii, _2yr.frq, _2yt.nrm, _2yr.prx, _2yo.tis, _2yp.fdt, _2yq.frq, _2yt.fdx, _2yp.fdx, _2yt.fdt, _2yt.prx, _2yq.fnm, _2yo.tvx, _2ys.tii, _2yt.fnm, _2yo.prx, _2yr.tvx, _2yn.frq, _2ys.nrm, _2yo.nrm, _2ys.tis, _2yt.tis, _2yr.tvd, _2yr.tvf, _2yr.fnm, _2ys.fdx, _2ys.fdt, _2yt.tii, _2yt.frq, _2yp.tvd] Feb 24, 2009 5:05:53 PM org.apache.solr.core.SolrDeletionPolicy updateCommits INFO: last commit = 1224560226692 Feb 24, 2009 5:05:53 PM 
org.apache.solr.common.SolrException log SEVERE: java.lang.NoSuchMethodError: org.apache.solr.common.util.ConcurrentLRUCache.getLatestAccessedItems(J)Ljava/util/Map; at org.apache.solr.search.FastLRUCache.getStatistics(FastLRUCache.java:244) at org.apache.solr.search.FastLRUCache.toString(FastLRUCache.java:260) at java.lang.String.valueOf(String.java:2827) at java.lang.StringBuilder.append(StringBuilder.java:115) at org.apache.solr.search.SolrIndexSearcher.warm(SolrIndexSearcher.java:1645) at org.apache.solr.core.SolrCore$3.call(SolrCore.java:1147) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:885) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:907) at java.lang.Thread.run(Thread.java:619) Feb 24, 2009 5:05:53 PM org.apache.solr.core.QuerySenderListener newSearcher INFO: QuerySenderListener sending requests to searc...@25ddfb6a main Feb 24, 2009 5:05:53 PM org.apache.solr.core.SolrCore execute INFO: [] webapp=null path=null params={start=0&q=solr&rows=10} hits=0 status=0 QTime=2 Feb 24, 2009 5:05:53 PM org.apache.solr.core.SolrCore execute INFO: [] webapp=null path=null params={start=0&q=rocks&rows=10} hits=0 status=0 QTime=0 Feb 24, 2009 5:05:53 PM org.apache.solr.core.SolrCore execute INFO: [] webapp=null path=null params={q=stat
Re: java.lang.NoSuchMethodError: org.apache.solr.common.util.ConcurrentLRUCache.getLatestAccessedItems(J)Ljava/util/Map;
Yep, did ant clean, made sure all the solr-libs were current, no more exception. Thanks ryan & mark On Tue, Feb 24, 2009 at 1:47 PM, Ryan McKinley wrote: > i hit that one too! > > try: ant clean > > > > On Feb 24, 2009, at 12:08 PM, Brian Whitman wrote: > > Seeing this in the logs of an otherwise working solr instance. Commits are >> done automatically I believe every 10m or 1 docs. This is solr trunk >> (last updated last night) Any ideas? >> >> >> >> INFO: [] webapp=/solr path=/select >> >> params={fl=thingID,n_thingname,score&q=n_thingname:"Cornell+Dupree"^5+net_thingname:"Cornell+Dupree"^4+ne_thingname:"Cornell+Dupree"^2&wt=standard&fq=s_type:artist&rows=10&version=2.2} >> hits=2 status=0 QTime=37 >> Feb 24, 2009 5:05:53 PM org.apache.solr.core.SolrDeletionPolicy onCommit >> INFO: SolrDeletionPolicy.onCommit: commits:num=2 >> >> commit{dir=/vol/solr/data/index,segFN=segments_2cy,version=1224560226691,generation=3058,filenames=[_2yp.tvf, >> _2ys.prx, _2yo.fnm, _2yo.frq, _2yp.tvx, _2yn.fnm, segments_2cy, _2yp.tii, >> _2yr.tis, _2ys.tvf, _2yp.frq, _2yn.tvf, _2yr.tii, _2yr.nrm, _2ys.tvd, >> _2yn.fdx, _2yp.tis, _2yn.prx, _2yn.tvd, _2yn.tii, _2yn.fdt, _2yq.prx, >> _2yo.tvd, _2yp.fnm, _2yo.tvf, _2yr.fdt, _2ys.frq, _2yn.nrm, _2yr.fdx, >> _2yp.prx, _2yn.tis, _2yq.nrm, _2ys.tvx, _2ys.fnm, _2yo.fdt, _2yn.tvx, >> _2yq.tis, _2yo.fdx, _2yo.tii, _2yp.nrm, _2yq.tii, _2yr.frq, _2yr.prx, >> _2yo.tis, _2yp.fdt, _2yq.frq, _2yp.fdx, _2yq.fnm, _2yo.tvx, _2ys.tii, >> _2yo.prx, _2yr.tvx, _2yn.frq, _2ys.nrm, _2yo.nrm, _2ys.tis, _2yr.tvd, >> _2yn_9.del, _2yr.tvf, _2yr.fnm, _2ys.fdx, _2ys.fdt, _2yp.tvd] >> >> commit{dir=/vol/solr/data/index,segFN=segments_2cz,version=1224560226692,generation=3059,filenames=[_2yp.tvf, >> _2ys.prx, _2yo.fnm, _2yo.frq, _2yp.tvx, _2yn.fnm, _2yn_a.del, >> segments_2cz, >> _2yt.tvf, _2yp.tii, _2yt.tvd, _2yr.tis, _2ys.tvf, _2yp.frq, _2yn.tvf, >> _2yr.tii, _2yr.nrm, _2ys.tvd, _2yn.fdx, _2yp.tis, _2yn.prx, _2yn.tvd, >> _2yn.tii, _2yn.fdt, _2yq.prx, _2yo.tvd, _2yt.tvx, _2yp.fnm, _2yo.tvf, >> _2yr.fdt, _2ys.frq, _2yn.nrm, _2yr.fdx, _2yp.prx, _2yn.tis, _2yq.nrm, >> _2ys.tvx, _2ys.fnm, _2yo.fdt, _2yn.tvx, _2yq.tis, _2yo.fdx, _2yo.tii, >> _2yp.nrm, _2yq.tii, _2yr.frq, _2yt.nrm, _2yr.prx, _2yo.tis, _2yp.fdt, >> _2yq.frq, _2yt.fdx, _2yp.fdx, _2yt.fdt, _2yt.prx, _2yq.fnm, _2yo.tvx, >> _2ys.tii, _2yt.fnm, _2yo.prx, _2yr.tvx, _2yn.frq, _2ys.nrm, _2yo.nrm, >> _2ys.tis, _2yt.tis, _2yr.tvd, _2yr.tvf, _2yr.fnm, _2ys.fdx, _2ys.fdt, >> _2yt.tii, _2yt.frq, _2yp.tvd] >> Feb 24, 2009 5:05:53 PM org.apache.solr.core.SolrDeletionPolicy >> updateCommits >> INFO: last commit = 1224560226692 >> Feb 24, 2009 5:05:53 PM org.apache.solr.search.SolrIndexSearcher >> INFO: Opening searc...@25ddfb6a main >> Feb 24, 2009 5:05:53 PM org.apache.solr.update.DirectUpdateHandler2 commit >> INFO: end_commit_flush >> Feb 24, 2009 5:05:53 PM org.apache.solr.core.SolrDeletionPolicy onInit >> INFO: SolrDeletionPolicy.onInit: commits:num=1 >> >> commit{dir=/vol/solr/data/index,segFN=segments_2cz,version=1224560226692,generation=3059,filenames=[_2yp.tvf, >> _2ys.prx, _2yo.fnm, _2yo.frq, _2yp.tvx, _2yn.fnm, _2yn_a.del, >> segments_2cz, >> _2yt.tvf, _2yp.tii, _2yt.tvd, _2yr.tis, _2ys.tvf, _2yp.frq, _2yn.tvf, >> _2yr.tii, _2yr.nrm, _2ys.tvd, _2yn.fdx, _2yp.tis, _2yn.prx, _2yn.tvd, >> _2yn.tii, _2yn.fdt, _2yq.prx, _2yo.tvd, _2yt.tvx, _2yp.fnm, _2yo.tvf, >> _2yr.fdt, _2ys.frq, _2yn.nrm, _2yr.fdx, _2yp.prx, _2yn.tis, _2yq.nrm, >> _2ys.tvx, _2ys.fnm, _2yo.fdt, _2yn.tvx, _2yq.tis, _2yo.fdx, _2yo.tii, >> _2yp.nrm, _2yq.tii, _2yr.frq, 
_2yt.nrm, _2yr.prx, _2yo.tis, _2yp.fdt, >> _2yq.frq, _2yt.fdx, _2yp.fdx, _2yt.fdt, _2yt.prx, _2yq.fnm, _2yo.tvx, >> _2ys.tii, _2yt.fnm, _2yo.prx, _2yr.tvx, _2yn.frq, _2ys.nrm, _2yo.nrm, >> _2ys.tis, _2yt.tis, _2yr.tvd, _2yr.tvf, _2yr.fnm, _2ys.fdx, _2ys.fdt, >> _2yt.tii, _2yt.frq, _2yp.tvd] >> Feb 24, 2009 5:05:53 PM org.apache.solr.core.SolrDeletionPolicy >> updateCommits >> INFO: last commit = 1224560226692 >> Feb 24, 2009 5:05:53 PM org.apache.solr.common.SolrException log >> SEVERE: java.lang.NoSuchMethodError: >> >> org.apache.solr.common.util.ConcurrentLRUCache.getLatestAccessedItems(J)Ljava/util/Map; >> at >> org.apache.solr.search.FastLRUCache.getStatistics(FastLRUCache.java:244) >> at org.apache.solr.search.FastLRUCache.toString(FastLRUCache.java:260) >> at java.lang.String.valueOf(String.java:2827) >> at java.lang.StringBuilder.append(StringBuilder.java:115) >> at >> org.apache.solr.
maxCodeLength in PhoneticFilterFactory
i have this version of solr running: Solr Implementation Version: 1.4-dev 747554M - bwhitman - 2009-02-24 16:37:49 and am trying to update a schema to support 8 code length metaphone instead of 4 via this (committed) issue: https://issues.apache.org/jira/browse/SOLR-813 So I change the schema to use <filter class="solr.PhoneticFilterFactory" encoder="DoubleMetaphone" inject="true" maxCodeLength="8"/> (knowing that I have to reindex). But when I do, queries fail with:
Error initializing DoubleMetaphone class org.apache.commons.codec.language.DoubleMetaphone
at org.apache.solr.analysis.PhoneticFilterFactory.init(PhoneticFilterFactory.java:90)
at org.apache.solr.schema.IndexSchema$6.init(IndexSchema.java:821)
at org.apache.solr.schema.IndexSchema$6.init(IndexSchema.java:817)
at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:149)
at org.apache.solr.schema.IndexSchema.readAnalyzer(IndexSchema.java:831)
at org.apache.solr.schema.IndexSchema.access$100(IndexSchema.java:58)
at org.apache.solr.schema.IndexSchema$1.create(IndexSchema.java:425)
at org.apache.solr.schema.IndexSchema$1.create(IndexSchema.java:410)
at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:141)
at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:452)
at org.apache.solr.schema.IndexSchema.init(IndexSchema.java:95)
at org.apache.solr.core.SolrCore.init(SolrCore.java:501)
at org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:121)
Re: maxCodeLength in PhoneticFilterFactory
yep, that did it. Thanks very much yonik. On Sat, Apr 11, 2009 at 10:27 PM, Yonik Seeley wrote: > OK, should hopefully be fixed in trunk. > > > -Yonik > http://www.lucidimagination.com > > > On Sat, Apr 11, 2009 at 9:16 PM, Yonik Seeley wrote: > > There's definitely a bug - I just reproduced it. Nothing obvious > > jumps out at me... and there's no error in the logs either (that's > > another bug it would seem). Could you open a JIRA issue for this? > > > > > > -Yonik > > http://www.lucidimagination.com > > > > > > > > On Fri, Apr 10, 2009 at 6:54 PM, Brian Whitman > wrote: > >> i have this version of solr running: > >> > >> Solr Implementation Version: 1.4-dev 747554M - bwhitman - 2009-02-24 > >> 16:37:49 > >> > >> and am trying to update a schema to support 8 code length metaphone > instead > >> of 4 via this (committed) issue: > >> > >> https://issues.apache.org/jira/browse/SOLR-813 > >> > >> So I change the schema to this (knowing that I have to reindex) > >> > >> encoder="DoubleMetaphone" > >> inject="true" maxCodeLength="8"/> > >> > >> But when I do queries fail with > >> > >> > Error_initializing_DoubleMetaphoneclass_orgapachecommonscodeclanguageDoubleMetaphone__at_orgapachesolranalysisPhoneticFilterFactoryinitPhoneticFilterFactoryjava90__at_orgapachesolrschemaIndexSchema$6initIndexSchemajava821__at_orgapachesolrschemaIndexSchema$6initIndexSchemajava817__at_orgapachesolrutilpluginAbstractPluginLoaderloadAbstractPluginLoaderjava149__at_orgapachesolrschemaIndexSchemareadAnalyzerIndexSchemajava831__at_orgapachesolrschemaIndexSchemaaccess$100IndexSchemajava58__at_orgapachesolrschemaIndexSchema$1createIndexSchemajava425__at_orgapachesolrschemaIndexSchema$1createIndexSchemajava410__at_orgapachesolrutilpluginAbstractPluginLoaderloadAbstractPluginLoaderjava141__at_orgapachesolrschemaIndexSchemareadSchemaIndexSchemajava452__at_orgapachesolrschemaIndexSchemainitIndexSchemajava95__at_orgapachesolrcoreSolrCoreinitSolrCorejava501__at_orgapachesolrcoreCoreContainer$InitializerinitializeCoreContainerjava121 > >> > > >
python response handler treats "unschema'd" fields differently
I have a solr index where we removed a field from the schema but it still had some documents with that field in it. Queries using the standard response handler had no problem, but the &wt=python handler would break on any query (with fl="*" or asking for that field directly) with: SolrHTTPException: HTTP code=400, reason=undefined_field_oldfield I "fixed" it by putting that field back in the schema. One related weirdness is that fl=oldfield would cause the exception but fl=othernonschemafield would not -- that is, it would only break on field names that were not in the schema but were in the documents. I know this is undefined-behavior territory, but it was still weird that the standard response writer does not do this: if you give a nonexistent field name to fl with wt=standard -- whether it is in documents or not -- it happily performs the query, just skipping the fields that are not in the schema.
index time boosting on multivalued fields
I can set the boost of a field or doc at index time using the boost attr in the update message, e.g. a boost on a field whose value is "pet". But that won't work for multivalued fields according to the RelevancyFAQ, e.g. a field holding both "pet" and "animal". ( I assume it applies the last boost parsed to all terms? ) Now, say I'd like to do index-time boosting of a multivalued field with each value having a unique boost. I could simply index the field multiple times ("pet", "pet", "animal"), each with its own boost attr. But is there a more exact way?
Re: Pagination of results and XSLT.
Has anyone tried to handle pagination of results using XSLT? I'm not really sure it is possible to do it in pure XSLT because all the response object gives us is a total document count -- paginating the results would involve more than what XSLT 1.0 could handle (I'll be very happy if someone proves me wrong :)). We do pagination in XSL 1.0 often -- direct from a solr response right to HTML/CSS/JS. You get both the start and total rows from the solr response, so I don't know what else you'd need. Here's a snip of a paging XSL in solr; the referred JS function pageResults just sets the &start= solr param.
Re: Pagination of results and XSLT.
On Jul 24, 2007, at 5:20 AM, Ard Schrijvers wrote: I have been using similar xsls like you describe below in the past, butI think after 3 years of using it I came to realize (500 internal server error) that it can lead to nasty errors when you have a recursive call like (though I am not sure wether it depends on your xslt processor, al least xalan has the problem) Yes -- in the public facing apps we have we limit the page counter to n + 10. Not sure if this is a Solr thing to fix, I've been told many times never have solr go right out to xsl to html, so conceivably you'd have a "real" web app in between that can easily do paging. -b
Re: boost field without dismax
Jul 24, 2007, at 9:42 AM, Alessandro Ferrucci wrote: is there a way to boost a field much like is done in dismax request handler? I've tried doing index-time boosting by providing the boost to the field as an attribute in the add doc but that did nothing to affect the score when I went to search. I do not want to use dismax since I also want wildcard patterns supported. What I'd like to do is provide boosting of a last-name field when I do a search. something not like: firstname:alessandro lastname:ferrucci^5 ?
Re: XML parsing error
On Jul 26, 2007, at 11:25 AM, Yonik Seeley wrote: OK, then perhaps it's a jetty bug with charset handling. I'm using resin btw Could you run the same query, but use the python output? wt=python Seems to be OK: {'responseHeader':{'status':0,'QTime':0,'params':{'start':'7','fl':'c ontent','q':'"Pez"~1','rows':'1','wt':'python'}},'response':{'num Found':5381,'start':7,'docs':[{'content':u'Akatsuki - PE\'Z \ufffd\uf ffd\ufffd \ufffd\ufffd\ufffd \ufffd\ufffd\u04b3 | \ufffd\ufffd\ufffd\ ufffd\ufffd\ufffd\u0333 | \ufffd\u057a\ufffd\ufffd\ufffd\ufffd\ufffd | \u0177\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd | >>> Akatsuki - PE\'Z \ ufffd\ufffd\ufffd \ufffd\ufffd \ufffd\ufffd\ufffd\ufffd\u05e8\ufffd\u fffd \ufffd|\ufffd \ufffd\ufffd\ufffd\ufffd\ufffd\u0438\ufffd |\ ufffd \ufffd\ufffd\ufffd\ufffd\u016e\ufffd\ufffd |\ufffd \ufffd\ u05b6\ufffd\ufffd\ufffd\ufffd |\ufffd \ufffd\u057a\ufffd\ufffd\u fffd\ufffd\ufffd |\ufffd \ufffd\u00b8\ufffd\ufffd\ufffd\ufffd\uf ffd |\ufffd t\ufffd\u04fa\ufffd\ufffd\ufffd \ufffd\ufffd \ufffd\ ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd |\ufffd \ufffd\ufffd\u 03f7\ufffd\ufffd\ufffd\ufffd |\ufffd \u04f0\ufffd\ufffd\ufffd\uf ffd\ufffd\ufffd |\ufffd \ufffd\u03fc\ufffd\ufffd\ufffd\ufffd\uff fd |\ufffd \u0177\ufffd>\ufffd\ufffd\ufffd |\ufffd \ufffd\u 03f8\ufffd\ufffd\ufffd\ufffd\ufffd |\ufffd \ufffd\ufffd\u0475\uf ffd\ufffd \u0177\ufffd>\ufffd\ufffd\ufffd > Various Artists[2005] >\u fffd\ufffd Now Jazz 3 - That\'s What I Call Jazz \ufffd\ufffd> Akatsu ki - PE\'Z \ufffd\ufffd\ufffd Akatsuki - PE\'Z \ufffd\ufffd\ufffd \uf ffd \ufffd\ufffd \ufffd \ufffd\ufffd \ufffd \ufffd\ufffd \ufffd \ufff d\ufffd\ufffd\ufffd\u05e8\ufffd\ufffd\ufffd\ufffd \ufffd\ufffdNow Jaz z 3 - That\'s What I Call Jazz\ufffd\ufffd \ufffd\ufffd\ufffd\ufffd\u fffd\u0773\ufffd\ufffd\ufffd\ufffd\u05a3\ufffd Various Artists[2005] Akatsuki - PE\'Z \ufffd\ufffd\ufffd\ufffd\ufffd\u0231\ufffd\ufffd \uf ffd\ufffd\ufffd\u01fb\u1fa1\ufffd\uccb9\ufffd\ufffd\ufffd\ufffd\u0231 \ufffd\u0138\ufffd\u02a3\ufffd \ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\u 04b5\ufffd\ufffd\u02f8\u00f8\ufffd\ufffd\ufffd\ufffd \ufffd\ufffd\uff fd\ufffd\ufffd\u04f8\u00f8\ufffd\ufffd>>> \ufffd\ufffd\ufffd\ufffd\uf ffd\ufffd\u04bb\ufffd\ufffd\ufffd\ufffd\ud8e1'}]}}
Re: XML parsing error
On Jul 26, 2007, at 11:10 AM, Yonik Seeley wrote: If the '<' truly got destroyed, it's a server (Solr or Jetty) bug. One possibility is that the '<' does exist, but due to a charset mismatch, it's being slurped into a multi-byte char. Just dumped it with curl and did a hexdump:
5a0 t ; & g t ; & g t ; 357 277 275 357 277
5b0 275 357 277 275 357 277 275 357 277 275 357 277 275 322 273 357
5c0 277 275 357 277 275 357 277 275 357 277 275 361 210 220 274 /
5d0 s t r > < / d o c > < / r e s u
5e0 l t > \n < / r e s p o n s e > \n
5f0
No < in the response.
XML parsing error
I ended up with this doc in solr: 0name="QTime">17name="fl">content"Pez"~1name="rows">1numFound="5381" start="7">Akatsuki - PE'Z ҳ | ̳ | պ | ŷ | >>> Akatsuki - PE'Z ר | и  | Ů  | ֶ  | պ  | ¸  | tӺ  | Ϸ  | Ӱ  | ϼ  | ŷ>  | ϸ  | ѵ ŷ> > Various Artists[2005] > Now Jazz 3 - That's What I Call Jazz > Akatsuki - PE'Z Akatsuki - PE'Z ר Now Jazz 3 - That's What I Call Jazz ݳ֣ Various Artists[2005] Akatsuki - PE'Z ȱ ǻᾡ첹ȱĸʣ ҵ˸ø Ӹø>>> һ/str> Note the missing < in the closing /str> at the end. Solrj throws this (on a larger query that includes this doc): Caused by: javax.xml.stream.XMLStreamException: ParseError at [row,col]:[3,20624] Message: The element type "str" must be terminated by the matching end-tag "</str>". And firefox can't render it either, throws an error. So any query that returns this doc will cause an error. Obviously there's some weird stuff in this doc, but is it a solr issue that the < got destroyed?
Re: XML parsing error
On Jul 26, 2007, at 11:49 AM, Yonik Seeley wrote: Could you try it with jetty to see if it's the servlet container? It should be simple to just copy the index directory into solr's example/solr/data directory. Yonik, sorry for my delay, but I did just try this in jetty -- it works (it doesn't throw an error, and the < in the closing /str> is intact). BTW, is the fact that the content is full of \uFFFD a problem? That looks to be the unicode replacement character, meaning that the real characters were lost somewhere along the line? Or is this some sort of private (non-standard) encoding? Certainly nothing I know about -- this particular index is from nutch crawls injected with solrj... so who knows. I'll look into what I can with Resin's issue. For now I'm going to delete that doc and see if I can find any others. -b
Re: Any clever ideas to inject into solr? Without http?
On Aug 9, 2007, at 11:12 AM, Kevin Holmes wrote: 2: Is there a way to inject into solr without using POST / curl / http? Check http://wiki.apache.org/solr/EmbeddedSolr There are examples in java and Cocoa that use the DirectSolrConnection class, querying and updating solr w/o a web server. It uses JNI in the Cocoa case. -b
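Roughly what the java side looks like -- this is from memory of the DirectSolrConnection class that wiki page describes, so treat the constructor and request() signatures as approximations and check the EmbeddedSolr examples for the real thing:

import org.apache.solr.servlet.DirectSolrConnection;

public class EmbeddedExample {
  public static void main(String[] args) throws Exception {
    // point it at a solr home and data dir on local disk -- no servlet container involved
    DirectSolrConnection solr = new DirectSolrConnection("/path/to/solr", "/path/to/solr/data");

    // updates and queries both go through request(pathAndParams, postBody)
    solr.request("/update", "<add><doc><field name=\"id\">1</field></doc></add>");
    solr.request("/update", "<commit/>");
    String xml = solr.request("/select?q=id:1", null);
    System.out.println(xml);
  }
}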
Re: Python Utilitys for Solr
On Aug 14, 2007, at 5:16 AM, Christian Klinger wrote: Hi i just play a bit with: http://svn.apache.org/repos/asf/lucene/solr/trunk/client/python/ solr.py Is it possible that this library is a bit out of date? If i try to get the example running. I got a parese error from the result. Maybe the response format form Solr has changed? Yes, check this JIRA for some issues: https://issues.apache.org/jira/browse/SOLR-216
Re: Indexing a URL
It is apparently attempting to parse &en=499af384a9ebd18f in the URL. I am not clear why it would do this as I specified indexed="false". I need to store this because that is how the user gets to the original article. The ampersand is an XML reserved character. You have to escape it (turn it into &amp;), whether you are indexing the data or not. Nothing to do w/ Solr, just xml files in general. Whatever you're using to render the xml should be able to handle this for you.
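A tiny sketch of the escaping that has to happen before the value goes into the add XML -- the URL here is made up around the &en=... fragment from the question, and commons-lang's StringEscapeUtils.escapeXml does the same job if you'd rather not roll your own:

public class XmlEscape {
  static String escapeXml(String s) {
    // '&' must be replaced first, or we'd re-escape the entities we just produced
    return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
  }

  public static void main(String[] args) {
    String url = "http://example.com/article.html?page=1&en=499af384a9ebd18f";
    System.out.println(escapeXml(url));
    // prints http://example.com/article.html?page=1&amp;en=499af384a9ebd18f
  }
}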
Re: DirectSolrConnection, write.lock and Too Many Open Files
On Sep 10, 2007, at 1:33 AM, Adrian Sutton wrote: After a while we start getting exceptions thrown because of a timeout in acquiring write.lock. It's quite possible that this occurs whenever two updates are attempted at the same time - is DirectSolrConnection intended to be thread safe? We use DirectSolrConnection via JNI in a couple of client apps that sometimes have 100s of thousands of new docs as fast as Solr will have them. It would crash relentlessly if I didn't force all calls to update or query to be on the same thread using objc's @synchronized and a message queue. I never narrowed down if this was a solr issue or a JNI one.
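For what it's worth, the java equivalent of that workaround (funnel every call through one thread) is just a single-threaded executor in front of the connection -- names here are placeholders, and this papers over the symptom rather than answering whether DirectSolrConnection is supposed to be thread safe:

import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class SerializedSolr {
  private final ExecutorService solrThread = Executors.newSingleThreadExecutor();

  // stand-in for whatever actually talks to the embedded solr connection
  private String rawRequest(String path, String body) throws Exception {
    return "ok";
  }

  public Future<String> update(final String body) {
    // every update (and query, if routed the same way) runs on the single solrThread
    return solrThread.submit(new Callable<String>() {
      public String call() throws Exception {
        return rawRequest("/update", body);
      }
    });
  }

  public void shutdown() {
    solrThread.shutdown();
  }
}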
Re: DirectSolrConnection, write.lock and Too Many Open Files
On Sep 10, 2007, at 5:00 PM, Mike Klaas wrote: On 10-Sep-07, at 1:50 PM, Adrian Sutton wrote: We use DirectSolrConnection via JNI in a couple of client apps that sometimes have 100s of thousands of new docs as fast as Solr will have them. It would crash relentlessly if I didn't force all calls to update or query to be on the same thread using objc's @synchronized and a message queue. I never narrowed down if this was a solr issue or a JNI one. That doesn't sound promising. I'll throw in synchronization around the update code and see what happens. That's doesn't seem good for performance though. Can Solr as a web app handle multiple updates at once or does it synchronize to avoid it? Solr can handle multiple simultaneous updates. The entire request processing is concurrent, as is the document analysis. Only the final write is synchronized (this includes lucene segment merging). Yes, i do want to disclaim that it's very likely my thread problems are an implementation detail w/ JNI, nothing to do w/ DSC. -b
Re: Term extraction
On Sep 19, 2007, at 9:58 PM, Pieter Berkel wrote: I'm currently looking at methods of term extraction and automatic keyword generation from indexed documents. We do it manually (not in solr, but we put the results in solr.) We do it the usual way -- chunk (into n-grams, named entities & noun phrases) and count (tf & df). It works well enough. There is a bevy of literature on the topic if you want to get "smart" -- but be warned, smart and fast are likely not very good friends. A lot depends on the provenance of your data -- is it clean text that uses a lot of domain-specific terms? Is it webtext?
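A toy java sketch of the chunk-and-count idea (word n-grams only -- the tokenization is deliberately naive, and named entities / noun phrases need a real NLP library):

import java.util.*;

public class TermCounter {
  // term frequency of word n-grams within one document
  public static Map<String, Integer> ngramCounts(String text, int n) {
    String[] words = text.toLowerCase().split("\\W+");
    Map<String, Integer> tf = new HashMap<String, Integer>();
    for (int i = 0; i + n <= words.length; i++) {
      StringBuilder sb = new StringBuilder();
      for (int j = 0; j < n; j++) {
        if (j > 0) sb.append(' ');
        sb.append(words[i + j]);
      }
      String gram = sb.toString();
      Integer c = tf.get(gram);
      tf.put(gram, c == null ? 1 : c + 1);
    }
    return tf;
  }

  // document frequency: how many docs contain each n-gram at least once
  public static Map<String, Integer> docFreq(List<String> docs, int n) {
    Map<String, Integer> df = new HashMap<String, Integer>();
    for (String doc : docs) {
      for (String gram : ngramCounts(doc, n).keySet()) {
        Integer c = df.get(gram);
        df.put(gram, c == null ? 1 : c + 1);
      }
    }
    return df;
  }
}

The usual move is then to rank candidates by tf weighted against df (tf*idf-style) and keep the top few per document.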
logging bad stuff separately in resin
We have a largish solr index that handles roughly 200K new docs a day and also roughly a million queries a day from other programs. It's hosted by resin. A couple of times in the past few weeks something "bad" has happened -- a lock error or file handle error, or maybe a required field wasn't being sent by the indexer for some reason. We want to know about this stuff asap without having to stare at the huge resin log all day. Is there a way to filter the log that goes into resin by "bad/fatal" stuff separately from the usual request logging? I would like to put the solr errors somewhere else so it's more maintainable.
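One idea I had (assuming Resin will pick up a standard java.util.logging configuration, and with made-up file paths) is a logging.properties that sends only WARNING and SEVERE records to their own file:

# keep the console handler as before, but add a file handler
handlers = java.util.logging.ConsoleHandler, java.util.logging.FileHandler

# the file handler only keeps the bad stuff
java.util.logging.FileHandler.level = WARNING
java.util.logging.FileHandler.pattern = /var/log/solr/solr-errors-%u.log
java.util.logging.FileHandler.formatter = java.util.logging.SimpleFormatter

# leave the root/console levels as they were
.level = INFO
java.util.logging.ConsoleHandler.level = INFO

Not sure whether Resin wants its own log config instead of plain JUL properties; pointers welcome.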
Re: Term extraction
On Sep 21, 2007, at 3:37 AM, Pieter Berkel wrote: Thanks for the response guys: Grant: I had a brief look at LingPipe, it looks quite interesting but I'm concerned that the licensing may prevent me from using it in my project. Does the opennlp license look good for you? It's LGPL. Not all the features of lingpipe but it works pretty well. https://sourceforge.net/projects/opennlp/
Re: Nutch with SOLR
Sami has a patch in there which used an older version of the solr client. With the current solr client in the SVN tree, his patch becomes much easier. Your job would be to upgrade the patch and mail it back to him so he can update his blog, or post it as a patch for inclusion in nutch/contrib (if sami is ok with that). If you have issues with how to use the solr client api, solr-user is here to help. I've done this. Apparently someone else has taken on the solr-nutch job and made it a bit more complicated (which is good for the long term) than sami's original patch -- https://issues.apache.org/jira/browse/NUTCH-442 But we still use a version of Sami's patch that works on both trunk nutch and trunk solr (solrj.) I sent my changes to sami when we did it, if you need it let me know... -b
Re: Nutch with SOLR
But we still use a version of Sami's patch that works on both trunk nutch and trunk solr (solrj.) I sent my changes to sami when we did it, if you need it let me know... I put my files up here: http://variogr.am/latest/?p=26 -b
Re: Nutch with SOLR
On Sep 26, 2007, at 4:04 AM, Doğacan Güney wrote: NUTCH-442 is one of the issues that I want to really see resolved. Unfortunately, I haven't received many (as in, none) comments, so I haven't made further progress on it. I am probably your target customer but to be honest all we care about is using Solr to index, not for any of the searching or summary stuff in Nutch. Is there a way to get Sami's SolrIndexer in nutch trunk (now that it's working OK) sooner than later and keep working on NUTCH-442 as well? Do they conflict? -b
searching for non-empty fields
I have a large index with a field for a URL. For some reason or another, sometimes a doc will get indexed with that field blank. This is fine but I want a query to return only the set URL fields... If I do a query like: q=URL:[* TO *] I get a lot of empty fields back, like: http://thing.com What can I query for to remove the empty fields?
Re: searching for non-empty fields
thanks Peter, Hoss and Ryan.. q=(URL:[* TO *] -URL:"") gives me a 400: Query parsing error: Cannot parse '(URL:[* TO *] -URL:"")': Lexical error at line 1, column 29. Encountered: "\"" (34), after : "\"" As for "adding something like:" -- I'll do this, but the problem here is I have to wait around for all these docs to re-index.. Your query will work if you make sure the URL field is omitted from the document at index time when the field is blank. The thing is, I thought I was omitting the field if it's blank. It's in a solrj instance that takes a Lucene Document, so maybe it's a solrj issue?

if (URL != null && URL.length() > 5)
  doc.add(new Field("URL", URL, Field.Store.YES, Field.Index.UN_TOKENIZED));

And then during indexing:

SimpleSolrDoc solrDoc = new SimpleSolrDoc();
solrDoc.setBoost(null, new Float(doc.getBoost()));
for (Enumeration e = doc.fields(); e.hasMoreElements();) {
  Field field = (Field) e.nextElement();
  if (!ignoreFields.contains(field.name())) {
    solrDoc.addField(field.name(), field.stringValue());
  }
}
try {
  solr.add(solrDoc);
...
small rsync index question
I'm not using snap* scripts but I quickly need to sync up two indexes on two machines. I am rsyncing the data dirs from A to B, which works fine. But how can I see the new index on B? For some reason sending a <commit/> is not refreshing the index, and I have to restart resin to see it. Is there something else I have to do?
Re: small rsync index question
On Sep 28, 2007, at 5:41 PM, Yonik Seeley wrote: It should... are there any errors in the logs? do you see the commit in the logs? Check the stats page to see info about when the current searcher was last opened too. ugh, nevermind.. was committing the wrong solr index... but thanks Yonik for the response. Luckily I can try to save face with a follow-on question :) I regularly see file has vanished: "/dir/solr/data/index/segments_3aut" when rsyncing, and when that happens I get an error on the rsync'd copy. The index I am rsyncing is large (50GB) and very active; it is constantly getting new docs and being searched. What can I do to preserve the index state while syncing?
dismax downweighting
I have a dismax query where I want to boost appearance of the query terms in certain fields but "downboost" appearance in others. The practical use is a field containing a lot of descriptive text and then a product name field where products might be named after a descriptive word. Consider an electric toothbrush called "The Fast And Thorough Toothbrush" -- if a user searches for fast toothbrush I'd like to down-weight that particular model's advantage. The name of the product might also be in the descriptive text. I tried -name description but solr didn't like that. Any better ideas? -- http://variogr.am/
Lock obtain timed out
We have a very active large index running a solr trunk from a few weeks ago that has been going down about once a week with this:

[11:08:17.149] No lockType configured for /home/bwhitman/XXX/XXX/discovered-solr/data/index assuming 'simple'
[11:08:17.150] org.apache.lucene.store.LockObtainFailedException: Lock obtain timed out: SimpleFSLock@/home/bwhitman/XXX/XXX/discovered-solr/data/index/lucene-5b07ebeb7d53a4ddc5a950a458af4acc-write.lock
[11:08:17.150] at org.apache.lucene.store.Lock.obtain(Lock.java:70)
[11:08:17.150] at org.apache.lucene.index.IndexWriter.init(IndexWriter.java:598)
[11:08:17.150] at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:410)
[11:08:17.150] at org.apache.solr.update.SolrIndexWriter.<init>(SolrIndexWriter.java:97)
[11:08:17.150] at org.apache.solr.update.UpdateHandler.createMainIndexWriter(UpdateHandler.java:121)
[11:08:17.150] at org.apache.solr.update.DirectUpdateHandler2.openWriter(DirectUpdateHandl

We have the following in our solrconfig re: locks: 1000 1 false What can I do to mitigate this problem? Removing the lock file and restarting resin solves it, but only temporarily.
Re: Lock obtain timed out
Thanks to ryan and matt.. so far so good. <unlockOnStartup>true</unlockOnStartup> <lockType>single</lockType>
grouped clause search in dismax
I have a dismax handler to match product names found in free text that looks like: explicit 0.01 name^5 nec_name^3 ne_name * 100 *:* name is type string, nec_name and ne_name are special types that do domain-specific stopword removal, latin1 munging etc, all are confirmed working fine on their own. Say I have a product called "SUPERBOT" and I want the text "I love SUPERBOT" to match the product SUPERBOT pretty high. In Lucene or Solr on its own you'd do something like: name:(I love SUPERBOT)^5 nec_name:(I love SUPERBOT)^3 ne_name:(I love SUPERBOT) which works fine. And so does: qt=thing&q=SUPERBOT But this doesn't work: qt=thing&q=(I%20love%20SUPERBOT) nor does qt=thing&q=I%20love%20SUPERBOT -- they get no results. How can we do "grouped clause" queries in dismax?
Re: How to get number of indexed documents?
does http://.../solr/admin/luke work for you? <int name="numDocs">601818</int> ... On Nov 1, 2007, at 10:39 PM, Papalagi Pakeha wrote: Hello, Is there any way to get an XML version of statistics like how many documents are indexed etc? I have found http://.../solr/admin/properties which is cool but doesn't give me the number of indexed documents. Thanks PaPa -- http://variogr.am/
"overlapping onDeckSearchers" message
I have a solr index that hasn't had many problems recently but I had the logs open and noticed this a lot during indexing: [16:23:34.086] PERFORMANCE WARNING: Overlapping onDeckSearchers=2 Not sure what it means, google didn't come back with much.
Re: start.jar -Djetty.port= not working
On Nov 7, 2007, at 10:00 AM, Mike Davies wrote: java -Djetty.port=8521 -jar start.jar However when I run this it seems to ignore the command and still start on the default port of 8983. Any suggestions? Are you using trunk solr or 1.2? I believe 1.2 still shipped with an older version of jetty that doesn't follow the new-style CL arguments. I just tried it on trunk and it worked fine for me. -- http://variogr.am/ [EMAIL PROTECTED]
Re: start.jar -Djetty.port= not working
On Nov 7, 2007, at 10:07 AM, Mike Davies wrote: I'm using 1.2, downloaded from http://apache.rediris.es/lucene/solr/ Where can i get the trunk version? svn, or http://people.apache.org/builds/lucene/solr/nightly/
Re: LSA Implementation
On Nov 26, 2007 6:58 AM, Grant Ingersoll <[EMAIL PROTECTED]> wrote: LSA (http://en.wikipedia.org/wiki/Latent_semantic_indexing) is patented, so it is not likely to happen unless the authors donate the patent to the ASF. -Grant There are many ways to catch a bird... LSA reduces to an SVD of the term-frequency (term-document) matrix. I have had limited success using JAMA's SVD, which is public domain. It's pure java; for something serious you'd want to wrap the hard bits in MKL/Accelerate. A more interesting solr-related question is where a very heavy process like SVD would operate. You'd want to run the 'training' half of it separately from indexing or querying. It'd almost be like an optimize. Is there any hook right now to give Solr a custom "command" and map it to a class in the solrconfig? The classify half of the SVD can happen at query or index time, very quickly; I imagine that could even be a custom field type.
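A rough sketch of the SVD step with JAMA -- the tiny term-document count matrix here is made up, and everything LSA actually needs after the decomposition (rank truncation, folding queries into the reduced space) is left out:

import Jama.Matrix;
import Jama.SingularValueDecomposition;

public class TinyLsa {
  public static void main(String[] args) {
    // rows = terms, columns = documents; the counts are invented
    double[][] counts = {
      {2, 0, 1},
      {0, 3, 1},
      {1, 1, 0},
      {0, 0, 2}
    };
    Matrix termDoc = new Matrix(counts);

    // the heavy "training" step
    SingularValueDecomposition svd = termDoc.svd();

    Matrix u = svd.getU();   // term -> concept
    Matrix s = svd.getS();   // singular values on the diagonal
    Matrix v = svd.getV();   // document -> concept

    // keeping only the top k singular values gives the reduced LSA space
    double[] singularValues = svd.getSingularValues();
    System.out.println("largest singular value: " + singularValues[0]);
    System.out.println("U is " + u.getRowDimension() + " x " + u.getColumnDimension());
    System.out.println("S is " + s.getRowDimension() + " x " + s.getColumnDimension());
    System.out.println("V is " + v.getRowDimension() + " x " + v.getColumnDimension());
  }
}

JAMA is dense and single-threaded, which is why the wrap-the-hard-bits-in-MKL/Accelerate caveat above matters for anything beyond toy sizes.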
Re: Solr and nutch, for reading a nutch index
On Nov 27, 2007, at 6:08 PM, bbrown wrote: I couldn't tell if this was asked before. But I want to perform a nutch crawl without any solr plugin which will simply write to some index directory. And then ideally I would like to use solr for searching? I am assuming this is possible? yes, this is quite possible. You need to have a solr schema that mimics the nutch schema, see sami's solrindexer for an example. Once you've got that schema, simply set the data dir in your solrconfig to the nutch index location and you'll be set.
Re: Solr and nutch, for reading a nutch index
On Nov 28, 2007, at 1:24 AM, Otis Gospodnetic wrote: I only glanced at Sami's post recently and what I think I saw there is something different. In other words, what Sami described is not a Solr instance pointing to a Nutch-built Lucene index, but rather an app that reads the appropriate Nutch/Hadoop files with fetched content and posts the read content to a Solr instance using a Solr java client like solrj. No? Yes, to be clear, all you need from Sami's thing is the schema file. Ignore everything else. Then point solr at the nutch index directory (it's just a lucene index.) Sami's entire thing is for indexing with solr instead of nutch, separate issue... Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Norberto Meijome <[EMAIL PROTECTED]> To: solr-user@lucene.apache.org Cc: [EMAIL PROTECTED] Sent: Tuesday, November 27, 2007 8:33:18 PM Subject: Re: Solr and nutch, for reading a nutch index On Tue, 27 Nov 2007 18:12:13 -0500 Brian Whitman <[EMAIL PROTECTED]> wrote: On Nov 27, 2007, at 6:08 PM, bbrown wrote: I couldn't tell if this was asked before. But I want to perform a nutch crawl without any solr plugin which will simply write to some index directory. And then ideally I would like to use solr for searching? I am assuming this is possible? yes, this is quite possible. You need to have a solr schema that mimics the nutch schema, see sami's solrindexer for an example. Once you've got that schema, simply set the data dir in your solrconfig to the nutch index location and you'll be set. I think you should keep an eye on the versions of Lucene library used by both Nutch + Solr - differences at this layer *could* make them incompatible - but I am not an expert... B _ {Beto|Norberto|Numard} Meijome "Against logic there is no armor like ignorance." Laurence J. Peter I speak for myself, not my employer. Contents may be hot. Slippery when wet. Reading disclaimers makes you go blind. Writing them is worse. You have been Warned. -- http://variogr.am/
can I do *thing* substring searches at all?
With a fieldtype of string, can I do any sort of *thing* search? I can do thing* but not *thing or *thing*. Workarounds?
Re: Re:
On Dec 2, 2007, at 5:43 PM, Ryan McKinley wrote: try \& rather than %26, or just put quotes around the whole url. I think curl does the right thing here.
Re: RE: Re:
On Dec 2, 2007, at 6:00 PM, Andrew Nagy wrote: On Dec 2, 2007, at 5:43 PM, Ryan McKinley wrote: try \& rather than %26, or just put quotes around the whole url. I think curl does the right thing here. I tried all the methods: converting & to %26, converting & to \& and encapsulating the url with quotes. All give the same error. curl http://localhost:8080/solr/update/csv?header=true\&seperator=%7C\&encapsulator=%22\&commit=true\&stream.file=import/homes.csv seperator -> separator ? Does that help?
Re: RE: Re:
On Dec 2, 2007, at 5:29 PM, Andrew Nagy wrote: Sorry for not explaining myself clearly: I have header=true as you can see from the curl command and there is a header line in the csv file. was this your actual curl request? curl http://localhost:8080/solr/update/csv?header=true%26seperator=%7C%26encapsulator=%22%26commit=true%26stream.file=import/homes.csv If so, you're escaping the ampersands... just keep them as plain &
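In other words, quoting the whole URL keeps the shell from eating the &s -- something like this, with the separator spelling fix from above also applied:

curl 'http://localhost:8080/solr/update/csv?header=true&separator=%7C&encapsulator=%22&commit=true&stream.file=import/homes.csv'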
Re: out of heap space, every day
For faceting and sorting, yes. For normal search, no. Interesting you mention that, because one of the other changes since last week, besides the index growing, is that we added a sort on an sint field to the queries. Is it reasonable that a sint sort would require over 2.5GB of heap on an 8M doc index? Is there any empirical data on how much RAM that will need?
out of heap space, every day
This may be more of a general java q than a solr one, but I'm a bit confused. We have a largish solr index, about 8M documents; the data dir is about 70G. We're getting about 500K new docs a week, as well as about 1 query/second. Recently (when we crossed about the 6M doc threshold) resin has been stopping with the following:

/usr/local/resin/log/stdout.log:[12:08:21.749] [28304] HTTP/1.1 500 Java heap space
/usr/local/resin/log/stdout.log:[12:08:21.749] java.lang.OutOfMemoryError: Java heap space

Only a restart of resin will get it going again, and then it'll crash again within 24 hours. It's a 4GB machine and we run it with args="-J-mx2500m -J-ms2000m". We can't really raise this any higher on the machine. Are there 'native' memory requirements for solr as a function of index size? Does a 70GB index require some minimum amount of wired RAM? Or is there some mis-configuration w/ resin or solr or my system? I don't really know Java well, but it seems strange that the VM can't page RAM out to disk or do something else besides stopping the server.
Re: out of heap space, every day
int[maxDoc()] + String[nTerms()] + size_of_all_unique_terms. Then double that to allow for a warming searcher. This is great, but can you help me parse this? Assume 8M docs and I'm sorting on an int field that is unix time (seconds since epoch). For the purposes of the experiment assume every doc was indexed at a unique time. So.. (int[8M] + String[8M] + 8M unique terms of 16 chars each) * 2 -- that's 384MB by my calculation. Is that right?
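Spelling that arithmetic out (my reading of the formula above, assuming 4-byte ints and references and roughly one byte per character):

  int[8M]                   = 8M * 4 bytes   ~  32 MB
  String[8M] (references)   = 8M * 4 bytes   ~  32 MB
  unique terms              = 8M * 16 chars  ~ 128 MB
  subtotal                                   ~ 192 MB
  * 2 for the warming searcher               ~ 384 MB

(Real String objects carry per-object overhead and 2-byte chars, so actual usage would be somewhat higher.)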
solrj - adding a SolrDocument (not a SolrInputDocument)
Writing a utility in java to do a copy from one solr index to another. I query for the documents I want to copy:

SolrQuery q = new SolrQuery();
q.setQuery("dogs");
QueryResponse rq = source_solrserver.query(q);
for (SolrDocument d : rq.getResults()) {
  // now I want to add these to a new server after modifying it slightly
  d.addField("newField", "somedata");
  dest_solrserver.add(d);
}

but that doesn't work -- add() wants a SolrInputDocument, not a SolrDocument. I can't cast or otherwise easily create the former from the latter. How could I do this sort of thing?
Re: solrj - adding a SolrDocument (not a SolrInputDocument)
On Dec 6, 2007, at 3:07 PM, Ryan McKinley wrote:

public static SolrInputDocument toSolrInputDocument( SolrDocument d ) {
  SolrInputDocument doc = new SolrInputDocument();
  for( String name : d.getFieldNames() ) {
    doc.addField( name, d.getFieldValue(name), 1.0f );
  }
  return doc;
}

thanks, that worked! agree it's useful to have in clientutils... though I'm not sure why there needs to be two separate classes to begin with.
Re: Solr and Flex
On Dec 13, 2007, at 10:42 AM, jenix wrote: I'm using Flex for the frontend interface and Solr on backend for the search engine. I'm new to Flex and Flash and thought someone might have some code integrating the two. We've done light stuff querying solr w/ actionscript. It is pretty simple: you form your query as a URL, fetch it, and then use AS's built-in XML parser to get whatever you need. Haven't tried posting documents.
Re: debugging slowness
On Dec 20, 2007, at 11:02 AM, Otis Gospodnetic wrote: Sounds like GC to me. That is, the JVM not having large enough heap. Run jconsole and you'll quickly see if this guess is correct or not (kill -QUIT is also your friend, believe it or not). We recently had somebody who had a nice little Solr spellchecker instance running, but after awhile it would "stop responding". We looked at the command-line used to invoke the servlet container and didn't see -Xmx. :) I'm giving resin args="-J-mx1m -J-ms5000m" (this is an amazon xtra-large instance w/ 16GB), and it's using it:

PID   USER  PR  NI  VIRT   RES   SHR   S  %CPU  %MEM  TIME+     COMMAND
2738  root  18   0  10.4g  9.9g  9756  S  231   66.0  48:07.66  java

After a restart yesterday and normal operation we haven't seen the problem creep back in yet. I might get my perl on to graph the query time and see if it's steadily increasing. Can't run jconsole, no X at the moment; if need be I'll install it though...
Re: Status 500 - ParseError at [row,col]:[1,1] Message Content is not allowed in Prolog
On Jan 8, 2008, at 10:58 AM, Kirk Beers wrote: curl http://localhost:8080/solr/update -H "Content-Type:text/xml" --data-binary '/<add overwritePending="true"><doc><field ...>0001</field><field ...>Title</field><field ...>It was the best of times it was the worst of times blah blah blah</field></doc></add>' Why the / after the first single quote?
Re: Status 500 - ParseError at [row,col]:[1,1] Message Content is not allowed in Prolog
I found that on the Wiki at http://wiki.apache.org/solr/UpdateXmlMessages#head-3dfbf90fbc69f168ab6f3389daf68571ad614bef under the title: Updating a Data Record via curl. I removed it and now have the following: <int name="status">0</int><int name="QTime">122</int> This response format is experimental. It is likely to change in the future. Seems to be an error in the wiki. I changed it. Commit and you should see your test document in queries.
index out of disk space, CorruptIndexException
We had an index run out of disk space. Queries work fine but commits return a 500: doc counts differ for segment _18lu: fieldsReader shows 104 but segmentInfo shows 212 org.apache.lucene.index.CorruptIndexException: doc counts differ for segment _18lu: fieldsReader shows 104 but segmentInfo shows 212 at org.apache.lucene.index.SegmentReader.initialize(SegmentReader.java:191) I've made room, restarted resin, and now solr won't start. No useful messages in the startup, just:

[21:01:49.105] Could not start SOLR. Check solr/home property
[21:01:49.105] java.lang.NullPointerException
[21:01:49.105] at org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:100)

What can I do from here?
Re: index out of disk space, CorruptIndexException
On Jan 14, 2008, at 4:08 PM, Ryan McKinley wrote: ug -- maybe someone else has better ideas, but you can try: http://svn.apache.org/repos/asf/lucene/java/trunk/src/java/org/apache/lucene/index/CheckIndex.java thanks for the tip, I did run that, but I stopped it 30 minutes in, as it was still on the first (out of 46) segments. The index is (was) 129GB. I just restored to an older index and made this ticket, https://issues.apache.org/jira/browse/SOLR-455
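For reference, CheckIndex has a main() you can point at an index directory from the command line, something like the following (the jar name and paths are placeholders for whatever Lucene jar your Solr build ships with):

java -cp lucene-core.jar org.apache.lucene.index.CheckIndex /path/to/solr/data/index

Run in this form it only reports what it finds; anything that rewrites segments should be tried on a copy of the index first.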
Re: Missing Content Stream
On Jan 15, 2008, at 1:50 PM, Ismail Siddiqui wrote: Hi Everyone, I am new to solr. I am trying to index xml using http post as follows Ismail, you seem to have a few spelling mistakes in your xml string. "fiehld, nadme" etc. (a) try fixing them, (b) try solrj instead, I agree w/ otis.
Re: best way to get number of documents in a Solr index
On Jan 15, 2008, at 3:47 PM, Maria Mosolova wrote: Hello, I am looking for the best way to get the number of documents in a Solr index. I'd like to do it from java code using solrj.

public int resultCount() {
  try {
    SolrQuery q = new SolrQuery("*:*");
    QueryResponse rq = solr.query(q);
    // getNumFound() returns a long
    return (int) rq.getResults().getNumFound();
  } catch (org.apache.solr.client.solrj.SolrServerException e) {
    System.err.println("Query problem");
  } catch (java.io.IOException e) {
    System.err.println("Other error");
  }
  return -1;
}
Re: Newbie with Java + typo
On Jan 21, 2008, at 11:13 AM, Daniel Andersson wrote: Well, no. "Immutable Page", and as far as I know (english not being my mother tongue), that means I can't edit the page You need to create an account first.
Re: SolrPhpClient with example jetty
$document->title = 'Some Title'; $document->content = 'Some content for this wonderful document. Blah blah blah.'; did you change the schema? There's no title or content field in the default example schema, though I believe solr would output a different error for that.
Re: Cache size clarification
On Jan 28, 2008, at 6:05 PM, Alex Benjamen wrote: I need some clarification on the cache size parameters in the solrconfig. Suppose I'm using these values: A lot of this is here: http://wiki.apache.org/solr/SolrCaching
Re: SEVERE: java.lang.OutOfMemoryError: Java heap space
On Jan 28, 2008, at 7:06 PM, Leonardo Santagada wrote: On 28/01/2008, at 20:44, Alex Benjamen wrote: I could allocate more physical memory, but I can't seem to increase the -Xmx option to 3800 I get an error : "Could not reserve enough space for object heap", even though I have more than 4Gb free. (We're running on Intel quad core 64bit) When I try strace I'm seeing mmap2 errors. I don't know much about java... but can you get any program to map more than 4gb of memory? I know windows has hard limits on how much memory you can map to one process and linux I think has some limit too. Of course it can be configured but maybe it is just a system configuration problem. We use 10GB of ram in one of our solr installs. You need to make sure your java is 64 bit though. Alex, what does your java -version show? Mine shows java version "1.6.0_03" Java(TM) SE Runtime Environment (build 1.6.0_03-b05) Java HotSpot(TM) 64-Bit Server VM (build 1.6.0_03-b05, mixed mode) And I run it with -mx1m -ms5000m
Re: SEVERE: java.lang.OutOfMemoryError: Java heap space
But on Intel, where I'm having the problem, it shows: java version "1.6.0_10-ea" Java(TM) SE Runtime Environment (build 1.6.0_10-ea-b10) Java HotSpot(TM) Server VM (build 11.0-b09, mixed mode) I can't seem to find the Intel 64 bit JDK binary, can you pls. send me the link? I was downloading from here: http://download.java.net/jdk6/ Install the AMD64 version. (Confusingly, "AMD64" is the name of the 64-bit spec that Intel's EM64T implements as well, so it's the right build for both AMD and Intel chips.) If that still doesn't work, is it possible that your machine/kernel is not set up to support 64 bit?
date math syntax
Is there a wiki page or more examples of the "date math" parsing other than this: http://www.mail-archive.com/solr-user@lucene.apache.org/msg01563.html out there somewhere? From an end user query perspective. -b
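To be concrete, I mean end-user range queries like these (the field name is just an example):

timestamp:[NOW/DAY-7DAYS TO NOW/DAY+1DAY]    (everything from the last week, rounded to day boundaries)
timestamp:[* TO NOW-1HOUR]                   (anything older than an hour ago)
timestamp:[NOW/YEAR TO NOW]                  (since the start of the current year)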
Re: Converting Solr results to java query/collection/map object
On Feb 19, 2008, at 3:08 PM, Paul Treszczotko wrote: Hi, I'm pretty new to SOLR and I'd like to ask your opinion on the best practice for converting XML results you get from SOLR into something that is a better fit to display on a webpage. I'm looking for performance and a relatively small footprint, perhaps the ability to paginate through the result set and display/process N results at a time. Any ideas? Any tutorials you can point me to? Thanks! Paul, this is what solrj is for.

SolrQuery q = new SolrQuery();
q.setRows(10);
q.setStart(40);
q.setQuery("type:dogs");
QueryResponse rq = solrServer.query(q);
for (SolrDocument d : rq.getResults()) {
  String dogname = (String) d.getFieldValue("name");
  ...
will hardlinks work across partitions?
Will the hardlink snapshot scheme work across physical disk partitions? Can I snapshoot to a different partition than the one holding the live solr index?
can I form a SolrQuery and query a SolrServer in a request handler?
I'm in a request handler: public void handleRequestBody(SolrQueryRequest req, SolrQueryResponse rsp) { And in here i want to form a SolrQuery based on the req, query the searcher and return results. But how do I get a SolrServer out of the req? I can get a SolrIndexSearcher but that doesn't seem to let me pass in a SolrQuery. I need a SolrQuery because I am forming a dismax query with a function query, etc...
Re: can I form a SolrQuery and query a SolrServer in a request handler?
Perhaps back up and see if we can do this a simpler way than a request handler... What is the query structure you are trying to generate? I have two dismax queries defined in a solrconfig. Something like qf = raw^4 name^1 for one and qf = tags^3 type^2 for the other. They work fine on their own, and we often use &bf=sortable^... to change the ordering. But we want to merge them. Result IDs that show up in both need to go higher, and with a url param we need to weight between the two. So I am making a /combined requesthandler that takes the query, the weights between the two, and the value of the bf=sortable boost. My handler: /combined?q=kittens&q1=0.5&q2=0.8&bfboost=2.0 Would query ?qt=q1&q=kittens&bf=2&fl=id, then ?qt=q2&q=kittens&bf=2&fl=id. The request handler would return the results of a term query with the (q1 returned IDs)^0.5 (q2 returned IDs)^0.8.
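For what it's worth, the merge logic on its own is small. A hedged solrj sketch of it (the q1/q2 handler names, the id field and the weighting come from the description above; doing this against the local searcher inside a real request handler is the part I'm still unsure about):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;
import java.util.LinkedHashMap;
import java.util.Map;

public class CombinedQuery {

  // run one of the dismax handlers and return the ids it matched, each tagged with a weight
  static Map<String, Float> queryIds(SolrServer solr, String qt, String q,
                                     String bf, float weight) throws Exception {
    SolrQuery query = new SolrQuery(q);
    query.set("qt", qt);
    query.set("bf", bf);
    query.setFields("id");
    query.setRows(100);
    Map<String, Float> ids = new LinkedHashMap<String, Float>();
    QueryResponse rsp = solr.query(query);
    for (SolrDocument d : rsp.getResults()) {
      ids.put((String) d.getFieldValue("id"), weight);
    }
    return ids;
  }

  // build the "(q1 ids)^w1 (q2 ids)^w2" term query, summing weights for ids present in both
  // (ids are assumed not to need query escaping)
  static String mergedIdQuery(Map<String, Float> a, Map<String, Float> b) {
    Map<String, Float> merged = new LinkedHashMap<String, Float>(a);
    for (Map.Entry<String, Float> e : b.entrySet()) {
      Float prev = merged.get(e.getKey());
      merged.put(e.getKey(), prev == null ? e.getValue() : prev + e.getValue());
    }
    StringBuilder sb = new StringBuilder();
    for (Map.Entry<String, Float> e : merged.entrySet()) {
      if (sb.length() > 0) sb.append(' ');
      sb.append("id:").append(e.getKey()).append('^').append(e.getValue());
    }
    return sb.toString();
  }
}

The string returned by mergedIdQuery() is what the /combined handler would then run as its final term query.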
Re: can I form a SolrQuery and query a SolrServer in a request handler?
Would query ?qt=q1&q=kittens&bf=2&fl=id, then ?qt=q2&q=kittens&bf=2&fl=id. Sorry, I meant: ?qt=q1&q=kittens&bf=sortable^2&fl=id, then ?qt=q2&q=kittens&bf=sortable^2&fl=id
invalid XML character
Once in a while we get this:

javax.xml.stream.XMLStreamException: ParseError at [row,col]:[4,790470]
[14:32:21.877] Message: An invalid XML character (Unicode: 0x6) was found in the element content of the document.
[14:32:21.877] at com.sun.org.apache.xerces.internal.impl.XMLStreamReaderImpl.next(XMLStreamReaderImpl.java:588)
[14:32:21.877] at org.apache.solr.handler.XmlUpdateRequestHandler.readDoc(XmlUpdateRequestHandler.java:318)
[14:32:21.877] at org.apache.solr.handler.XmlUpdateRequestHandler.processUpdate(XmlUpdateRequestHandler.java:195)
...

Our data comes from all sorts of places and although we've tried to be utf8 wherever we can, there are still cracks. I would much rather a document get added with a replacement character than have this error prevent the addition of 8K documents (as has happened here: this one character was in an 8K-document add, and only the documents before it were added.) Is there something I can do on the solr side to ignore/replace invalid characters?
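One thing we could do on our side before posting (just a sketch of client-side cleanup, not anything built into Solr) is replace characters that XML 1.0 doesn't allow:

public class XmlCleaner {
  // characters legal in XML 1.0: #x9, #xA, #xD, #x20-#xD7FF, #xE000-#xFFFD
  // (supplementary-plane surrogate pairs are ignored here and would also get replaced)
  public static String replaceInvalidXmlChars(String in, char replacement) {
    StringBuilder out = new StringBuilder(in.length());
    for (int i = 0; i < in.length(); i++) {
      char c = in.charAt(i);
      boolean ok = c == 0x9 || c == 0xA || c == 0xD
          || (c >= 0x20 && c <= 0xD7FF)
          || (c >= 0xE000 && c <= 0xFFFD);
      out.append(ok ? c : replacement);
    }
    return out.toString();
  }
}

Running every field value through replaceInvalidXmlChars(value, '\uFFFD') before building the add message would trade one lost character for a successful 8K-document add.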