solr error when querying.
Here is my query:

http://127.0.0.1:/solr/JOBS/select/??q=Apache&wt=xslt&tr=example.xslt

The response I get is the following. I have example.xslt in the /conf/xslt path. What is wrong here? Thanks!

HTTP ERROR 500

Problem accessing /solr/JOBS/select/. Reason:

    getTransformer fails in getContentType

java.lang.RuntimeException: getTransformer fails in getContentType
        at org.apache.solr.response.XSLTResponseWriter.getContentType(XSLTResponseWriter.java:72)
        at org.apache.solr.servlet.SolrDispatchFilter.writeResponse(SolrDispatchFilter.java:326)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:261)
        at com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:129)
        at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:59)
        at com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:122)
        at com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:110)
        at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
        at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
        at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
        at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
        at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
        at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
        at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
        at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
        at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
        at org.mortbay.jetty.Server.handle(Server.java:326)
        at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
        at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:928)
        at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549)
        at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
        at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
        at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:228)
        at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
Caused by: java.io.IOException: Unable to initialize Templates 'example.xslt'
        at org.apache.solr.util.xslt.TransformerProvider.getTemplates(TransformerProvider.java:117)
        at org.apache.solr.util.xslt.TransformerProvider.getTransformer(TransformerProvider.java:77)
        at org.apache.solr.response.XSLTResponseWriter.getTransformer(XSLTResponseWriter.java:130)
        at org.apache.solr.response.XSLTResponseWriter.getContentType(XSLTResponseWriter.java:69)
        ... 23 more
Caused by: javax.xml.transform.TransformerConfigurationException: Could not compile stylesheet
        at com.sun.org.apache.xalan.internal.xsltc.trax.TransformerFactoryImpl.newTemplates(Unknown Source)
        at org.apache.solr.util.xslt.TransformerProvider.getTemplates(TransformerProvider.java:110)
        ... 26 more
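For what it's worth, I assumed any well-formed XSLT 1.0 stylesheet would be accepted here -- something with roughly this shape (just a minimal sketch of what I expected to work, not my actual example.xslt):

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>
  <!-- copy each matching doc out of the standard Solr XML response -->
  <xsl:template match="/">
    <results>
      <xsl:for-each select="response/result/doc">
        <doc><xsl:copy-of select="*"/></doc>
      </xsl:for-each>
    </results>
  </xsl:template>
</xsl:stylesheet>

So I read the "Could not compile stylesheet" as Xalan rejecting something specific in my file rather than the setup itself, but I can't see what.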
Re: getTransformer error
Has anyone found a solution to the getTransformer error? I am getting the same error. Here is my output (the stack trace is identical to the one in the previous message):

Problem accessing /solr/JOBS/select/. Reason:

    getTransformer fails in getContentType

java.lang.RuntimeException: getTransformer fails in getContentType
        at org.apache.solr.response.XSLTResponseWriter.getContentType(XSLTResponseWriter.java:72)
        at org.apache.solr.servlet.SolrDispatchFilter.writeResponse(SolrDispatchFilter.java:326)
        ...
Caused by: java.io.IOException: Unable to initialize Templates 'example.xslt'
        at org.apache.solr.util.xslt.TransformerProvider.getTemplates(TransformerProvider.java:117)
        ...
Caused by: javax.xml.transform.TransformerConfigurationException: Could not compile stylesheet
        at com.sun.org.apache.xalan.internal.xsltc.trax.TransformerFactoryImpl.newTemplates(Unknown Source)
        at org.apache.solr.util.xslt.TransformerProvider.getTemplates(TransformerProvider.java:110)
        ... 26 more
Re: How to manage resource out of index?
hi li, i looked at doing something similar -- where we only index the text but retrieve search results / highlight from files -- we ended up giving up because of the amount of customisation required in solr -- mainly because we wanted the distributed search functionality in solr, which meant making sure the original file ended up on the same file system (i.e. machine) too!

we ended up just storing the main text field as well, even though there was quite a bit of text -- in the end solr/lucene can handle the index size fine and disk space is cheaper than man-hours to customise solr/lucene to work in this way! that was our conclusion anyway and it works fine -- we also have separate index / search server(s) so we don't care about merge time either -- and as i said above, we use the distributed search so don't tend to need to merge very large indexes anyway (when your system grows / you go into production you'll probably split the indexes too, to use solr's distributed search for the sake of query speed).

hope that helps,

bec :)

On 7 July 2010 14:07, Li Li wrote:
> I used to store full text into the lucene index. But I found it's very
> slow when merging indexes because when merging 2 segments it copies the
> fdt files into a new one. So I want to only index the full text. But when
> searching I need the full text for applications such as highlighting and
> viewing the full text. I can store the full text as url/text pairs in a
> database and load it to memory. And when I search in lucene (or solr),
> I retrieve the url of the doc first, then use the url to get the full text.
> But when they are stored separately, it is hard to manage. They may not be
> consistent with each other. Does lucene or solr provide any method to
> ease this problem? Or does anyone have some experience of this problem?
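p.s. storage-wise it really was as simple as flipping the main text field to stored -- something like this in schema.xml (just a sketch, our real field/type names differ):

<field name="text" type="text" indexed="true" stored="true"/>

highlighting and display then work straight off the stored value without any external file lookup.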
Re: Faceting unknown fields
hi,

> So, can I index and facet these fields, without describing them in my schema?
>
> I will first try with dynamic fields, but I'm not sure it's going to work.

we do all our facet fields this way, using dynamic fields with a plain string field type for the single/multivalued fields, and faceting works... but you will still need to know the specific name of the field(s) to use in the facet.field URL parameter (i.e. it works as long as your UI knows the field names!).

hope that helps
bec :)
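the dynamic field definitions in schema.xml are along these lines (a sketch -- the names/types here are made up, adjust to your schema):

<dynamicField name="*_facet" type="string" indexed="true" stored="false" multiValued="true"/>

then any incoming field ending in "_facet" is indexed as an untokenised string, and you facet on it with e.g. facet=true&facet.field=category_facet.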
Re: fq= "more then one" ?
hi,

you shouldn't have two fq parameters -- some solr params work like that, but fq doesn't

> http://172.20.1.33:8983/solr/select/?q=*:*&start=0&fq=EMAIL_HEADER_FROM:t...@mail.de&fq=EMAIL_HEADER_TO:t...@mail.de

you need to combine it into a single param, i.e. try putting it as an "OR" or "AND" if you're using the standard request handler:

fq=EMAIL_HEADER_FROM:t...@mail.de%20OR%20EMAIL_HEADER_TO:t...@mail.de

or put something like + if you're using dismax (i think, but i don't use it :) )

hope that helps,

bec :)
Re: fq= "more then one" ?
oops - i thought you couldn't put more than one - ignore my answer then :)

On 12 July 2010 17:20, Rebecca Watson wrote:
> hi,
>
> you shouldn't have two fq parameters -- some solr params work like
> that, but fq doesn't
>
>> http://172.20.1.33:8983/solr/select/?q=*:*&start=0&fq=EMAIL_HEADER_FROM:t...@mail.de&fq=EMAIL_HEADER_TO:t...@mail.de
>
> you need to combine it into a single param, i.e. try putting it as an
> "OR" or "AND" if you're using the standard request handler:
>
> fq=EMAIL_HEADER_FROM:t...@mail.de%20OR%20EMAIL_HEADER_TO:t...@mail.de
>
> or put something like + if you're using dismax (i think, but i don't use it :) )
>
> hope that helps,
>
> bec :)
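so the original request was fine as it stood -- solr accepts multiple fq parameters and applies each one as a separate filter (they are effectively ANDed together), e.g. something like:

http://localhost:8983/solr/select?q=*:*&fq=EMAIL_HEADER_FROM:test@mail.de&fq=EMAIL_HEADER_TO:test@mail.de

(addresses made up -- this just restricts results to docs matching both fields, and each fq gets its own filter-cache entry.)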
Re: Problem with Wildcard searches in Solr
Hi,

earlier this week i started messing with getting wildcard queries to be analysed. i've got some weird analysers doing stemming/lowercasing, and writing the same rules into a custom queryparser didn't seem logical given i just want the analysers to apply as they do at index time.

i came up with the hack below, which is just a modified version of the LuceneQParserPlugin, i.e. the solr default one which creates a SolrQueryParser query parser. in the SolrQueryParser I override the "getWildcardQuery" function so that I insert a call to my method -- "myWildcardQuery". the myWildcardQuery method converts the wildcard term into an analysed version which it returns (and at least lowercases the term if analysis fails for some reason). the myWildcardQuery method is just pulling in code from lucene's QueryParser.getFieldQuery -- so all this code is a magical giant cut-and-paste job right now (which you'll see when you look at the lucene/solr classes involved!)

you use this custom queryparser in the usual way, i.e. by registering the queryparser in the solrconfig.xml file: then call that queryparser in your request handler: ilexirQparser explicit 10 0 *,score 2.2 standard on spellcheck tvComponent

i enable the leading wildcard queries using the reversedwildcard filter as per my previous email, i.e. in the index-time analyser add in: (not at query time) -- then the lucene query parser picks up the use of this filter and allows leading wildcard queries.

of course, none of this is going to sort out trying to match against the query "co?mput?r" because you've probably stemmed "computer" to "comput" or something at index time -- but if you add in a copyfield to an extra field that isn't stemmed at query time, then query both the original + the non-stemmed field (boost accordingly -- i.e. you might want to boost the original non-stemmed field higher!) you'll get the right match then :)

i'd be interested to hear from lucene/solr contributors why wildcards aren't analysed in general anyway?

anyway hope that helps :)

bec

--

import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CachingTokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.reverse.ReverseStringFilter;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.WildcardQuery;
import org.apache.solr.analysis.ReversedWildcardFilterFactory;
import org.apache.solr.common.params.CommonParams;
import org.apache.solr.common.params.SolrParams;
import org.apache.solr.common.util.NamedList;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.search.LuceneQParserPlugin;
import org.apache.solr.search.QParser;
import org.apache.solr.search.QueryParsing;
import org.apache.solr.search.SolrQueryParser;

/**
 * modifies the code from LuceneQParserPlugin i.e. the default query parser
 * plugin used by solr.
 * @author bec
 */
public class ilexirQParserPlugin extends LuceneQParserPlugin {

  public static String NAME = "lucene";

  public void init(NamedList args) {
  }

  public QParser createParser(String qstr, SolrParams localParams,
      SolrParams params, SolrQueryRequest req) {
    return new ilexirQParser(qstr, localParams, params, req);
  }
}

class ilexirQParser extends QParser {
  String sortStr;
  SolrQueryParser lparser;

  public ilexirQParser(String qstr, SolrParams localParams,
      SolrParams params, SolrQueryRequest req) {
    super(qstr, localParams, params, req);
  }

  public Query parse() throws ParseException {
    String qstr = getString();

    String defaultField = getParam(CommonParams.DF);
    if (defaultField == null) {
      defaultField = getReq().getSchema().getDefaultSearchFieldName();
    }

    lparser = new SolrQueryParser(this, defaultField) {

      /**
       * adapted from lucene's QueryParser.getFieldQuery !!
       *
       * @param field
       * @param termStr
       */
      private String myWildcardQuery(String field, String termStr) {
        System.out.println("ILEXIR: ORIGINAL WILDCARD QUERY:" + termStr);
        // get the corresponding analyser - this one is
Re: Problem with Wildcard searches in Solr
hi,

sorry, realised i had a typo:

> of course, none of this is going to sort out trying to match against the query
> "co?mput?r" because you've probably stemmed "computer" to "comput" or something
> at index time -- but if you add in a copyfield to an extra field that
> isn't stemmed
> at query time, then query both the original + the non-stemmed field (boost
> accordingly -- i.e. you might want to boost the original non-stemmed field
> higher!) you'll get the right match then :)

that should read -- "but if you add in a copyfield to an extra field that isn't stemmed at index time"

bec :)
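in schema.xml terms that's something along these lines (a sketch -- the field names are made up, and "text_ws" here just stands in for whatever unstemmed analysis chain you want):

<field name="content"           type="text"    indexed="true" stored="true"/>
<field name="content_unstemmed" type="text_ws" indexed="true" stored="false"/>
<copyField source="content" dest="content_unstemmed"/>

then query both fields, e.g. content_unstemmed:co?mput?r OR content:co?mput?r, boosting whichever you want to win.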
Re: Locked Index files
shut down your solr server first... if it's not important! :)

On 13 July 2010 16:47, ZAROGKIKAS,GIORGOS wrote:
> I found it but I cannot delete it.
> Any suggestions???
>
> -----Original Message-----
> From: Yuval Feinstein [mailto:yuv...@answers.com]
> Sent: Tuesday, July 13, 2010 11:39 AM
> To: solr-user@lucene.apache.org
> Subject: RE: Locked Index files
>
> Hi Giorgos.
> Try looking for write.lock files and deleting them.
> Cheers,
> Yuval
>
> -----Original Message-----
> From: ZAROGKIKAS,GIORGOS [mailto:g.zarogki...@multirama.gr]
> Sent: Tuesday, July 13, 2010 11:28 AM
> To: solr-user@lucene.apache.org
> Subject: Locked Index files
>
> Hi
> My solr Index files are locked and I can’t index anything
> How can I remove the lock file?
> I can’t delete it
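(once solr is stopped, the stale lock file is just sitting in the index data directory -- something like this, the path depends on your setup:

rm /path/to/solr/data/index/write.lock

then restart solr and it should take the lock cleanly.)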
faceting over field not in all documents
hi, has anyone had experience with faceting over a field where the field is not present in all documents within the index? i'm hoping that -- faceting simply calculates+returns the counts for docs that have the field present while results may still contain documents that don't have the facet field (i.e. the field faceted on)? thanks for any help, i guess if no one has tried this i'll let you know :) bec :)
Re: faceting over field not in all documents
brilliant! thanks very much for your help :) On 13 July 2010 21:47, Jonathan Rochkind wrote: >> i'm hoping that -- faceting simply calculates+returns the counts for docs >> that >> have the field present while results may still contain documents that don't >> have the facet field (i.e. the field faceted on)? > > Yes, that's exactly what happens. You can use facet.missing to get a count > for documents with no value in the facet field too, if you want. > > Jonathan
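for anyone finding this in the archives later, the request ends up looking something like this (a sketch -- the field name is made up):

http://localhost:8983/solr/select?q=*:*&facet=true&facet.field=category&facet.missing=true

the facet counts only cover docs that actually have a value in "category", the result set itself is unaffected, and facet.missing=true adds one extra count for the matching docs with no value.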
Re: Error in building Solr-Cloud (ant example)
hi mark,

jayf and i are working together :) i tried to apply the patch to the trunk, but the ant tests failed...

i checked out the latest trunk:

svn checkout http://svn.apache.org/repos/asf/lucene/dev/trunk

patched it with SOLR-1873, and put the two JARs into trunk/solr/lib

"ant compile" in the top level trunk directory worked fine, but "ant test" had a few errors. the first error was:

[junit] Testsuite: org.apache.solr.cloud.BasicZkTest
[junit] Testcase: testBasic(org.apache.solr.cloud.BasicZkTest): Caused an ERROR
[junit] maxClauseCount must be >= 1
[junit] java.lang.IllegalArgumentException: maxClauseCount must be >= 1
[junit]   at org.apache.lucene.search.BooleanQuery.setMaxClauseCount(BooleanQuery.java:62)
[junit]   at org.apache.lucene.util.LuceneTestCase.tearDown(LuceneTestCase.java:131)
[junit]   at org.apache.solr.util.AbstractSolrTestCase.tearDown(AbstractSolrTestCase.java:182)
[junit]   at org.apache.solr.cloud.AbstractZkTestCase.tearDown(AbstractZkTestCase.java:135)
[junit]   at org.apache.lucene.util.LuceneTestCase.runBare(LuceneTestCase.java:277)
[junit]

after this, tests passed until there were a lot of errors with this output:

[junit] - Standard Error -
[junit] Jul 15, 2010 3:00:53 PM org.apache.solr.handler.SnapPuller fetchLatestIndex
[junit] SEVERE: Master at: http://localhost:TEST_PORT/solr/replication is not available. Index fetch failed. Exception: Invalid uri 'http://localhost:TEST_PORT/solr/replication': invalid port number

followed by a final message:

[junit] SEVERE: Master at: http://localhost:57146/solr/replication is not available. Index fetch failed. Exception: Connection refused

a few more tests passed... then at the end:

BUILD FAILED
/Users/iwatson/work/solr/trunk/build.xml:31: The following error occurred while executing this line:
/Users/iwatson/work/solr/trunk/solr/build.xml:395: The following error occurred while executing this line:
/Users/iwatson/work/solr/trunk/solr/build.xml:477: Tests failed!
The following error occurred while executing this line:
/Users/iwatson/work/solr/trunk/solr/build.xml:477: Tests failed!
The following error occurred while executing this line:
/Users/iwatson/work/solr/trunk/solr/build.xml:477: Tests failed!
The following error occurred while executing this line:
/Users/iwatson/work/solr/trunk/solr/build.xml:477: Tests failed!
The following error occurred while executing this line:
/Users/iwatson/work/solr/trunk/solr/build.xml:477: Tests failed!
The following error occurred while executing this line:
/Users/iwatson/work/solr/trunk/solr/build.xml:477: Tests failed!
The following error occurred while executing this line:
/Users/iwatson/work/solr/trunk/solr/build.xml:477: Tests failed!
The following error occurred while executing this line:
/Users/iwatson/work/solr/trunk/solr/build.xml:477: Tests failed!

are these errors currently expected (i.e. issues being sorted) or does it look like i'm doing something wrong/stupid!?

thanks for your help

bec :)

On 5 July 2010 04:34, Mark Miller wrote:
> Hey jayf -
>
> Offhand I'm not sure why you are having these issues - last I knew, a
> couple people had had success with the cloud branch. Cloud has moved on
> from that branch really though - we probably should update the wiki
> about that. More important than though, that I need to get Cloud
> committed to trunk!
>
> I've been saying it for a while, but I'm going to make a strong effort
> to wrap up the final unit test issue (apparently a testing issue, not
> cloud issue) and get this committed for further iterations.
> > The way to follow along with the latest work is to go to : > https://issues.apache.org/jira/browse/SOLR-1873 > > The latest patch there should apply to recent trunk. > > I've scheduled a bit of time to work on getting this committed this > week, fingers crossed. > > -- > - Mark > > http://www.lucidimagination.com > > On 7/4/10 3:37 PM, jayf wrote: >> >> Hi there, >> >> I'm having a trouble installing Solr Cloud. I checked out the project, but >> when compiling ("ant example" on OSX) I get compile a error (cannot find >> symbol - pasted below). >> >> I also get a bunch of warnings: >> [javac] Note: Some input files use or override a deprecated API. >> [javac] Note: Recompile with -Xlint:deprecation for details. >> I have tried both Java 1.5 and 1.6. >> >> >> Before I got to this point, I was having problems with the included >> ZooKeeper jar (java versioning issue) - so I had to download the source and >> build this. Now 'ant' gets a bit further, to the stage listed above. >> >> Any idea of the problem??? THANKS! >> >> [javac] Compiling 438 source files to >> /Volumes/newpart/solrcloud/cloud/build/solr >> [javac] >> /Volumes/newpart/solrcloud/cloud/src/java/org/apache/solr/cloud/ZkController.java:588: >> cannot find symbol >> [javac] symbol : method stringPropertyNames() >> [javac] location: class java.util.Properties
Re: Finding distinct unique IDs in documents returned by fq -- Urgent Help Req
hi, would faceting work? http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Faceted-Search-Solr if you have a field for rootId that is multivalued + facet on it -- you'll get value+count pairs back (top 100 i think by default) bec :) On 16 July 2010 16:07, Ninad Raut wrote: > Hi, > > I have a scenario in which I have to find count of distinct unique IDs > present in a field (rootId field in my case) for a particular query. > > I require this for pagination purpose. > > Is there a way in Solr to do something like this we do in SQL: > > select count(distinct(rootId)) > from table > where (the query part). > > > Regards, > Ninad R >
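a concrete sketch of what i mean (params guessed -- adjust host/query for your setup):

http://localhost:8983/solr/select?q=<your query>&rows=0&facet=true&facet.field=rootId&facet.limit=-1&facet.mincount=1

the number of value/count pairs that comes back under facet_counts is your count of distinct rootId values matching the query (facet.limit=-1 lifts the default cap of 100, facet.mincount=1 drops zero-count values).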
Indexing Hanging during GC?
Hi,

When indexing large amounts of data I hit a problem whereby Solr becomes unresponsive and doesn't recover (even when left overnight!). I think i've hit some GC problems / tuning is required of GC, and I wanted to know if anyone has ever hit this problem. I can replicate this error (albeit taking longer to do so) using Solr/Lucene analysers only, so I thought other people might have hit this issue before over large data sets.

Background on my problem follows -- but I guess my main question is -- can Solr become so overwhelmed by update posts that it becomes completely unresponsive??

Right now I think the problem is that the java GC is hanging, but I've been working on this all week and it took a while to figure out it might be GC-based / wasn't a direct result of my custom analysers, so i'd appreciate any advice anyone has about indexing large document collections.

I also have a second question for those in the know -- do we have a chance of indexing/searching over our large dataset with what little hardware we already have available??

thanks in advance :)

bec

a bit of background:
---

I've got a large collection of articles we want to index/search over -- about 180k in total. Each article has say 500-1000 sentences and each sentence has about 15 fields, many of which are multi-valued, and we store most fields as well for display/highlighting purposes. So I'd guess over 100 million index documents.

In our small test collection of 700 articles this results in a single index of about 13GB.

Our pipeline processes PDF files through to Solr native xml which we call "index.xml" files, i.e. in ... format ready to post straight to Solr's update handler. We create the index.xml files as we pull in information from a few sources, and creation of these files from their original PDF form is farmed out across a grid and is quite time-consuming, so we distribute this process rather than creating index.xml files on the fly...

We do a lot of linguistic processing, and enabling search over our resulting terms requires analysers that split terms / join terms together, i.e. custom analysers that perform string operations and are quite time-consuming / have large overhead compared to most analysers (they take approx 20-30% more time and use twice as many short-lived objects as the "text" field type).

Right now i'm working on my new iMac:
quad-core 2.8 GHz Intel Core i7
16 GB 1067 MHz DDR3 RAM
2TB hard-drive (about half free)
Version 10.6.4 OSX

Production environment:
2 linux boxes each with:
8-core Intel(R) Xeon(R) CPU @ 2.00GHz
16GB RAM

I use java 1.6 and Solr version 1.4.1 with multi-cores (a single core right now).

I setup Solr to use autocommit -- both the maxDocs and maxTime triggers (see the sketch below) -- as we'll have several document collections / post to Solr from different data sets: 50 90

I also have false 1024 10

-

*** First question:

Has anyone else found that Solr hangs/becomes unresponsive after too many documents are indexed at once, i.e. Solr can't keep up with the post rate?

I've got LCF crawling my local test set (file system connection required only) and posting documents to Solr using 6GB of RAM. As I said above, these documents are in native Solr XML format with one file per article, so each contains all the sentence-level documents for the article. With LCF I post about 2.5/3k articles (files) per hour -- so about 2.5k*500/3600 = 350 docs per second post-rate -- is this normal/expected??
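(for reference, the autocommit block in solrconfig.xml has this shape -- the values below are placeholders, not the ones i'm actually running with:

<autoCommit>
  <maxDocs>10000</maxDocs>   <!-- commit after this many added docs -->
  <maxTime>60000</maxTime>   <!-- ...or after this many milliseconds -->
</autoCommit>
)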
Eventually, after about 3000 files (an hour or so) Solr starts to hang / becomes unresponsive, and with Jconsole / GC logging I can see that the Old-Gen space is about 90% full. The following is the end of the solr log file -- where you can see GC has been called:

--
3012.290: [GC Before GC:
Statistics for BinaryTreeDictionary:
Total Free Space: 53349392
Max Chunk Size: 3200168
Number of Blocks: 66
Av. Block Size: 808324
Tree Height: 13
Before GC:
Statistics for BinaryTreeDictionary:
Total Free Space: 0
Max Chunk Size: 0
Number of Blocks: 0
Tree Height: 0
3012.290: [ParNew (promotion failed): 143071K->142663K(153344K), 0.0769802 secs]3012.367: [CMS
--

I can replicate this with Solr using "text" field types in place of those that use my custom analysers -- whereby Solr takes longer to become unresponsive (about 3 hours / 13k docs) but there is the same kind of GC message at the end of the log file / Jconsole shows that the Old-Gen space was almost full so was due for a collection sweep.

I don't use any special GC settings but found an article here:

http://www.lucidimagination.com/blog/2009/09/19/java-garbage-collection-boot-camp-draft/

that suggests using particular GC settings for Solr -- I will try these but thought someone else could suggest anoth
Re: Analysing SOLR logfiles
we've just started using awstats - as suggested by the solr 1.4 book. its open source!: http://awstats.sourceforge.net/ On 12 August 2010 18:18, Jay Flattery wrote: > Thanks - splunk looks overkill. > We're extremely small scale - were hoping for something open source :-) > > > - Original Message > From: Jan Høydahl / Cominvent > To: solr-user@lucene.apache.org > Sent: Wed, August 11, 2010 11:14:37 PM > Subject: Re: Analysing SOLR logfiles > > Have a look at www.splunk.com > > -- > Jan Høydahl, search solution architect > Cominvent AS - www.cominvent.com > Training in Europe - www.solrtraining.com > > On 11. aug. 2010, at 19.34, Jay Flattery wrote: > >> Hi there, >> >> >> Just wondering what tools people use to analyse SOLR log files. >> >> We're looking to do things like extracting common queries, calculating >>averaging >> >> >> Qtime and hits, returning particularly slow/expensive queries, etc. >> >> Would prefer not to code something (completely) from scratch. >> >> Thanks! >> >> >> >> > > > > >
Re: Indexing Hanging during GC?
sorry -- i used the term "documents" too loosely! 180k scientific articles with between 500-1000 sentences each and we index sentence-level index documents so i'm guessing about 100 million lucene index documents in total. an update on my progress: i used GC settings of: -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:+CMSPermGenSweepingEnabled -XX:NewSize=2g -XX:MaxNewSize=2g -XX:SurvivorRatio=8 -XX:CMSInitiatingOccupancyFraction=70 which allowed the indexing process to run to 11.5k articles and for about 2hours before I got the same kind of hanging/unresponsive Solr with this as the tail of the solr logs: Before GC: Statistics for BinaryTreeDictionary: Total Free Space: 2416734 Max Chunk Size: 2412032 Number of Blocks: 3 Av. Block Size: 805578 Tree Height: 3 5980.480: [ParNew: 1887488K->1887488K(1887488K), 0.193 secs]5980.480: [CMS I also saw (in jconsole) that the number of threads rose from the steady 32 used for the 2 hours to 72 before Solr finally became unresponsive... i've got the following GC info params switched on (as many as i could find!): -XX:+PrintClassHistogram -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCApplicationConcurrentTime -XX:+PrintGCApplicationStoppedTime -XX:PrintFLSStatistics=1 with 11.5k docs in about 2 hours this was 11.5k * 500 / 2 = 2.875 million fairly small docs per hour!! this produced an index of about 40GB to give you an idea of index size... because i've already got the documents in solr native xml format i.e. one file per article each with ... i.e. posting each set of sentence docs per article in every LCF file post... this means that LCF can throw documents at Solr very fast and i think i'm breaking it GC-wise. i'm going to try adding in System.gc() calls to see if this runs ok (albeit slower)... otherwise i'm pretty much at a loss as to what could be causing this GC issue/ solr hanging if it's not a GC issue... thanks :) bec On 12 August 2010 21:42, dc tech wrote: > I am a little confused - how did 180k documents become 100m index documents? > We use have over 20 indices (for different content sets), one with 5m > documents (about a couple of pages each) and another with 100k+ docs. > We can index the 5m collection in a couple of days (limitation is in > the source) which is 100k documents an hour without breaking a sweat. > > > > On 8/12/10, Rebecca Watson wrote: >> Hi, >> >> When indexing large amounts of data I hit a problem whereby Solr >> becomes unresponsive >> and doesn't recover (even when left overnight!). I think i've hit some >> GC problems/tuning >> is required of GC and I wanted to know if anyone has ever hit this problem. >> I can replicate this error (albeit taking longer to do so) using >> Solr/Lucene analysers >> only so I thought other people might have hit this issue before over >> large data sets >> >> Background on my problem follows -- but I guess my main question is -- can >> Solr >> become so overwhelmed by update posts that it becomes completely >> unresponsive?? >> >> Right now I think the problem is that the java GC is hanging but I've >> been working >> on this all week and it took a while to figure out it might be >> GC-based / wasn't a >> direct result of my custom analysers so i'd appreciate any advice anyone has >> about indexing large document collections. >> >> I also have a second questions for those in the know -- do we have a chance >> of indexing/searching over our large dataset with what little hardware >> we already >> have available?? 
>> >> thanks in advance :) >> >> bec >> >> a bit of background: --- >> >> I've got a large collection of articles we want to index/search over >> -- about 180k >> in total. Each article has say 500-1000 sentences and each sentence has >> about >> 15 fields, many of which are multi-valued and we store most fields as well >> for >> display/highlighting purposes. So I'd guess over 100 million index >> documents. >> >> In our small test collection of 700 articles this results in a single index >> of >> about 13GB. >> >> Our pipeline processes PDF files through to Solr native xml which we call >> "index.xml" files i.e. in ... format ready to post straight to >> Solr's >> update handler. >> >> We create the index.xml files as we pull in information from >> a few sources and creation of these files from their original PDF form is >> farmed out across a grid and is quite time-consuming so we distribute this >>
Re: Indexing Hanging during GC?
hi, > 1) I assume you are doing batching interspersed with commits as each file I crawl for are article-level each contains all the sentences for the article so they are naturally batched into the about 500 documents per post in LCF. I use auto-commit in Solr: 50 90 > 2) Why do you need sentence level Lucene docs? that's an application specific need due to linguistic info needed on a per-sentence basis. > 3) Are your custom handlers/parsers a part of SOLR jvm? Would not be > surprised if you a memory/connection leak their (or it is not > releasing some resource explicitly) I thought this could be the case too -- but if I replace the use of my custom analysers and specify my fields are of type "text" instead (from standard solrconfig.xml i.e. using solr-based analysers) then I get this kind of hanging too -- at least it did when I didn't have any explicit GC settings... it does take longer to replicate as my analysers/field types are more complex than "text" field type. i will try it again with the different GC settings tomorrow and post the results. > In general, we have NEVER had a problem in loading Solr. i'm not sure if we would either if we posted as we created the index.xml format... but because we post 500+ documents a time (one article file per LCF post) and LCF can post these files quickly i'm not sure if I need to try and slow down the post rate!? thanks for your replies, bec :) > On 8/12/10, Rebecca Watson wrote: >> sorry -- i used the term "documents" too loosely! >> >> 180k scientific articles with between 500-1000 sentences each >> and we index sentence-level index documents >> so i'm guessing about 100 million lucene index documents in total. >> >> an update on my progress: >> >> i used GC settings of: >> -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:+CMSPermGenSweepingEnabled >> -XX:NewSize=2g -XX:MaxNewSize=2g -XX:SurvivorRatio=8 >> -XX:CMSInitiatingOccupancyFraction=70 >> >> which allowed the indexing process to run to 11.5k articles and >> for about 2hours before I got the same kind of hanging/unresponsive Solr >> with >> this as the tail of the solr logs: >> >> Before GC: >> Statistics for BinaryTreeDictionary: >> >> Total Free Space: 2416734 >> Max Chunk Size: 2412032 >> Number of Blocks: 3 >> Av. Block Size: 805578 >> Tree Height: 3 >> 5980.480: [ParNew: 1887488K->1887488K(1887488K), 0.193 secs]5980.480: >> [CMS >> >> I also saw (in jconsole) that the number of threads rose from the >> steady 32 used for the >> 2 hours to 72 before Solr finally became unresponsive... >> >> i've got the following GC info params switched on (as many as i could >> find!): >> -XX:+PrintClassHistogram -XX:+PrintGCDetails -XX:+PrintGCTimeStamps >> -XX:+PrintGCApplicationConcurrentTime >> -XX:+PrintGCApplicationStoppedTime >> -XX:PrintFLSStatistics=1 >> >> with 11.5k docs in about 2 hours this was 11.5k * 500 / 2 = 2.875 >> million fairly small >> docs per hour!! this produced an index of about 40GB to give you an >> idea of index >> size... >> >> because i've already got the documents in solr native xml format >> i.e. one file per article each with ... >> i.e. posting each set of sentence docs per article in every LCF file post... >> this means that LCF can throw documents at Solr very fast and i think >> i'm >> breaking it GC-wise. >> >> i'm going to try adding in System.gc() calls to see if this runs ok >> (albeit slower)... >> otherwise i'm pretty much at a loss as to what could be causing this GC >> issue/ >> solr hanging if it's not a GC issue... 
>> >> thanks :) >> >> bec >> >> On 12 August 2010 21:42, dc tech wrote: >>> I am a little confused - how did 180k documents become 100m index >>> documents? >>> We use have over 20 indices (for different content sets), one with 5m >>> documents (about a couple of pages each) and another with 100k+ docs. >>> We can index the 5m collection in a couple of days (limitation is in >>> the source) which is 100k documents an hour without breaking a sweat. >>> >>> >>> >>> On 8/12/10, Rebecca Watson wrote: >>>> Hi, >>>> >>>> When indexing large amounts of data I hit a problem whereby Solr >>>> becomes unresponsive >>>> and doesn't recover (even when left overnight!). I think i've hit some >>>> GC pr
Re: Indexing Hanging during GC?
hi,

ok I have a theory about the cause of my problem -- java's GC failure I think is due to a solr memory leak caused by overlapping auto-commit calls -- does that sound plausible?? (ducking for cover now...)

I watched the log files and noticed that when the threads start to increase (from a stable 32 or so up to 72 before hanging!) there are two commit calls too close to each other + it looked like the index is in the process of merging at the time of the first commit call -- i.e. first was a long commit call with a merge required, then before that one finished another commit call was issued.

i think this was due to the autocommit settings I had: 50 90 and eventually, it seems, these two different auto-commit settings would coincide!! a few times this seems to happen and not cause a problem -- but I think two eventually coincide where the first one is doing something heavy-duty like a merge over large index segments and so the system spirals downwards.

combined with the fact I was posting to Solr as fast as possible (LCF was waiting for Solr) --> i think this causes java to keel over and die.

Two things were noticeable in Jconsole -

1) lots of threads were spawned with the two commit calls - the thread spawning started after the first commit call, making me think it was a commit requiring an index merge... whereby threads overall went from the stable 32 used during indexing for the 2 hours prior, to 72 or so within 15 minutes after the two commit calls were made...

2) both Old-gen/survivor heaps were almost totally full! so i think a memory leak is happening with overlapping commit calls + heavy duty lucene index processing behind solr (like an index merge!?)

So if the overlapping commit call (second commit called before the first one finished) caused a memory leak, and with old-gen/survivor heaps full at that point, Solr became unresponsive and never recovered.

is this expected when you use both autocommit settings / if concurrent commit calls are issued to Solr?

This explains why it was happening even without the use of my custom analysers ("text" field type used in place of mine) but took longer to happen --> my analysers are more expensive CPU/RAM-wise, so the overlapping commit calls were less likely to be forgiven as my system was already using a lot of RAM...

Also, I played with the GC settings a bit where I could find settings that helped to postpone this issue, as they were more forgiving of the increased RAM usage during overlapping commit calls (GC settings with increased eden heap space).

Solr was hanging after about 14k files (each one an article with a set of docs that are each sentences in the article) with a total of about 7 million index documents.

If i switch off both auto-commit settings I can get through my smallish 20k file set (10 million index docs) in 4 hours.

I'm trying to run now on 100k articles (50 million index docs within 100k files) where I use LCF to crawl/post each file to Solr, so i'll email an update about this.

if this works ok i'm then going to try using only one auto-commit setting rather than two and see if this works ok.

thanks :)

bec

On 13 August 2010 00:24, Rebecca Watson wrote:
> hi,
>
>> 1) I assume you are doing batching interspersed with commits
>
> as each file I crawl for are article-level each contains all the
> sentences for the article so they are naturally batched into the about
> 500 documents per post in LCF.
>
> I use auto-commit in Solr:
>
> 50
> 90
>
>> 2) Why do you need sentence level Lucene docs?
> > that's an application specific need due to linguistic info needed on a > per-sentence > basis. > >> 3) Are your custom handlers/parsers a part of SOLR jvm? Would not be >> surprised if you a memory/connection leak their (or it is not >> releasing some resource explicitly) > > I thought this could be the case too -- but if I replace the use of my custom > analysers and specify my fields are of type "text" instead (from standard > solrconfig.xml i.e. using solr-based analysers) then I get this kind of > hanging > too -- at least it did when I didn't have any explicit GC settings... it does > take longer to replicate as my analysers/field types are more complex than > "text" field type. > > i will try it again with the different GC settings tomorrow and post > the results. > >> In general, we have NEVER had a problem in loading Solr. > > i'm not sure if we would either if we posted as we created the > index.xml format... > but because we post 500+ documents a time (one article file per LCF post) and > LCF can post these files quickly i'm not sure if I need to try and slow down > the post rate!? > > thanks for your replies, > > bec :) > >> On 8/12/10, Rebe
tii RAM usage on startup
hi,

I am running solr 1.4.1 and java 1.6 with a 6GB heap and the following GC settings:

gc_args="-XX:+UseConcMarkSweepGC -XX:+CMSClassUnloadingEnabled -XX:NewSize=2g -XX:MaxNewSize=2g -XX:CMSInitiatingOccupancyFraction=60"

So 6GB total heap and 2GB allocated to eden space. I have caching, autocommit and auto-warming commented out of solrconfig.xml.

After I index 500k docs and call commit/optimize (via URL after indexing has completed) my RAM usage is only about 1.5GB, but then if I stop and restart my Solr server over the same data the RAM immediately jumps to about 4GB and I can't understand why there is a difference here? As this is close to the old gen limit -- i quickly find that Solr becomes unresponsive.

The following shows that tii files are being loaded from 26MB files to consume over 200MB in RAM when I restart the server. is this expected?

thanks for any help/advice in advance,

bec :)

-

Rebecca-Watsons-iMac:work iwatson$ jmap -histo:live 8992 | head -30

 num   #instances    #bytes      class name
--
   1:   18334714     1422732624  [C
   2:   18332491     733299640   java.lang.String
   3:   6104929      244197160   org.apache.lucene.index.TermInfo
   4:   6104929      244197160   org.apache.lucene.index.TermInfo
   5:   6104929      244197160   org.apache.lucene.index.TermInfo
   6:   6104921      195357472   org.apache.lucene.index.Term
   7:   6104921      195357472   org.apache.lucene.index.Term
   8:   6104921      195357472   org.apache.lucene.index.Term
   9:   224          146527408   [J
  10:   10           48839592    [Lorg.apache.lucene.index.TermInfo;
  11:   10           48839592    [Lorg.apache.lucene.index.Term;
  12:   10           48839592    [Lorg.apache.lucene.index.TermInfo;
  13:   10           48839592    [Lorg.apache.lucene.index.TermInfo;
  14:   10           48839592    [Lorg.apache.lucene.index.Term;
  15:   10           48839592    [Lorg.apache.lucene.index.Term;
  16:   416306264728
  17:   416305005104
  18:   40494596352
  19:   40493049984
  20:   31292580040
  21:   497132418496
  22:   49831067192  [B
  23:   4381         806104      java.lang.Class
  24:   5979         533064      [[I
  25:   6124         438080      [S
  26:   7951         381648      java.util.HashMap$Entry
  27:   2071         375744      [Ljava.util.HashMap$Entry;

Rebecca-Watsons-iMac:work iwatson$ ls ./mach-lcf/data/data-serv-lcf/artdoc1/index/*.tii
-rw-r--r--  1 iwatson  staff   26M 18 Aug 23:44 ./mach-lcf/data/data-serv-lcf/artdoc1/index/_36.tii
-rw-r--r--  1 iwatson  staff   26M 19 Aug 00:06 ./mach-lcf/data/data-serv-lcf/artdoc1/index/_69.tii
-rw-r--r--  1 iwatson  staff   25M 19 Aug 00:26 ./mach-lcf/data/data-serv-lcf/artdoc1/index/_9d.tii
-rw-r--r--  1 iwatson  staff   24M 19 Aug 00:50 ./mach-lcf/data/data-serv-lcf/artdoc1/index/_ch.tii
-rw-r--r--  1 iwatson  staff   25M 19 Aug 01:11 ./mach-lcf/data/data-serv-lcf/artdoc1/index/_fj.tii
-rw-r--r--  1 iwatson  staff  3.1M 19 Aug 01:12 ./mach-lcf/data/data-serv-lcf/artdoc1/index/_fq.tii
-rw-r--r--  1 iwatson  staff  3.1M 19 Aug 01:12 ./mach-lcf/data/data-serv-lcf/artdoc1/index/_g1.tii
-rw-r--r--  1 iwatson  staff  167B 19 Aug 01:10 ./mach-lcf/data/data-serv-lcf/artdoc1/index/_gb.tii
-rw-r--r--  1 iwatson  staff  3.1M 19 Aug 01:11 ./mach-lcf/data/data-serv-lcf/artdoc1/index/_gc.tii
-rw-r--r--  1 iwatson  staff  223K 19 Aug 01:23 ./mach-lcf/data/data-serv-lcf/artdoc1/index/_gd.tii
Re: Indexing Hanging during GC?
hi all, in case anyone is having similar issues now / in the future -- here's what I think is at least part of the problem: once I commit the index, the RAM requirement jumps because the .tii files are loaded in at that point and because i have a very large number of unique terms I use 200MB+ of RAM for every tii file (even though they are only about 25MB on disk due to the number of unique terms this results in a large memory requirement when they are loaded in). (thanks to people on the solr-user lists answering my question on this -- search for subject "tii RAM usage on startup"). so when I had auto-commit on, my RAM was slowing disappearing and eventually Solr hangs because the tii files are too big to load into memory. the suggestion from my other thread was to try solr/lucene trunk (as i'm using solr 1.4.1 and they have reduced the memory footprint within flexible indexing in Lucene) OR to increase the term index interval so I will try one/both of these and see if this means I can increase the number of documents I can index given my current hardware (6GB RAM) where these docs have a lot of unique terms! thanks :) bec On 13 August 2010 19:15, Rebecca Watson wrote: > hi, > > ok I have a theory about the cause of my problem -- java's GC failure > I think is due > to a solr memory leak caused from overlapping auto-commit calls -- > does that sound > plausible?? (ducking for cover now...) > > I watched the log files and noticed that when the threads start to increase > (from a stable 32 or so up to 72 before hanging!) there are two commit calls > too close to each other + it looked like the index is in the process > of merging at the time > of the first commit call -- i.e. first was a long commit call with > merge required > then before that one finished another commit call was issued. > > i think this was due to the autocommit settings I had: > > 50 > 90 > > > and eventually, it seems these two different auto-commit settings > would coincide!! > a few times this seems to happen and not cause a problem -- but I think two > eventually coincide where the first one is doing something heavy-duty > like a merge > over large index segments and so the system spirals downwards > > combined with the fact I was posting to Solr as fast as possible (LCF > was waiting > for Solr) --> i think this causes java to keel over and die. > > Two things were noticeable in Jconsole - > 1) lots of threads were spawned with the two commit calls - the thread > spawing started > after the first commit call making me think it was a commit requiring > an index merge... > whereby threads overall went from the stable 32 used during > indexing for the 2 hours prior to 72 or so within 15 minutes after the > two commit calls > were made... > > 2) both Old-gen/survivor heaps were almost totally full! so i think a > memory leak > is happening with overlapping commit calls + heavy duty lucene index > processing > behind solr (like index merge!?) > > So if the overlapping commit call (second commit called before first > one finished) > caused a memory leak and with old-gen/survivor heaps full > at that point, Solr became unresponsive and never recovered. > > is this expected when you use both autocommit settings / if concurrent commit > calls are issued to Solr? 
> > This explains why it was happening even if without the use of my > custom analysers > ("text" field type used in place of mine) but took longer to happen > --> my analysers > are more expensive CPU/RAM-wise so the overlapping commit calls were less > likely > to be forgiven as my system was already using a lot of RAM... > > Also, I played with the GC settings a bit where I could find settings > that helped > to postpone this issue as they were more forgiving to the increased > RAM usage during > overlapping commit calls (GC settings with increased eden heap space). > > Solr was hanging after about 14k files (each one an article with a set > of that > are each sentences in the article) with a total of about > 7 million index documents. > > If i switch off both auto-commit settings I can get through my > smallish 20k file set (10 million index > s) in 4 hours. > > I'm Trying to run now on 100k articles (50 million index within > 100k files) > where I use LCF to crawl/post each file to Solr so i'll email an > update about this. > > if this works ok i'm then going to try using only one auto-commit > setting rather than two and see > if this works ok. > > thanks :) > > bec > > > On 13 August 2010 00:24, Rebecca Watson wrote: >> hi, >> >>> 1) I assume you are doing batching intersp
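p.s. for anyone else who hits this: the term-index-interval route means telling lucene to keep every Nth term in the in-memory term index rather than every 128th (the default), which should shrink what gets loaded from the .tii files roughly in proportion. i believe this is exposed as a termIndexInterval setting in the index section of solrconfig.xml, something like:

<termIndexInterval>256</termIndexInterval>

but i still need to double-check whether 1.4.1 actually supports that element (the other option mentioned was moving to solr/lucene trunk, where the flexible-indexing work has cut the per-term memory cost anyway).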
Sorting Facets by First Occurrence
I'm working on replacing a custom, internal search implementation with Solr. I'm having great success, with one small exception. When implementing our version of faceting, one of our facets had a peculiar sort order. It was dictated by the order in which the field occurred in the results. The first time a value occurred it was added to the list and regardless of the number of times it occurred, it always stayed at the top. For example, if a search yielded 10 results, 1 - 10, and hit 1 is in category 'Toys', hit 2 through 9 are in 'Sports' and the last is in 'Household' then the facet would look like: facet.fields -> category -> [ Toys: 1, Sports: 8 Household: 1 ] The facet.sort only gives me the option to sort highest count first or alphabetically. So, the question I _really_ have is: how can I implement this feature? I could examine the results i'm returned and create my own facet order from it, but I thought this might be useful for others. I don't know my way around Solr's source, so I though dropping a note to the list would be faster than code spelunking with no light. -- Cory 'G' Watson http://www.onemogin.com
Negative boosts
Hello, I have been developing a new search application based on Solr (Very nice!) using dismax. We are using query-time boosts to provide better search results for user queries and index-time boosts to promote certain documents over others. My question is about the latter: We have a "position" field available at index time that is an integer value, 0 being the 1st position, 1 being the 2nd position, 99 being the hundredth, etc. How do I get Solr to return documents in that same order? Is it possible to apply a negative boost? ... ... ... I notice in the documentation for parseFieldBoosts that the routine "Doesn't care if boost info is negative, you're on your own.", but what does that mean? There is a desire to preserve the order as 0..xx and not reverse it (which would be the obvious choice - x becomes 0, 0 becomes x) because we are adding multiple sets of positions to the index, and I want the first position of each set to be equal (zero). This weighting is important to us for user queries as well as filtered views. Is there another way to get what I'm looking for? Thanks, Derek
Re: Negative boosts
> If you want documents returned in the same order as a field, it's easy... you sort!
>
> If you want the value of a field to influence a score, not determine the
> exact sort order, you can use FunctionQuery (currently hacked into the
> query parser as _val_:myfield)

That seems like what I want -- boosting and not sorting. Is there a function that will give me a bigger boost for field values closer to zero?
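(Reading the FunctionQuery wiki, it looks like something along these lines might do it -- a guess on my part, not tested:

q=ipod _val_:"recip(position,1,100,100)"

recip(x,m,a,b) computes a/(m*x+b), so position 0 scores 1.0 and position 99 scores about 0.5, i.e. a monotonically bigger boost the closer the value is to zero. Presumably the same expression could go in a dismax bf parameter. Is that the right idea?)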
Re: Sorting Facets by First Occurrence
On Nov 30, 2009, at 5:15 PM, Chris Hostetter wrote:

> All of Solr's existing faceting code is based on the DocSet which is an
> unordered set of all matching documents -- i suspect your existing
> application only reordered the facets based on their appearance in the
> first N docs (possibly just the first page, but maybe more) so doing
> something like that using the DocList would certainly be feasible. if
> your number of facet constraints is low enough that they are all returned
> everytime then doing it in the client is probably the easiest -- but if
> you have to worry about facet.limit preventing something from being
> returned that might otherwise bubble to the top of your list when you
> reorder it then you'll need to customise the FacetComponent.

You are right, I left out a few important bits there. Tried to be brief and succeeded in being vague? :)

Effectively I was ordering the facet based on the N documents in the current "page". My thought that this was a good feature for a facet now seems incorrect, as my needs are limited to the current page, not the whole set of results. I'll probably elect to fetch data from the facets based on the page of documents I'm showing.

Thanks for the discussion, it helped! :)

Cory G Watson
http://www.onemogin.com
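Something like this on the client side is probably all it takes (a sketch -- pageCategories stands in for the category value of each hit on the current page, in result order):

import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class FirstOccurrenceFacet {
    // Order facet values by the first time they appear in the page of hits,
    // counting occurrences as we go; LinkedHashMap preserves insertion order.
    public static Map<String, Integer> order(List<String> pageCategories) {
        Map<String, Integer> ordered = new LinkedHashMap<String, Integer>();
        for (String category : pageCategories) {
            Integer count = ordered.get(category);
            ordered.put(category, count == null ? 1 : count + 1);
        }
        return ordered;
    }

    public static void main(String[] args) {
        // Toys first, then Sports x8, then Household -- matches the example above
        List<String> page = Arrays.asList("Toys", "Sports", "Sports", "Sports",
                "Sports", "Sports", "Sports", "Sports", "Sports", "Household");
        System.out.println(order(page)); // {Toys=1, Sports=8, Household=1}
    }
}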