Re: how to get all the docIds in the search result?
That's the only way I can think of doing it through Solr. What is the configuration of the handler you're calling? It could be that highlighting or faceting are turned on and slowing down your query. Toby. On 23 Jul 2009, at 10:35, shb wrote: I have tried the following code: query.setRows(Integer.MAX_VALUE); query.setFields("id"); when it returns 1,000,000 records, it takes about 22s. This is very slow. Is there any other way? 2009/7/23 Toby Cole Have you tried limiting the fields that you're requesting to just the ID? Something along the lines of: query.setRows(Integer.MAX_VALUE); query.setFields("id"); Might speed the query up a little. On 23 Jul 2009, at 09:11, shb wrote: Here id is indeed the uniqueKey of a document. I want to get all the ids for some other usage. 2009/7/23 Shalin Shekhar Mangar On Thu, Jul 23, 2009 at 1:09 PM, shb wrote: if I use query.setRows(Integer.MAX_VALUE); the query will become very slow, because the searcher will go and fetch the field values in the index for all the returned documents. So if I set query.setRows(10), is there any other way to get all the ids? thanks You should fetch as many rows as you need and not more. Why do you need all the ids? I'm assuming that by id you mean the uniqueKey of a document. -- Regards, Shalin Shekhar Mangar. -- Toby Cole Software Engineer, Semantico Limited Registered in England and Wales no. 03841410, VAT no. GB-744614334. Registered office Lees House, 21-23 Dyke Road, Brighton BN1 3FE, UK. Check out all our latest news and thinking on the Discovery blog http://blogs.semantico.com/discovery-blog/
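If you really do need every id, a paged SolrJ loop is usually kinder to the server than one Integer.MAX_VALUE request. A minimal sketch, assuming a core at the default URL and an "id" uniqueKey; the chunk size is a guess to tune, and note that very large start offsets themselves get progressively slower:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.common.SolrDocument;
    import org.apache.solr.common.SolrDocumentList;
    import java.util.ArrayList;
    import java.util.List;

    public List<String> fetchAllIds() throws Exception {
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
        List<String> ids = new ArrayList<String>();
        SolrQuery query = new SolrQuery("*:*");
        query.setFields("id");      // fetch only the uniqueKey
        int page = 10000;           // chunk size: tune for your index
        for (int start = 0; ; start += page) {
            query.setStart(start);
            query.setRows(page);
            SolrDocumentList results = server.query(query).getResults();
            for (SolrDocument doc : results) {
                ids.add((String) doc.getFieldValue("id"));
            }
            if (start + page >= results.getNumFound()) break;
        }
        return ids;
    }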
Any chance of getting that stack trace as more than one line? :) Also, where are you posting your documents from? (e.g. Java, PHP, command line etc). It sounds like you're not using 'entities' for your '&' characters (ampersands) in your XML. These should be converted to "&amp;". This should look familiar if you've ever written any HTML. On 30 Jul 2009, at 09:44, Jörg Agatz wrote: Good Morning SolR :-) it's morning in Germany! I have a problem with the indexing... I often get an error. I think it is because this character "&" appears in the XML. I need the character, so what happens? SimplePostTool: FATAL: Solr returned an error:
com.ctc.wstx.exc.WstxLazyException: Unexpected character ' ' (code 32; missing name?) at [row,col {unknown-source}]: [1,465]
com.ctc.wstx.exc.WstxLazyException: com.ctc.wstx.exc.WstxUnexpectedCharException: Unexpected character ' ' (code 32; missing name?) at [row,col {unknown-source}]: [1,465]
at com.ctc.wstx.exc.WstxLazyException.throwLazily(WstxLazyException.java:45)
at com.ctc.wstx.sr.StreamScanner.throwLazyError(StreamScanner.java:729)
at com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3659)
at com.ctc.wstx.sr.BasicStreamReader.getText(BasicStreamReader.java:809)
at org.apache.solr.handler.XMLLoader.readDoc(XMLLoader.java:278)
at org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:139)
at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:69)
at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1299)
at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365)
at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712)
at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405)
at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:211)
at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:139)
at org.mortbay.jetty.Server.handle(Server.java:285)
at org.mortbay.jetty.Ht...
-- Toby Cole Software Engineer, Semantico Limited Registered in England and Wales no. 03841410, VAT no. GB-744614334. Registered office Lees House, 21-23 Dyke Road, Brighton BN1 3FE, UK. Check out all our latest news and thinking on the Discovery blog http://blogs.semantico.com/discovery-blog/
On 30 Jul 2009, at 11:17, Jörg Agatz wrote: It sounds like you're not using 'entities' for your '&' characters (ampersands) in your XML. These should be converted to "&amp;". This should look familiar if you've ever written any HTML. I don't understand this, must I change every & to &amp;? Yes, '&' characters aren't allowed in XML unless they are either in a CDATA section or part of an 'entity'. A good place to read up on this is: http://www.xml.com/pub/a/2001/01/31/qanda.html In short, replace all your & with &amp;. -- Toby Cole Software Engineer, Semantico Limited Registered in England and Wales no. 03841410, VAT no. GB-744614334. Registered office Lees House, 21-23 Dyke Road, Brighton BN1 3FE, UK. Check out all our latest news and thinking on the Discovery blog http://blogs.semantico.com/discovery-blog/
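If you're building the update XML by hand, a minimal plain-Java sketch of the escaping (a proper XML library or serializer does this for you and is the safer route):

    // order matters: escape '&' first so we don't double-escape the entities we add
    String raw = "Fish & Chips";
    String safe = raw.replace("&", "&amp;")
                     .replace("<", "&lt;")
                     .replace(">", "&gt;");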
Re: Delete solr data from disk space
Hi Ashish, Have you optimized your index? When you delete documents in Lucene they are simply marked as 'deleted'; they aren't physically removed from the disk. To get the disk space back you must run an optimize, which re-writes the index out to disk without the deleted documents, then deletes the original. Toby On 4 Aug 2009, at 14:41, Ashish Kumar Srivastava wrote: Hi, Sorry!! But this solution will not work because I deleted the data by a certain query. So how can I know which files should be deleted? I can't delete the whole data directory. Markus Jelsma - Buyways B.V. wrote: Hello, A rigorous but quite effective method is manually deleting the files in your SOLR_HOME/data directory and reindexing the documents you want. This will surely free some disk space. Cheers, - Markus Jelsma Buyways B.V. Tel. 050-3118123 Technisch Architect, Friesestraatweg 215c Fax. 050-3118124 http://www.buyways.nl 9743 AD Groningen KvK 01074105 On Tue, 2009-08-04 at 06:26 -0700, Ashish Kumar Srivastava wrote: I am facing a problem in deleting Solr data from disk space. I had 80GB of Solr data. I deleted 30% of this data by using a query in the solr-php client and committed. Now the deleted data is not visible from the Solr UI, but the used disk space is still 80GB. Please reply if you have any solution to free the disk space after deleting some Solr data. Thanks in advance. -- View this message in context: http://www.nabble.com/Delete-solr-data-from-disk-space-tp24808676p24808883.html Sent from the Solr - User mailing list archive at Nabble.com. -- Toby Cole Software Engineer, Semantico Limited Registered in England and Wales no. 03841410, VAT no. GB-744614334. Registered office Lees House, 21-23 Dyke Road, Brighton BN1 3FE, UK. Check out all our latest news and thinking on the Discovery blog http://blogs.semantico.com/discovery-blog/
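A hedged SolrJ sketch of the optimize call; the URL is an assumption, and posting <optimize/> to /update does the same thing:

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

    SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
    // rewrites the index without the deleted documents, then removes the old files;
    // expect it to need extra free disk (roughly the index size again) while it runs
    server.optimize();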
Re: Proximity Search
See the Lucene query parser syntax documentation: http://lucene.apache.org/java/2_3_2/queryparsersyntax.html#Proximity%20Searches Basically... "shell petroleum"~10 should do the trick (if you're using the standard request handler; I can't remember if dismax supports proximity). On 18 Aug 2009, at 13:28, Ninad Raut wrote: Hi, I want to count the words between two significant words like "shell" and "petroleum". Or I want to write a query to find all the documents where the content has "shell" and "petroleum" in close proximity, with less than 10 words between them. Can such queries be created in Solr? Regards, Ninad Raut. -- Toby Cole Software Engineer, Semantico Limited Registered in England and Wales no. 03841410, VAT no. GB-744614334. Registered office Lees House, 21-23 Dyke Road, Brighton BN1 3FE, UK. Check out all our latest news and thinking on the Discovery blog http://blogs.semantico.com/discovery-blog/
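In SolrJ terms, a minimal sketch (just remember to escape the inner quotes):

    // ~10 means the two terms may match up to 10 positions apart
    SolrQuery query = new SolrQuery("\"shell petroleum\"~10");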
Re: Can I search for a term in any field or a list of fields?
I would consider using the dismax query handler. This allows you to send a list of keywords or phrases along with the fields to search over. E.g., you could use ?qt=dismax&q=foo&qf=title+text+keywords+concept More details here: http://wiki.apache.org/solr/DisMaxRequestHandler On 18 Aug 2009, at 15:56, Paul Tomblin wrote: So if I want to make it so that the default search always searches three specific fields, I can make another multi-valued field that they are all copied into? On Tue, Aug 18, 2009 at 10:46 AM, Marco Westermann wrote: I would say you should use the copyField tag in the schema, e.g. the text field has to be defined as multiValued="true". When you then do an unqualified search, it will search every field which is copied to the text field. -- http://www.linkedin.com/in/paultomblin -- Toby Cole Software Engineer, Semantico Limited Registered in England and Wales no. 03841410, VAT no. GB-744614334. Registered office Lees House, 21-23 Dyke Road, Brighton BN1 3FE, UK. Check out all our latest news and thinking on the Discovery blog http://blogs.semantico.com/discovery-blog/
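A sketch of the copyField approach Marco describes, borrowing the field names from the dismax example above (adjust to your schema):

    <field name="text" type="text" indexed="true" stored="false" multiValued="true"/>
    <copyField source="title" dest="text"/>
    <copyField source="keywords" dest="text"/>
    <copyField source="concept" dest="text"/>
    <!-- unqualified queries then hit the combined field -->
    <defaultSearchField>text</defaultSearchField>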
Re: Status of Spelt integration
Hi Andrew, We ended up abandoning the Spelt integration as the built-in Solr spellchecking improved so much during our project. Also, if you did go the route of using Spelt, I'd implement it as a spellcheck plugin (which didn't exist as a concept when we started trying to shoehorn Spelt into Solr). Regards, Toby. On 30 Nov 2009, at 11:29, Andrey Klochkov wrote: Hi all, I searched through the mailing-list archives and saw that some time ago Toby Cole was going to integrate a spellchecker named Spelt into Solr. Does anyone know what the status of this is? Has anyone tried to use it with Solr? Does it make sense to try it instead of the standard spell checker? Some links on the subject: http://markmail.org/message/cqt4qtzzwyceltqu#query:+page:1+mid:cqt4qtzzwyceltqu+state:results http://markmail.org/search/?q=spelt#query:spelt+page:1+mid:krzofzojhg7hmms7+state:results http://groups.google.com/group/spelt -- Andrew Klochkov Senior Software Engineer, Grid Dynamics
Re: Status of Spelt integration
I'm pretty sure this isn't a Solr-related question. Have you tried asking on the eGroupware mailing lists? http://sourceforge.net/mail/?group_id=78745 Toby. On 7 Dec 2009, at 08:52, freerk55 wrote: The standard spell checker of Thunderbird works in eGroupware, but not in Felamimail!!?? Why not? How can I get it working as it does in the rest of eGroupware? Freerk Jongsma Toby Cole-2 wrote: Hi Andrew, We ended up abandoning the Spelt integration as the built-in Solr spellchecking improved so much during our project. Also, if you did go the route of using Spelt, I'd implement it as a spellcheck plugin (which didn't exist as a concept when we started trying to shoehorn Spelt into Solr). Regards, Toby. On 30 Nov 2009, at 11:29, Andrey Klochkov wrote: Hi all, I searched through the mailing-list archives and saw that some time ago Toby Cole was going to integrate a spellchecker named Spelt into Solr. Does anyone know what the status of this is? Has anyone tried to use it with Solr? Does it make sense to try it instead of the standard spell checker? Some links on the subject: http://markmail.org/message/cqt4qtzzwyceltqu#query:+page:1+mid:cqt4qtzzwyceltqu+state:results http://markmail.org/search/?q=spelt#query:spelt+page:1+mid:krzofzojhg7hmms7+state:results http://groups.google.com/group/spelt -- Andrew Klochkov Senior Software Engineer, Grid Dynamics -- View this message in context: http://old.nabble.com/Status-of-Spelt-integration-tp26573196p26674324.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Field Collapsing - disable cache
If you take out the fieldCollapsing/fieldCollapseCache element in your config, the fieldcollapse component will not use a cache. From http://wiki.apache.org/solr/FieldCollapsing#line-63 "If the field collapse cache is not configured then the field collapse logic will not be cached." Regards, Toby. On 22 Dec 2009, at 10:56, r...@intelcompute.com wrote: my solrconfig can be seen at http://www.intelcompute.com/solrconfig.xml On Tue 22/12/09 10:51, r...@intelcompute.com wrote: Is it possible to disable the field collapsing cache? I'm trying to perform some speed tests, and have managed to comment out the filter, queryResult, and document caches successfully. On 1.5 ... collapse facet tvComponent ... - Message sent via Atmail Open - http://atmail.org/ -- Toby Cole Senior Software Engineer, Semantico Limited Registered in England and Wales no. 03841410, VAT no. GB-744614334. Registered office Lees House, 21-23 Dyke Road, Brighton BN1 3FE, UK. Check out all our latest news and thinking on the Discovery blog http://blogs.semantico.com/discovery-blog/
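For reference, a sketch of the relevant solrconfig.xml pieces. The component class name is taken from the stack trace later in this thread; the fieldCollapsing/fieldCollapseCache layout follows the wiki page cited above, so treat the exact nesting as an approximation:

    <searchComponent name="collapse" class="org.apache.solr.handler.component.CollapseComponent"/>

    <!-- leave this whole block (or at least the cache element) out to run collapsing uncached -->
    <fieldCollapsing>
      <fieldCollapseCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="128"/>
    </fieldCollapsing>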
Re: Field Collapsing - disable cache
Which elements did you comment out? It could be the case that you need to get rid of the entire fieldCollapsing element, not just the fieldCollapseCache element. (Disclaimer: I've not used field collapsing in anger before :) Toby. On 22 Dec 2009, at 11:09, r...@intelcompute.com wrote: That's what I assumed, but I'm getting the following error with it commented out: MESSAGE null java.lang.NullPointerException
at org.apache.solr.search.fieldcollapse.AbstractDocumentCollapser.createDocumentCollapseResult(AbstractDocumentCollapser.java:276)
at org.apache.solr.search.fieldcollapse.AbstractDocumentCollapser.executeCollapse(AbstractDocumentCollapser.java:249)
at org.apache.solr.search.fieldcollapse.AbstractDocumentCollapser.collapse(AbstractDocumentCollapser.java:172)
at org.apache.solr.handler.component.CollapseComponent.doProcess(CollapseComponent.java:173)
at org.apache.solr.handler.component.CollapseComponent.process(CollapseComponent.java:127)
at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:195)
at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:336)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:239)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:215)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:188)
at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:210)
at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:172)
at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:117)
at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:108)
at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:151)
at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:870)
at org.apache.coyote.http11.Http11BaseProtocol$Http11ConnectionHandler.processConnection(Http11BaseProtocol.java:665)
at org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(PoolTcpEndpoint.java:528)
at org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(LeaderFollowerWorkerThread.java:81)
at org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:685)
at java.lang.Thread.run(Thread.java:636)
On Tue 22/12/09 11:02, Toby Cole wrote: If you take out the fieldCollapsing/fieldCollapseCache element in your config the fieldcollapse component will not use a cache. From http://wiki.apache.org/solr/FieldCollapsing#line-63 "If the field collapse cache is not configured then the field collapse logic will not be cached." Regards, Toby. On 22 Dec 2009, at 10:56, wrote: my solrconfig can be seen at http://www.intelcompute.com/solrconfig.xml On Tue 22/12/09 10:51, wrote: Is it possible to disable the field collapsing cache? I'm trying to perform some speed tests, and have managed to comment out the filter, queryResult, and document caches successfully. On 1.5 ... collapse facet tvComponent ...
- Message sent via Atmail Open - http://atmail.org/ -- Toby Cole Senior Software Engineer, Semantico Limited Registered in England and Wales no. 03841410, VAT no. GB-744614334. Registered office Lees House, 21-23 Dyke Road, Brighton BN1 3FE, UK. Check out all our latest news and thinking on the Discovery blog http://blogs.semantico.com/discovery-blog/
Deadlock with DirectUpdateHandler2
Has anyone else experienced a deadlock when the DirectUpdateHandler2 does an autocommit? I'm using a recent snapshot from hudson (apache-solr-2008-11-12_08-06-21), and quite often when I'm loading data the server (Tomcat 6) gets stuck at line 469 of DirectUpdateHandler2: // Check if there is a commit already scheduled for longer then this time if( pending != null && pending.getDelay(TimeUnit.MILLISECONDS) >= commitMaxTime ) Anyone got any enlightening tips? Cheers, Toby Cole Software Engineer Semantico Lees House, Floor 1, 21-23 Dyke Road, Brighton BN1 3FE T: +44 (0)1273 358 238 F: +44 (0)1273 723 232 E: [EMAIL PROTECTED] W: www.semantico.com
Re: Deadlock with DirectUpdateHandler2
On 18 Nov 2008, at 20:18, Mark Miller wrote: Mike Klaas wrote: autoCommitCount is written in a CommitTracker.synchronized block only. It is read to print stats in an unsynchronized fashion, which perhaps could be fixed, though I can't see how it could cause a problem. lastAddedTime is only written in a call path within a DirectUpdateHandler2.synchronized block. It is only read in a CommitTracker.synchronized block. It could read the wrong value, but I also don't see this causing a problem (a commit might fail to be scheduled). This could probably also be improved, but doesn't seem important. Right. I don't see these as causing a deadlock either, but whatever happens, it's pretty much JVM-undefined, right? Hence 'who knows' (I'll go with pretty doubtful). I am not so sure it's safe to read a value from an unsynced method whether you care about the result or not, though. It's probably safe for atomic types and volatiles, but I'm fairly sure you're playing with fire doing read/write in and out of sync. I don't think it's just about stale values. But then again, it probably works 99.9% of the time or something. pending seems to be the issue. As long as commits are only triggered by autocommit, there is no issue, as manipulation of pending is always performed inside CommitTracker.synchronized. But didCommit()/didRollback() could be called via manual commit, and pending is directly manipulated during DUH2.close(). I'm having trouble coming up with a plausible deadlock scenario, but this needs to be fixed. It isn't as easy as synchronizing didCommit/didRollback, though--this would introduce definite deadlock scenarios. Mark, is there any chance you could post the thread dump for the deadlocked process? Do you issue manual commits during insertion? Toby reported it. Thread dump Toby? -Mike I'll try and post a thread dump when I get to work, can't remote in from here. I don't mind helping out with the fix, I've been getting to know Solr's internals quite intimately recently after writing a few handlers/components for internal projects. T
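For anyone else chasing this, a quick sketch of grabbing a thread dump from a stuck Tomcat with the standard JDK tools (the pid and output path here are arbitrary):

    jps -l                        # find the Tomcat/Solr JVM's pid
    jstack 12345 > threads.txt    # or: kill -3 12345, which dumps to catalina.out

The dump shows which monitors each thread holds and is waiting on, which is exactly what's needed to confirm a deadlock.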
Re: disappearing index
Could be that all your documents have not yet been committed. Have you tried running a commit? On 3 Dec 2008, at 15:00, Justin wrote: I built up two indexes using a multicore configuration, one containing 52,000+ documents and the other over 10 million; the entire indexing process showed no errors. The server crashed overnight, well after the indexing had completed, and now no documents are reported for either index. This despite the fact that the cores both have huge /data folders (one is 1.5GB, the other is 8.5GB). Any ideas? Toby Cole Software Engineer Semantico Lees House, Floor 1, 21-23 Dyke Road, Brighton BN1 3FE T: +44 (0)1273 358 238 F: +44 (0)1273 723 232 E: [EMAIL PROTECTED] W: www.semantico.com
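A one-line SolrJ sketch of the commit (the core URL is an assumption; posting <commit/> to /update does the same):

    SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr/core0");
    server.commit();   // makes all pending adds visible to searchers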
Re: Nightly build - 2008-12-17.tgz - build error - java.lang.NoClassDefFoundError: org/mozilla/javascript/tools/shell/Main
I came across this too earlier; I just deleted the contrib/javascript directory. Of course, if you need the JavaScript library then you'll have to get it building. Sorry, probably not that helpful. :) Toby. On 17 Dec 2008, at 17:03, Kay Kay wrote: I downloaded the latest .tgz and ran $ ant dist docs: [mkdir] Created dir: /opt/src/apache-solr-nightly/contrib/javascript/dist/doc [java] Exception in thread "main" java.lang.NoClassDefFoundError: org/mozilla/javascript/tools/shell/Main [java] at JsRun.main(Unknown Source) [java] Caused by: java.lang.ClassNotFoundException: org.mozilla.javascript.tools.shell.Main [java] at java.net.URLClassLoader$1.run(URLClassLoader.java:200) [java] at java.security.AccessController.doPrivileged(Native Method) [java] at java.net.URLClassLoader.findClass(URLClassLoader.java:188) [java] at java.lang.ClassLoader.loadClass(ClassLoader.java:307) [java] at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301) [java] at java.lang.ClassLoader.loadClass(ClassLoader.java:252) [java] at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:320) [java] ... 1 more BUILD FAILED /opt/src/apache-solr-nightly/common-build.xml:335: The following error occurred while executing this line: /opt/src/apache-solr-nightly/common-build.xml:212: The following error occurred while executing this line: /opt/src/apache-solr-nightly/contrib/javascript/build.xml:74: Java returned: 1 and came across the above mentioned error. The class seems to be from the Rhino (Mozilla JS) library. Is it supposed to be packaged by default, or is there a license restriction that prevents it from being so? Toby Cole Software Engineer Semantico Lees House, Floor 1, 21-23 Dyke Road, Brighton BN1 3FE T: +44 (0)1273 358 238 F: +44 (0)1273 723 232 E: toby.c...@semantico.com W: www.semantico.com
Re: How to select *actual* match from a multi-valued field
We came across this problem; unfortunately we gave up and did our hit-highlighting for multi-valued fields on the frontend. :-/ One approach would be to extend Solr to return every value of a multi-valued field in the highlighting, regardless of whether that particular value matched. Just an idea, don't know if it's feasible or not. If anyone can point me in the right direction I could probably bash together a plugin and some tests. Toby. On 20 Jan 2009, at 16:31, Feak, Todd wrote: Anyone that can shed some insight? -Todd -----Original Message----- From: Feak, Todd [mailto:todd.f...@smss.sony.com] Sent: Friday, January 16, 2009 9:55 AM To: solr-user@lucene.apache.org Subject: How to select *actual* match from a multi-valued field At a high level, I'm trying to do some more intelligent searching using an app that will send multiple queries to Solr. My current issue is around multi-valued fields and determining which entry actually generated the "hit" for a particular query. For example, let's say that I have a multi-valued field containing people's names, associated with the document (trying to be non-specific on purpose). In one document, I have the following names: Jane Smith, Bob Smith, Roger Smith, Jane Doe. If the user performs a search for Bob Smith, this document is returned. What I want to know is that this document was returned because of "Bob Smith", not because of Jane or Roger. I've tried using the highlighting settings. They do provide some help, as the Jane Doe entry doesn't come back highlighted, but both Jane and Roger do. I've tried using hl.requireFieldMatch, but that seems to pertain only to fields, not entries within a multi-valued field. Using Solr, is there a way to get the information I am looking for? Specifically, that "Bob Smith" is the value in the multi-valued field that triggered the hit? -Todd Feak Toby Cole Software Engineer Semantico Lees House, Floor 1, 21-23 Dyke Road, Brighton BN1 3FE T: +44 (0)1273 358 238 F: +44 (0)1273 723 232 E: toby.c...@semantico.com W: www.semantico.com
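For reference, a sketch of the highlighting parameters in play here; the field name 'names' is borrowed from Todd's example, and hl.snippets simply raises the cap on how many highlighted values come back per field:

    ?q=names:"Bob Smith"&hl=true&hl.fl=names&hl.requireFieldMatch=true&hl.snippets=5

As Todd notes, this still highlights every value containing any query term (Jane and Roger as well as Bob), which is exactly the limitation under discussion.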
Re: what crawler do you use for Solr indexing?
Hi Tony, Strangely, I started looking into the Solr/Nutch integration yesterday, so I might be able to help :) The documentation for it is very sparse, but the trunk of Nutch does have the Solr integration committed. If I remember correctly, what I had to do was: I went through one of the Nutch setup guides and set it up as if I wasn't going to use Solr (can't remember which one, sorry). Copy the crawl script from here: http://www.foofactory.fi/files/nutch-solr/crawl.sh into my Nutch directory. I was running this under the soy-latte JVM on OSX, and I had to modify the crawler a little to pick up filenames instead of permissions strings. This line was changed (note the 'cut' command): SEGMENT=`bin/hadoop dfs -ls $BASEDIR/segments|grep $BASEDIR|cut -d\  -f17|sort|tail -1` I also changed the second to last line to match the required parameters for the new Solr indexer: bin/nutch org.apache.nutch.indexer.solr.SolrIndexer http://localhost:8983/solr/ $BASEDIR/crawldb $BASEDIR/linkdb $SEGMENT Copy the schema.xml from the Nutch config directory into a fresh Solr install & start it up. Run crawl.sh, and you should end up with content in your Solr instance. I probably won't be able to answer many Nutch-related questions, but that's how I managed to get it up and running. Toby. On 6 Mar 2009, at 11:27, Andrzej Bialecki wrote: Tony Wang wrote: Hi Hoss, But I cannot find documents about the integration of Nutch and Solr anywhere. Could you give me some clue? thanks Tony, I suggest that you follow Hoss's advice and ask these questions on nutch-user. This integration is built into Nutch, and not Solr, so it's less likely that people on this list know what you are talking about. This integration is quite fresh, too, so there are almost no docs except on the mailing list. Eventually someone is going to create some docs, and if you keep asking questions on nutch-user you will contribute to the creation of such docs ;) -- Best regards, Andrzej Bialecki <>< ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com Toby Cole Software Engineer Semantico Lees House, Floor 1, 21-23 Dyke Road, Brighton BN1 3FE T: +44 (0)1273 358 238 F: +44 (0)1273 723 232 E: toby.c...@semantico.com W: www.semantico.com
Re: Solr: ERRORs at Startup
10:51:22,564 ERROR [STDERR] Mar 13, 2009 10:51:22 AM org.apache.solr.core.SolrCore parseListener INFO: Added SolrEventListener: org.apache.solr.core.QuerySenderListener{queries=[{q=fast_warm,start=0,rows=10}, {q=static firstSearcher warming query from solrconfig.xml}]} What am I missing? :-( Any idea? thanks in advance. Giovanni Toby Cole Software Engineer Semantico E: toby.c...@semantico.com W: www.semantico.com
Re: Storing "map" in Field
I don't think anything _quite_ like that exists; however, you could use wildcard (dynamic) fields to achieve pretty much the same thing: post the SKU (SKU001), the product name (A Sample Product) and one field per price list (119.99, 109.99), with a single wildcard field definition in your schema.xml catching all the price fields; a sketch follows below. Regards, Toby. On 13 Mar 2009, at 14:01, Jeff Crowder wrote: All, I'm working with the sample schema, and have a scenario where I would like to store multiple prices in a "map" of some sort. This would be used for a scenario where a single "product" (SKU001, A Sample Product) has different "prices" (119.99, 109.99) based on a price list. Is something like this possible? Regards, -Jeff Toby Cole Software Engineer Semantico E: toby.c...@semantico.com W: www.semantico.com
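A hypothetical reconstruction of the stripped XML; the field names (sku, name, price_*) are assumptions for illustration, not from the original mail:

    <!-- update message: field names are assumptions -->
    <add>
      <doc>
        <field name="sku">SKU001</field>
        <field name="name">A Sample Product</field>
        <field name="price_retail">119.99</field>
        <field name="price_wholesale">109.99</field>
      </doc>
    </add>

with a dynamic field in schema.xml so that any price_* field is accepted without being declared individually:

    <dynamicField name="price_*" type="float" indexed="true" stored="true"/>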
Re: why is query so slow
Peter, If possible try running a 1.4 snapshot of Solr; the faceting improvements are quite remarkable. However, if you can't run unreleased code, it might be an idea to try reducing the number of unique terms (try indexing surnames only?). Toby. On 17 Mar 2009, at 10:01, pcurila wrote: I am using 1.3 How many terms are in the wasCreatedBy_fct field? How is that field and its type configured? The field contains author names and there are lots of them. Here is the type configuration: <fieldType ... positionIncrementGap="100"> and the field: <field name="wasCreatedBy_fct" ... stored="true" multiValued="true"/> -- View this message in context: http://www.nabble.com/why-is-query-so-slow-tp22554340p22555842.html Sent from the Solr - User mailing list archive at Nabble.com. Toby Cole Software Engineer Semantico E: toby.c...@semantico.com W: www.semantico.com
Re: Index Creation Exception in solr
If you're using a recent 1.4 snapshot you should be able to do a rollback: https://issues.apache.org/jira/browse/SOLR-670 Otherwise, if you have unique IDs in your index, you can just post new documents over the top of the old ones, then commit. Toby. On 18 Mar 2009, at 10:19, dabboo wrote: But if I already have some indexes in the index folder then these old indexes will also get deleted. Is there any way to roll back the operation? Shalin Shekhar Mangar wrote: On Wed, Mar 18, 2009 at 3:15 PM, dabboo wrote: Hi, I am creating indexes in Solr and facing an unusual issue. I am creating 5 indexes and the XML file of the 4th index is malformed. So, while creating indexes it properly submits index #1, 2 & 3 and throws an exception after submission of index 4. I think you mean documents, not indexes. Each document goes into the Lucene/Solr index. Now, if I look for index #1, 2 & 3, it doesn't show up, which I think is happening because the operation is not committed yet. But these indexes must be lying somewhere temporarily in Solr and I am not able to delete these indexes. Just delete the -- View this message in context: http://www.nabble.com/Index-Creation-Exception-in-solr-tp22575618p22576093.html Sent from the Solr - User mailing list archive at Nabble.com. Toby Cole Software Engineer Semantico E: toby.c...@semantico.com W: www.semantico.com
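A hedged SolrJ sketch of the rollback; it needs a 1.4 snapshot with SOLR-670 in it, and the URL is an assumption:

    SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
    server.rollback();   // discards all uncommitted adds and deletes since the last commit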
Re: Problem encoding ':' char in a solr query
You'll need to escape the colon with a backslash, e.g. fileAbsolutePath:file\:///Volumes/spare/ts/ford/schema/data/news/fdw2008/jn71796.xml See the Lucene query parser syntax page: http://lucene.apache.org/java/2_3_2/queryparsersyntax.html#Escaping%20Special%20Characters Toby. On 18 Mar 2009, at 11:28, Fergus McMenemie wrote: Hello, I have a Solr field: <field name="fileAbsolutePath" ... stored="true" multiValued="false"/> which an unrelated query reveals is populated with: file:///Volumes/spare/ts/ford/schema/data/news/fdw2008/jn71796.xml However, when I try and query for that exact document explicitly: http://localhost:8080/apache-solr-1.4-dev/select?q=fileAbsolutePath:file%3a///Volumes/spare/ts/ford/schema/data/news/fdw2008/jn71796.xml&wt=xml it fails: HTTP Status 400 - org.apache.lucene.queryParser.ParseException: Cannot parse 'fileAbsolutePath:file:///Volumes/spare/ts/ford/schema/data/news/fdw2008/jn71796.xml': Encountered ":" at line 1, column 21. Was expecting one of: ... "+" ... "-" ... "(" ... "*" ... "^" ... "[" ... "{" ... My encoding did not work! Help! -- === Fergus McMenemie Email:fer...@twig.me.uk Techmore Ltd Phone:(UK) 07721 376021 Unix/Mac/Intranets Analyst Programmer === Toby Cole Software Engineer Semantico E: toby.c...@semantico.com W: www.semantico.com
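From SolrJ, the escaping helper saves doing this by hand; a sketch, assuming a SolrJ recent enough to have ClientUtils.escapeQueryChars:

    import org.apache.solr.client.solrj.util.ClientUtils;

    String path = "file:///Volumes/spare/ts/ford/schema/data/news/fdw2008/jn71796.xml";
    // escapes ':' and the other query-parser special characters
    String q = "fileAbsolutePath:" + ClientUtils.escapeQueryChars(path);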
Re: UK Solr users meeting?
I know of a few people who'd be interested; we've got quite a few projects using Solr down here in Brighton. On 14 May 2009, at 10:41, Fergus McMenemie wrote: I was wondering if there is any interest in a UK (South East) Solr user group meeting. Please let me know if you are interested. I am happy to organize. Regards, Colin Yes, very interested. I am in Lincolnshire. -- === Fergus McMenemie Email:fer...@twig.me.uk Techmore Ltd Phone:(UK) 07721 376021 Unix/Mac/Intranets Analyst Programmer === Toby Cole Software Engineer Semantico Lees House, Floor 1, 21-23 Dyke Road, Brighton BN1 3FE T: +44 (0)1273 358 238 F: +44 (0)1273 723 232 E: toby.c...@semantico.com W: www.semantico.com
Re: 1.4 Replication
I've not figured out a way to use basic auth with replication. We ended up using IP-based auth; it shouldn't be too tricky to add basic-auth support as, IIRC, the replication is based on the commons-httpclient library. On 27 May 2009, at 15:17, Matthew Gregg wrote: On Wed, 2009-05-27 at 19:06 +0530, Noble Paul നോബിള് नोब्ळ् wrote: On Wed, May 27, 2009 at 6:48 PM, Matthew Gregg wrote: Does replication in 1.4 support passing credentials/basic auth? If not, what is the best option to protect replication? do you mean protecting the url /replication ? Yes, I would like to put /replication behind basic auth, which I can do, but replication fails. I naively tried the obvious http://user:p...@host/replication, but that fails. ideally Solr is expected to run in an unprotected environment. if you wish to introduce some security it has to be built by you. I guess you meant Solr is expected to run in a "protected" environment? It's pretty easy to put basic auth in front of Solr, but the replication infrastructure in 1.4 doesn't seem to support it. Or does it, and I just don't know how? -- Matthew Gregg Toby Cole Software Engineer Semantico Lees House, Floor 1, 21-23 Dyke Road, Brighton BN1 3FE W: www.semantico.com
Re: support for Payload Feature of lucene in solr
As I am new to Solr, I've been trying to explore payloads in Solr but I haven't had any success with that. In one of the threads Grant mentioned that Solr has a DelimitedPayloadTokenFilter which can store payloads at index time. But to search on them we will require an implementation of BoostingTermQuery, extending SpanTermQuery, and possibly other things as well. This looks about the same as the approach I'm about to use for our research. We're looking into using payloads to improve relevance for stemmed terms, using the payload to store the unstemmed term and boosting the term if there's an exact match with the payload. My questions: 1. What will I have to do for this? 2. How will I do it? I mean, even if it means adding some classes and rebuilding the Solr jars: how do I prepare a document for indexing so that it stores payloads, and how do I build my search query to do a payload search? Do we need to add a new RequestHandler for making such custom searches? Please provide some sample code if you have any... -- Cheers Sumit I'm starting work on this in the next few days, I'll let you know how I get on. If anyone else has any experience with payloads in Solr please chip in :) -- Toby Cole Software Engineer, Semantico Limited Registered in England and Wales no. 03841410, VAT no. GB-744614334. Registered office Lees House, 21-23 Dyke Road, Brighton BN1 3FE, UK. Check out all our latest news and thinking on the Discovery blog http://blogs.semantico.com/discovery-blog/
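For the indexing side, a sketch of a schema type using the delimited payload filter Grant mentioned (the type name is made up, and the factory assumes a recent 1.4 build):

    <fieldType name="payloads" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <!-- tokens arrive as "term|payload"; encoder may be float, integer or identity -->
        <filter class="solr.DelimitedPayloadTokenFilterFactory" encoder="identity"/>
      </analyzer>
    </fieldType>

The query side is the part that needs custom code: something wrapping BoostingTermQuery, as Sumit says, most naturally exposed through a QParserPlugin.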
Re: how to get all the docIds in the search result?
Have you tried limiting the fields that you're requesting to just the ID? Something along the lines of: query.setRows(Integer.MAX_VALUE); query.setFields("id"); Might speed the query up a little. On 23 Jul 2009, at 09:11, shb wrote: Here id is indeed the uniqueKey of a document. I want to get all the ids for some other usage. 2009/7/23 Shalin Shekhar Mangar On Thu, Jul 23, 2009 at 1:09 PM, shb wrote: if I use query.setRows(Integer.MAX_VALUE); the query will become very slow, because the searcher will go and fetch the field values in the index for all the returned documents. So if I set query.setRows(10), is there any other way to get all the ids? thanks You should fetch as many rows as you need and not more. Why do you need all the ids? I'm assuming that by id you mean the uniqueKey of a document. -- Regards, Shalin Shekhar Mangar. -- Toby Cole Software Engineer, Semantico Limited Registered in England and Wales no. 03841410, VAT no. GB-744614334. Registered office Lees House, 21-23 Dyke Road, Brighton BN1 3FE, UK. Check out all our latest news and thinking on the Discovery blog http://blogs.semantico.com/discovery-blog/
Re: Multiple Cores Vs. Single Core for the following use case
I've not looked at the filtering for quite a while, but if you're getting lots of similar queries, the filter's caching can play a huge part in speeding up queries. So even if the first query for "paris" was slow, subsequent queries from different users for the same terms will be sped up considerably (especially if you're using the FastLRUCache). If filtering is slow for your queries, why not try simply using a boolean query (i.e., for the example below: "paris AND userId:123")? This would remove the cross-user usefulness of the caches, if I understand them correctly, but may speed up uncached searches. Toby. On 27 Jan 2010, at 15:48, Matthieu Labour wrote: @Marc: Thank you, Marc. This is logic we had to implement in the client application. Will look into applying the patch to replace our home-grown logic. @Trey: I have 1000 users per machine, 1 core per user. Each core is 35,000 documents. Documents are small... each core goes from 100MB to 1.3GB at most. There are 7 types of documents. What I am trying to understand is the search/filter algorithm. If I have 1 core with all documents and I search for "Paris" for userId="123", is Lucene going to first search for all Paris documents and then apply a filter on the userId? If this is the case, then I am better off having a specific index for user "123" because this will be faster. --- On Wed, 1/27/10, Marc Sturlese wrote: From: Marc Sturlese Subject: Re: Multiple Cores Vs. Single Core for the following use case To: solr-user@lucene.apache.org Date: Wednesday, January 27, 2010, 2:22 AM In case you are going to use core per user take a look at this patch: http://wiki.apache.org/solr/LotsOfCores Trey-13 wrote: Hi Matt, In most cases you are going to be better off going with the userid method unless you have a very small number of users and a very large number of docs/user. The userid method will likely be much easier to manage, as you won't have to spin up a new core every time you add a new user. I would start here and see if the performance is good enough for your requirements before you start worrying about it not being efficient. That being said, I really don't have any idea what your data looks like. How many users do you have? How many documents per user? Are any documents shared by multiple users? -Trey On Tue, Jan 26, 2010 at 7:27 PM, Matthieu Labour wrote: Hi, Shall I set up multiple cores or a single core for the following use case: I have X number of users. When I do a search, I always know for which user I am doing the search. Shall I set up X cores, 1 for each user? Or shall I set up 1 core and add a userId field to each document? If I choose the 1-core solution then I am concerned about performance. Let's say I search for "NewYork"... If Lucene returns all "New York" matches for all users and then filters based on the userId, then this is going to be less efficient than if I have sharded per user and send the request for "New York" to the user's core. Thank you for your help matt -- View this message in context: http://old.nabble.com/Multiple-Cores-Vs.-Single-Core-for-the-following-use-case-tp27332288p27335403.html Sent from the Solr - User mailing list archive at Nabble.com. -- Toby Cole Senior Software Engineer, Semantico Limited Registered in England and Wales no. 03841410, VAT no. GB-744614334. Registered office Lees House, 21-23 Dyke Road, Brighton BN1 3FE, UK. Check out all our latest news and thinking on the Discovery blog http://blogs.semantico.com/discovery-blog/
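To make the two options concrete, a SolrJ sketch; the filterCache entry for userId:123 is computed once and then reused by every later query carrying the same filter:

    SolrQuery query = new SolrQuery("paris");
    query.addFilterQuery("userId:123");   // cached separately from the main query

versus the single boolean query discussed above:

    SolrQuery query = new SolrQuery("paris AND userId:123");   // no reusable cache entry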