solr-user@lucene.apache.org
The 3rd turn...
Re: solr-user@lucene.apache.org
Excuse me again... it seems like my mail-provider has changed something. I hope this message won't pend again. Thank you. -- View this message in context: http://n3.nabble.com/solr-user-lucene-apache-org-tp679673p679684.html Sent from the Solr - User mailing list archive at Nabble.com.
How to use Payloads with Solr?
Hello community, while searching for a way to get TermPositions in Solr, I became more aware of the payload features, so I decided to learn more about them. The wiki does not say much about payloads, so I am asking here on the mailing list. As far as I understand, payloads are extra information attached to tokens, which I can customize in any way. For example, I could write a payload filter that gives the highest scoring factor to the first token and the lowest to the last one. I could also say "oh, this word is a noun. Add this as payload information: ". However: how do I use this information at query time? How can I influence the scoring in Solr? I mean, I could write a payload interpreter for scoring (am I right to do so with AveragePayloadFunction from Lucene 2.9.1?). If I do so, I could change the scoring of all nouns without reindexing the payloads by setting their scoring factor in schema.xml (of course this will need some extra modifications). Can anybody tell me more about how to use payloads with Solr? For everyone else who wants to learn the basics of payloads, I suggest reading this article by Grant Ingersoll: http://www.lucidimagination.com/blog/2009/08/05/getting-started-with-payloads/ It is a really good tutorial and introduction to this topic. Unfortunately, it seems he has not written anything about how to integrate this into Solr (I haven't found anything more). Kind regards - Mitch P.S. Sorry for double-posting. I had some problems with my mail account, so some posts were left pending. It seems like they will stay pending for the next few years... but if not, please excuse it.
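To make the idea of per-token payloads a little more concrete, here is a minimal sketch in plain Python. It is not Lucene or Solr code: the "term|weight" input format, parse_payload_tokens and average_payload are hypothetical names chosen for illustration. It only mimics the general idea of attaching a weight to each token at index time and averaging the weights of the matching tokens at query time.

```python
from typing import List, Tuple


def parse_payload_tokens(text: str, delimiter: str = "|") -> List[Tuple[str, float]]:
    """Split whitespace-separated 'term|weight' tokens (hypothetical format).

    Tokens without an explicit weight get a neutral payload of 1.0.
    """
    tokens = []
    for raw in text.split():
        term, sep, weight = raw.partition(delimiter)
        payload = float(weight) if sep and weight else 1.0
        tokens.append((term.lower(), payload))
    return tokens


def average_payload(tokens: List[Tuple[str, float]], term: str) -> float:
    """Average the payloads of every occurrence of `term`; 0.0 if it is absent."""
    weights = [payload for token, payload in tokens if token == term.lower()]
    return sum(weights) / len(weights) if weights else 0.0


# One toy document: "solr" occurs twice with different payloads.
doc = parse_payload_tokens("Solr|4.0 is fast|2.0 and solr|2.0 scales")

print(average_payload(doc, "solr"))     # mean of 4.0 and 2.0
print(average_payload(doc, "is"))       # token without explicit weight
print(average_payload(doc, "missing"))  # term not in the document
```

Changing the weights here means re-parsing the text, which corresponds to reindexing. Mapping a stored label such as "noun" to a factor at query time, as asked above, would need a lookup table applied inside the scoring step instead.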
Re: SolrJ and HTMLStripCharFilterFactory
I think you're getting confused by the difference between indexing and storing. These are orthogonal operations for all they occur in the same definition. When you index something, the input is put through your analyzer chain, and the resulting tokens are stored after all appropriate transformations, which is what you're seeing when you look at your index through the admin panel and report the html is stripped. This is what's searched. But when you fetch a field that has been stored, the original raw text is returned. This is never searched, just kept around for retrieval. The idea here is to be able to have your index contain some displayable text. Think about the title of a book, for instance "The Grapes of Wrath". You want to search it after it's been lower-cased, stop words removed, etc. But if you wanted to present it to a user, you sure wouldn't want to display "grapes wrath" which might be the tokens after lowercasing and removing stopwords.. HTH Erick On Sat, Mar 27, 2010 at 1:13 AM, Indika Tantrigoda wrote: > Hello to all, > > I've been working with Solr for a few weeks and I have gotten indexing and > searching to work. > However I am having trouble with indexing HTML content and using > HTMLStripCharFilterFactory. > > My schema.xml looks like this > > > > > -- > /> > > and I am indexing the HTML content using SolrJ as the client (with Spring > being the framework). > > However when I do a search for all documents, the HTML content is also in > my > text field. > > But when I did an analysis using the Solr admin panel with HTML content it > shows the tokens extracted > properly with HTML tags removed. > > I found a similar issue at > http://www.mail-archive.com/solr-user@lucene.apache.org/msg28736.html > but I am still unable to get it working. I am using Solr 1.4 > > Any help regarding this is this much appreciated. > > Thanks in advance. > > Regards, > Indika >
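The indexed-versus-stored distinction Erick describes can be sketched with a toy index in plain Python. This is an illustration only, not how Lucene is implemented: ToyIndex, analyze and the tiny stopword list are hypothetical, and the regular expression is a crude stand-in for HTMLStripCharFilterFactory.

```python
import re

STOPWORDS = {"the", "of", "a", "an", "and"}  # tiny illustrative list


def analyze(text):
    """Toy analyzer chain: strip tags, lowercase, split, drop stopwords."""
    without_tags = re.sub(r"<[^>]+>", " ", text)
    return [w for w in without_tags.lower().split() if w not in STOPWORDS]


class ToyIndex:
    def __init__(self):
        self.postings = {}  # analyzed term -> set of doc ids (what is searched)
        self.stored = {}    # doc id -> raw original text (what is returned)

    def add(self, doc_id, text):
        self.stored[doc_id] = text  # stored value is never analyzed
        for term in analyze(text):
            self.postings.setdefault(term, set()).add(doc_id)

    def search(self, query):
        terms = analyze(query)
        if not terms:
            return []
        hits = set.intersection(*[self.postings.get(t, set()) for t in terms])
        return [self.stored[doc_id] for doc_id in sorted(hits)]


index = ToyIndex()
index.add(1, "The Grapes of Wrath")
index.add(2, "<p>Solr <b>rocks</b></p>")

print(analyze("The Grapes of Wrath"))  # the tokens that are searched
print(index.search("grapes WRATH"))    # returns the original title
print(index.search("solr"))            # returns the raw HTML, tags included
```

The last call mirrors what Indika observed: the search matches on the stripped tokens, but the value handed back is the raw stored text. If clean text is wanted in results, the markup has to be removed before the document is sent to Solr.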
Solr 1.4 bug? search fails but analyzer indicates a match
Ran into an odd situation today while searching for a string that looks like a domain name containing a '.': the Solr 1.4 analyzer tells me that I will get a match, but when I enter the search either in the client or directly in Solr, the search fails. Our default handler is dismax, but this also fails with the standard handler. So I'm wondering if this is a known issue, or am I missing something subtle in the analysis chain? Solr is 1.4.0 that I built. test string: Identi.ca queries that fail: IdentiCa, Identi.ca, Identi-ca query that matches: Identi ca I would expect all the queries that fail to match. Looking at the schema browser, the index contains the expected terms: identica, identi, ca schema in use is: http://drupalcode.org/viewvc/drupal/contributions/modules/apachesolr/schema.xml?revision=1.1.2.1.2.34&content-type=text%2Fplain&view=co&pathrev=DRUPAL-6--1 Screen shots: analysis: http://img.skitch.com/20100327-nt1uc1ctykgny28n8bgu99h923.png dismax search: http://img.skitch.com/20100327-byiduuiry78caka7q5smsw7fp.png dismax search: http://img.skitch.com/20100327-gckm8uhjx3t7px31ygfqc2ugdq.png standard search: http://img.skitch.com/20100327-usqyqju1d12ymcpb2cfbtdwyh.png -- Peter M. Wolanin, Ph.D. Momentum Specialist, Acquia. Inc. peter.wola...@acquia.com
Re: SolrJ and HTMLStripCharFilterFactory
Hi Erick, Thank you very much for the explanation. The example you gave made things clear. I ran some queries with my existing index and the results were as expected. Regards, Indika On 27 March 2010 17:09, Erick Erickson wrote: > I think you're getting confused by the difference between indexing and > storing. These are orthogonal operations for all they occur in the same > definition. > > When you index something, the input is put through your analyzer chain, and > the resulting tokens are stored after all appropriate transformations, > which > is what you're seeing when you look at your index through the admin panel > and report the html is stripped. This is what's searched. > > But when you fetch a field that has been stored, the original raw text is > returned. This is never searched, just kept around for retrieval. > > The idea here is to be able to have your index contain some displayable > text. Think about the title of a book, for instance "The Grapes of Wrath". > You want to search it after it's been lower-cased, stop words removed, etc. > But if you wanted to present it to a user, you sure wouldn't want to > display > "grapes wrath" which might be the tokens after lowercasing and removing > stopwords.. > > HTH > Erick > > On Sat, Mar 27, 2010 at 1:13 AM, Indika Tantrigoda >wrote: > > > Hello to all, > > > > I've been working with Solr for a few weeks and I have gotten indexing > and > > searching to work. > > However I am having trouble with indexing HTML content and using > > HTMLStripCharFilterFactory. > > > > My schema.xml looks like this > > > > positionIncrementGap="100"> > > > > > > -- > > /> > > > > and I am indexing the HTML content using SolrJ as the client (with Spring > > being the framework). > > > > However when I do a search for all documents, the HTML content is also in > > my > > text field. 
> > > > But when I did an analysis using the Solr admin panel with HTML content > it > > shows the tokens extracted > > properly with HTML tags removed. > > > > I found a similar issue at > > http://www.mail-archive.com/solr-user@lucene.apache.org/msg28736.html > > but I am still unable to get it working. I am using Solr 1.4 > > > > Any help regarding this is this much appreciated. > > > > Thanks in advance. > > > > Regards, > > Indika > > >
Re: DIH best pratices question
On Sat, Mar 27, 2010 at 3:25 AM, Blargy wrote: > > I have a items table on db1 and and item_descriptions table on db2. > > The items table is very small in the sense that it has small columns while > the item_descriptions table has a very large text field column. Both tables > are around 7 million rows > > What is the best way to import these into one document? > > > > > > > this is the right way > Or > > > > > > > > Or is there an alternative way? Maybe using the second way with a > CachedSqlEntityProcessor for the item entity? I don't think CachedSqlEntityProcessor helps here. > > Any thoughts are greatly appreciated. Thanks! > -- > View this message in context: > http://n3.nabble.com/DIH-best-pratices-question-tp677568p677568.html > Sent from the Solr - User mailing list archive at Nabble.com. > -- - Noble Paul | Systems Architect| AOL | http://aol.com
Re: expungeDeletes on commit in Dataimport
On Thu, Mar 25, 2010 at 10:14 PM, Ruben Chadien wrote: > Hi > > I know this has been discussed before, but is there any way to do > expungeDeletes=true when the DataImportHandler does the commit? Not using expungeDeletes=true does not mean that the document does not get deleted. deleteDocByQuery does not do a commit; if you wish to commit, you should do it explicitly. > I am using the deleteDocByQuery in a Transformer when doing a delta-import > and as discussed before the documents are not deleted until restart. > > Also, how do I know in a Transformer if it's running a delta or full import? > I tried looking at Context.currentProcess() but that gives me "FULL_DUMP" > when doing a delta import...? The variable ${dataimporter.request.command} tells you which command is being run. > > Thanks! > Ruben Chadien -- - Noble Paul | Systems Architect| AOL | http://aol.com
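As a sketch of what "commit explicitly" could look like from a client, the following Python snippet builds the XML commit message and posts it to the update handler. The URL in the comment is an assumption (the default example port and path), and build_commit_xml and send_commit are hypothetical helper names, not part of Solr or SolrJ.

```python
import urllib.request


def build_commit_xml(expunge_deletes: bool = False) -> str:
    """Return the XML update message for a commit."""
    if expunge_deletes:
        return '<commit expungeDeletes="true"/>'
    return "<commit/>"


def send_commit(update_url: str, expunge_deletes: bool = False, timeout: int = 30) -> int:
    """POST the commit message to a Solr update handler; returns the HTTP status.

    update_url is an assumption, e.g. "http://localhost:8983/solr/update".
    """
    body = build_commit_xml(expunge_deletes).encode("utf-8")
    request = urllib.request.Request(
        update_url,
        data=body,
        headers={"Content-Type": "text/xml; charset=utf-8"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


if __name__ == "__main__":
    # Only prints the message; call send_commit(...) against a running server.
    print(build_commit_xml(expunge_deletes=True))
```

Such a call would be issued after the delta-import has finished, since the import itself does not pass expungeDeletes along.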
Re: ReplicationHandler reports incorrect replication failures
please create a bug On Fri, Mar 26, 2010 at 7:29 PM, Shawn Smith wrote: > We're using Solr 1.4 Java replication, which seems to be working > nicely. While writing production monitors to check that replication > is healthy, I think we've run into a bug in the status reporting of > the "../solr/replication?command=details" command. (I know it's > experimental...) > > Our monitor parses the replication?command=details XML and checks that > replication lag is reasonable by diffing the indexVersion of the > master and slave indices to make sure it's within a reasonable time > range. > > Our monitor also compares the first elements of > "indexReplicatedAtList" and "replicationFailedAtList" lists to see if > the last replication attempt failed. This is where we're having a > problem with the monitor throwing false errors. It looks like there's > a bug that causes successful replications to be considered failures. > The bug is triggered immediately after a slave restarts when the slave > is already in sync with the master. Each no-op replication attempt > after restart is considered a failure until something on the master > changes and replication has to actually do work. > > From the code, it looks like "SnapPuller.successfulInstall" starts out > false on restart. If the slave starts out in sync with the master, > then each no-op replication poll leaves "successfulInstall" set to > false which makes SnapPuller.logReplicationTimeAndConfFiles log the > poll as a failure. SnapPuller.successfulInstall stays false until the > first time replication actually has to do something, at which point it > gets set to true, and then everything is OK. > > Thanks, > Shawn > -- - Noble Paul | Systems Architect| AOL | http://aol.com
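The behaviour Shawn describes can be modelled in a few lines of Python. This is a toy reconstruction from his description only, not the actual SnapPuller code: ToySlave and its fields are hypothetical, and it assumes that every poll is recorded in the "replicated at" list.

```python
class ToySlave:
    """Toy model of the reported status bookkeeping after a slave restart."""

    def __init__(self, index_version):
        self.index_version = index_version
        self.successful_install = False  # starts out false after a restart
        self.replicated_at = []          # newest first, like indexReplicatedAtList
        self.failed_at = []              # newest first, like replicationFailedAtList

    def poll(self, master_version, now):
        if master_version != self.index_version:
            # Real work was needed: install the new index and set the flag.
            self.index_version = master_version
            self.successful_install = True
        # A no-op poll leaves the flag untouched, so right after a restart
        # it is still false and the poll is recorded as a failure.
        self.replicated_at.insert(0, now)
        if not self.successful_install:
            self.failed_at.insert(0, now)


slave = ToySlave(index_version=5)
slave.poll(master_version=5, now=1)  # in sync: no-op, logged as failure
slave.poll(master_version=5, now=2)  # still in sync: logged as failure again
slave.poll(master_version=6, now=3)  # master changed: real replication
slave.poll(master_version=6, now=4)  # no-op, but the flag is now true

print(slave.replicated_at)
print(slave.failed_at)
```

In this model a monitor that compares the heads of the two lists reports a failure for every poll until the master first changes, which matches the false alarms described above.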
Re: Solr 1.4 bug? search fails but analyzer indicates a match
Hi Peter, have you tried reindexing your data, and did you do a commit? If you changed anything, have you restarted your Solr server? I can't understand why this problem occurs, since the example seems to work in analysis.jsp. Kind regards - Mitch
Re: get synonyms
Have a look at the example folder of your Solr installation. It contains a conf directory with an example synonyms.txt file. Try to understand why it is structured the way it is, and add the synonyms you need for your use case. Which synonyms you want to use depends on what you want to do with your Solr server, so there won't be any out-of-the-box synonym list that fits all your needs. - Mitch
Re: Solr 1.4 bug? search fails but analyzer indicates a match
Hi Mitch, I am also seeing this locally with the exact same solr.war, solrconfig.xml, and schema.xml running under Jetty, as well as on 2 different production servers with the same content indexed. So this is really weird - this seems to be influenced by the surrounding text: "would be great to have support for Identi.ca on the follow block" fails to match "Identi.ca", but putting the content on its own or in another sentence: "Support Identi.ca" the search matches. More testing suggests the word "for" is the problem. I don't see an exception or error. Could be a problem with how stopwords are removed? -Peter On Sat, Mar 27, 2010 at 1:19 PM, MitchK wrote: > > Hi Peter, > > have you tried to reindex your data and did you do a commit? > If you changed anything, have you restarted your Solr-server? > > I can't understand why this problem occurs, since the example seem to work > at analysis.jsp. > > Kind regards > - Mitch > -- > View this message in context: > http://n3.nabble.com/Solr-1-4-bug-search-fails-but-analyzer-indicates-a-match-tp680066p680313.html > Sent from the Solr - User mailing list archive at Nabble.com. > -- Peter M. Wolanin, Ph.D. Momentum Specialist, Acquia. Inc. peter.wola...@acquia.com
Re: Solr 1.4 bug? search fails but analyzer indicates a match
If I empty the stopword file and re-index, all expected matches happen. So maybe that provides a further suggestion of where the problem is. This certainly feels like a Solr bug (or lucene bug?). -Peter On Sat, Mar 27, 2010 at 3:05 PM, Peter Wolanin wrote: > Hi Mitch, > > I am also seeing this locally with the exact same solr.war, > solrconfig.xml, and schema.xml running under Jetty, as well as on 2 > different production servers with the same content indexed. > > So this is really weird - this seems to be influenced by the surrounding text: > > "would be great to have support for Identi.ca on the follow block" > > fails to match "Identi.ca", but putting the content on its own or in > another sentence: > > "Support Identi.ca" > > the search matches. More testing suggests the word "for" is the > problem. I don't see an exception or error. Could be a problem with > how stopwords are removed? > > -Peter > > > On Sat, Mar 27, 2010 at 1:19 PM, MitchK wrote: >> >> Hi Peter, >> >> have you tried to reindex your data and did you do a commit? >> If you changed anything, have you restarted your Solr-server? >> >> I can't understand why this problem occurs, since the example seem to work >> at analysis.jsp. >> >> Kind regards >> - Mitch >> -- >> View this message in context: >> http://n3.nabble.com/Solr-1-4-bug-search-fails-but-analyzer-indicates-a-match-tp680066p680313.html >> Sent from the Solr - User mailing list archive at Nabble.com. >> > > > > -- > Peter M. Wolanin, Ph.D. > Momentum Specialist, Acquia. Inc. > peter.wola...@acquia.com > -- Peter M. Wolanin, Ph.D. Momentum Specialist, Acquia. Inc. peter.wola...@acquia.com
Re: Solr 1.4 bug? search fails but analyzer indicates a match
Peter, if you are right, please comment out the stopword filter to confirm that the problem really lies in how the stopword filter removes stopwords. Is the output correct if you enter "would be great to have support for Identi.ca on the follow block" in the query field of analysis.jsp? Can you make a screenshot for this sentence? - Mitch
Re: Solr 1.4 bug? search fails but analyzer indicates a match
The output on the analysis screen does look correct. Here are 2 screen shots: empty stopwords: http://img.skitch.com/20100327-rcsjdih4bn3y8ahajqa5wjwybd.png standard stopwords: http://img.skitch.com/20100327-1w5ct1wr25jkir4sji8kumefn1.png -Peter On Sat, Mar 27, 2010 at 4:13 PM, MitchK wrote: > > Peter, > > if you are right, please outcomment the stopword filter to make clear, that > the problem is really a problem of how the stopword filter deletes > stopwords. > > Is the output correct, if you enter "would be great to have support for > Identi.ca on the follow block" in the query-label at the analysis.jsp? Can > you make a screenshot for this sentence? > > - Mitch > -- > View this message in context: > http://n3.nabble.com/Solr-1-4-bug-search-fails-but-analyzer-indicates-a-match-tp680066p680530.html > Sent from the Solr - User mailing list archive at Nabble.com. > -- Peter M. Wolanin, Ph.D. Momentum Specialist, Acquia. Inc. peter.wola...@acquia.com
Re: Solr 1.4 bug? search fails but analyzer indicates a match
The stopwords stanza looks like: <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" /> which is the same as in the example schema http://svn.apache.org/repos/asf/lucene/solr/branches/branch-1.4/example/solr/conf/schema.xml Changing this to enablePositionIncrements="false" seems to make the searching work as expected. Is it incorrect to have that directive here, or is this a bug? -Peter On Sat, Mar 27, 2010 at 4:25 PM, Peter Wolanin wrote: > The output on the analysis screen does look correct. Here are 2 screen shots: > > empty stopwords: http://img.skitch.com/20100327-rcsjdih4bn3y8ahajqa5wjwybd.png > > standard stopwords: > http://img.skitch.com/20100327-1w5ct1wr25jkir4sji8kumefn1.png > > -Peter > > On Sat, Mar 27, 2010 at 4:13 PM, MitchK wrote: >> >> Peter, >> >> if you are right, please outcomment the stopword filter to make clear, that >> the problem is really a problem of how the stopword filter deletes >> stopwords. >> >> Is the output correct, if you enter "would be great to have support for >> Identi.ca on the follow block" in the query-label at the analysis.jsp? Can >> you make a screenshot for this sentence? >> >> - Mitch >> -- >> View this message in context: >> http://n3.nabble.com/Solr-1-4-bug-search-fails-but-analyzer-indicates-a-match-tp680066p680530.html >> Sent from the Solr - User mailing list archive at Nabble.com. >> > > > > -- > Peter M. Wolanin, Ph.D. > Momentum Specialist, Acquia. Inc. > peter.wola...@acquia.com > -- Peter M. Wolanin, Ph.D. Momentum Specialist, Acquia. Inc. peter.wola...@acquia.com
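For readers unfamiliar with what enablePositionIncrements controls, here is a small Python sketch of the general idea. It is a toy model, not Solr's StopFilter, and it does not claim to reproduce the failure discussed in this thread, whose cause had not been established at this point. stop_filter and phrase_match are hypothetical helpers.

```python
def stop_filter(tokens, stopwords, enable_position_increments=True):
    """Drop stopwords and return (term, position) pairs.

    With position increments enabled, each removed stopword leaves a gap,
    so the next kept token is pushed one position further along.
    """
    kept = []
    position = 0
    skipped = 0
    for term in tokens:
        if term in stopwords:
            skipped += 1
            continue
        position += 1 + (skipped if enable_position_increments else 0)
        kept.append((term, position))
        skipped = 0
    return kept


def phrase_match(indexed, phrase):
    """True if the phrase terms occur at consecutive positions."""
    positions = {}
    for term, pos in indexed:
        positions.setdefault(term, set()).add(pos)
    first, *rest = phrase
    return any(
        all(start + offset + 1 in positions.get(term, ()) for offset, term in enumerate(rest))
        for start in positions.get(first, ())
    )


tokens = "support for identi ca".split()
with_gaps = stop_filter(tokens, {"for"}, enable_position_increments=True)
without_gaps = stop_filter(tokens, {"for"}, enable_position_increments=False)

print(with_gaps)     # "identi" keeps position 3, leaving a hole at 2
print(without_gaps)  # positions are renumbered without a hole
print(phrase_match(with_gaps, ["support", "identi"]))
print(phrase_match(without_gaps, ["support", "identi"]))
```

In this model the phrase "identi ca" matches under both settings, because the two tokens stay adjacent either way. Only phrases that span the removed stopword are affected by the gap.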
Re: Solr 1.4 bug? search fails but analyzer indicates a match
Discussing this with Mark Miller in IRC - we are honing in on the problem. Looks as though Identi.ca is treated as phrase query as if I had quoted it like "Identi ca". That phrase search also fails. I had expected that Identi.ca would be the same as Identi ca (i.e. 2 separate tokens, not a phrase). -Peter On Sat, Mar 27, 2010 at 4:32 PM, Peter Wolanin wrote: > The stopwords stanza looks like: > > ignoreCase="true" > words="stopwords.txt" > enablePositionIncrements="true" > /> > > Which is the same as the example schema > http://svn.apache.org/repos/asf/lucene/solr/branches/branch-1.4/example/solr/conf/schema.xml > > changing this to enablePositionIncrements="false" seems to make the > searching work as expected. Is it incorrect to have that directive > here, or is this a bug? > > -Peter > > > On Sat, Mar 27, 2010 at 4:25 PM, Peter Wolanin > wrote: >> The output on the analysis screen does look correct. Here are 2 screen shots: >> >> empty stopwords: >> http://img.skitch.com/20100327-rcsjdih4bn3y8ahajqa5wjwybd.png >> >> standard stopwords: >> http://img.skitch.com/20100327-1w5ct1wr25jkir4sji8kumefn1.png >> >> -Peter >> >> On Sat, Mar 27, 2010 at 4:13 PM, MitchK wrote: >>> >>> Peter, >>> >>> if you are right, please outcomment the stopword filter to make clear, that >>> the problem is really a problem of how the stopword filter deletes >>> stopwords. >>> >>> Is the output correct, if you enter "would be great to have support for >>> Identi.ca on the follow block" in the query-label at the analysis.jsp? Can >>> you make a screenshot for this sentence? >>> >>> - Mitch >>> -- >>> View this message in context: >>> http://n3.nabble.com/Solr-1-4-bug-search-fails-but-analyzer-indicates-a-match-tp680066p680530.html >>> Sent from the Solr - User mailing list archive at Nabble.com. >>> >> >> >> >> -- >> Peter M. Wolanin, Ph.D. >> Momentum Specialist, Acquia. Inc. >> peter.wola...@acquia.com >> > > > > -- > Peter M. Wolanin, Ph.D. > Momentum Specialist, Acquia. Inc. 
> peter.wola...@acquia.com > -- Peter M. Wolanin, Ph.D. Momentum Specialist, Acquia. Inc. peter.wola...@acquia.com
Re: Solr 1.4 bug? search fails but analyzer indicates a match
Created a new issue: https://issues.apache.org/jira/browse/SOLR-1852 further discussion there. -Peter On Sat, Mar 27, 2010 at 5:51 PM, Peter Wolanin wrote: > Discussing this with Mark Miller in IRC - we are honing in on the problem. > > Looks as though Identi.ca is treated as phrase query as if I had > quoted it like "Identi ca". That phrase search also fails. I had > expected that Identi.ca would be the same as Identi ca (i.e. 2 > separate tokens, not a phrase). > > -Peter > > On Sat, Mar 27, 2010 at 4:32 PM, Peter Wolanin > wrote: >> The stopwords stanza looks like: >> >> > ignoreCase="true" >> words="stopwords.txt" >> enablePositionIncrements="true" >> /> >> >> Which is the same as the example schema >> http://svn.apache.org/repos/asf/lucene/solr/branches/branch-1.4/example/solr/conf/schema.xml >> >> changing this to enablePositionIncrements="false" seems to make the >> searching work as expected. Is it incorrect to have that directive >> here, or is this a bug? >> >> -Peter >> >> >> On Sat, Mar 27, 2010 at 4:25 PM, Peter Wolanin >> wrote: >>> The output on the analysis screen does look correct. Here are 2 screen >>> shots: >>> >>> empty stopwords: >>> http://img.skitch.com/20100327-rcsjdih4bn3y8ahajqa5wjwybd.png >>> >>> standard stopwords: >>> http://img.skitch.com/20100327-1w5ct1wr25jkir4sji8kumefn1.png >>> >>> -Peter >>> >>> On Sat, Mar 27, 2010 at 4:13 PM, MitchK wrote: >>>> >>>> Peter, >>>> >>>> if you are right, please outcomment the stopword filter to make clear, that >>>> the problem is really a problem of how the stopword filter deletes >>>> stopwords. >>>> >>>> Is the output correct, if you enter "would be great to have support for >>>> Identi.ca on the follow block" in the query-label at the analysis.jsp? Can >>>> you make a screenshot for this sentence? 
>>>> >>>> - Mitch >>>> -- >>>> View this message in context: >>>> http://n3.nabble.com/Solr-1-4-bug-search-fails-but-analyzer-indicates-a-match-tp680066p680530.html >>>> Sent from the Solr - User mailing list archive at Nabble.com. >>>> >>> >>> >>> >>> -- >>> Peter M. Wolanin, Ph.D. >>> Momentum Specialist, Acquia. Inc. >>> peter.wola...@acquia.com >>> >> >> >> >> -- >> Peter M. Wolanin, Ph.D. >> Momentum Specialist, Acquia. Inc. >> peter.wola...@acquia.com >> > > > > -- > Peter M. Wolanin, Ph.D. > Momentum Specialist, Acquia. Inc. > peter.wola...@acquia.com > -- Peter M. Wolanin, Ph.D. Momentum Specialist, Acquia. Inc. peter.wola...@acquia.com
jmap output help
Hi everyone,

The output of "jmap -histo:live 27959 | head -30" is something like the following:

 num     #instances         #bytes  class name
----------------------------------------------
   1:        448441      180299464  [C
   2:          5311      135734480  [I
   3:          3623       68389720  [B
   4:        445669       17826760  java.lang.String
   5:        391739       15669560  org.apache.lucene.index.TermInfo
   6:        417442       13358144  org.apache.lucene.index.Term
   7:         58767        5171496  org.apache.lucene.index.FieldsReader$LazyField
   8:         32902        5049760
   9:         32902        3955920
  10:          2843        3512688
  11:          2397        3128048  [Lorg.apache.lucene.index.Term;
  12:            35        3053592  [J
  13:             3        3044288  [Lorg.apache.lucene.index.TermInfo;
  14:         55671        2707536
  15:         27282        2701352  [Ljava.lang.Object;
  16:          2843        2212384
  17:          2343        2132224
  18:         26424        1056960  java.util.ArrayList
  19:         16423        1051072  java.util.LinkedHashMap$Entry
  20:          2039        1028944
  21:         14336         917504  org.apache.lucene.document.Field
  22:         29587         710088  java.lang.Integer
  23:          3171         583464  java.lang.Class
  24:           813         492880  [Ljava.util.HashMap$Entry;
  25:          8471         474376  org.apache.lucene.search.PhraseQuery
  26:          4184         402848  [[I
  27:          4277         380704  [S

Is it ok to assume that the top 3 entries (character/integer/byte arrays) are referring to the entries inside the solr cache?

Thanks,
-- - Siddhant
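When reading a histogram like this, it can help to parse it and total the primitive arrays. The Python sketch below does that for the first four rows of the output above. parse_histogram and readable are hypothetical helper names. The histogram lists classes and sizes only, so by itself it does not say which objects (a cache or anything else) hold those arrays.

```python
import re

# One histogram row: rank, instance count, byte count, class name.
ROW = re.compile(r"^\s*(\d+):\s*(\d+)\s+(\d+)\s+(\S+)\s*$")

# JVM internal names for primitive arrays.
PRIMITIVE_ARRAYS = {
    "[C": "char[]",
    "[I": "int[]",
    "[B": "byte[]",
    "[J": "long[]",
    "[S": "short[]",
}


def parse_histogram(text):
    """Return (class_name, instances, bytes) for every well-formed row.

    Rows without a class name are skipped.
    """
    rows = []
    for line in text.splitlines():
        match = ROW.match(line)
        if match:
            _, instances, size, class_name = match.groups()
            rows.append((class_name, int(instances), int(size)))
    return rows


def readable(class_name):
    """Translate JVM array descriptors into source-level names."""
    return PRIMITIVE_ARRAYS.get(class_name, class_name)


SAMPLE = """\
   1:        448441      180299464  [C
   2:          5311      135734480  [I
   3:          3623       68389720  [B
   4:        445669       17826760  java.lang.String
"""

rows = parse_histogram(SAMPLE)
array_bytes = sum(size for name, _, size in rows if name in PRIMITIVE_ARRAYS)

for name, instances, size in rows:
    print(f"{readable(name):<20} {instances:>10} {size / 2**20:>10.1f} MiB")
print(f"primitive arrays total: {array_bytes} bytes")
```

Finding out what references the large arrays would take a heap dump and a heap analysis tool rather than the histogram alone.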