help needed on solr-uima integration
Hi, after googling online, some parts of the "puzzle" are still missing. The best thing would be a simple example showing the whole process. Is there any example like apache-uima/examples/descriptors/tutorial/ex3 (RoomNumber and DateTime) integrated into Solr? In particular, how do I feed "text" that has at least two fields into Solr for indexing? Thanks, Xue-Feng
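For reference, feeding a document with more than one field into Solr for indexing is just a matter of posting update XML to the /update handler. A minimal sketch, assuming the schema defines fields named "id" and "text" (adjust the names to your own schema):

  curl http://localhost:8983/solr/update -H 'Content-Type: text/xml' --data-binary '
  <add>
    <doc>
      <field name="id">doc1</field>
      <field name="text">Meet in room HAW GN-K35 at 3:00 PM on Nov 18.</field>
    </doc>
  </add>'
  curl http://localhost:8983/solr/update -H 'Content-Type: text/xml' --data-binary '<commit/>'

The idea behind the solr-uima contrib is that its update request processor then runs the UIMA annotators (such as RoomNumber and DateTime) on the incoming text and maps their annotations onto additional fields before the document is indexed.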
Re: Implement Custom Soundex
Momo, if you have the conversion from text to tokens then all you need to do is implement a custom analyzer, deploy it inside the solr webapp, then plug it into the schema. Is that the part that is hard? I thought the wiki was helpful there but maybe some other issue is holding you. One zoology of such analyzers is at: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters If that is the issue, here's a one-sentence explanation: if you have a new analyzer you want to declare a new field-type and field with that analyzer; queries should be going through it as well as indexing. Matching word A with word B will then happen if word A and B are converted by your analyzer to the same token (this is how cat and cats match when using the PorterStemmer, for example). paul
On 16 Oct 2011, at 14:09, Momo..Lelo .. wrote: > > Dear Gora, > > Thank you for the quick response. > > Actually I > need to do Soundex for the Arabic language. The code is already done in Java. But > I > couldn't understand how I can implement it as a Solr filter. > > Regards, > > > >> From: g...@mimirtech.com >> Date: Sun, 16 Oct 2011 16:19:48 +0530 >> Subject: Re: Implement Custom Soundex >> To: solr-user@lucene.apache.org >> >> 2011/10/16 Momo..Lelo .. : >>> >>> Dear, >>> >>> Does anyone there have experience of developing a custom Soundex? >>> >>> If you have experience doing this and can offer some help and share >>> experience I'd really appreciate it. >> >> I presume that this is in the context of Solr, and spell-checking. >> We did this as an exercise for Indian-language words transliterated >> into English, hooking into the open-source spell-checking library, >> aspell, which provided us with a soundex-like algorithm (the actual >> algorithm is quite different, but works better than soundex, at >> least for our use case). We were quite satisfied with the results, >> though unfortunately this never went into production. >> >> Would be glad to help, though I am going to be really busy the >> next few days. Please do provide us with more details on your >> requirements. >> >> Regards, >> Gora >
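On the schema side, a minimal sketch of what Paul describes (the filter factory name here is hypothetical; it would be a thin wrapper you write around your existing Java Soundex code):

  <fieldType name="text_soundex" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="com.example.ArabicSoundexFilterFactory"/>
    </analyzer>
  </fieldType>
  <field name="name_soundex" type="text_soundex" indexed="true" stored="false"/>

Because only one <analyzer> is declared, the same chain is used at both index and query time, so two words match exactly when your filter reduces them to the same token.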
RE: Implement Custom Soundex
thank you for this information. > Subject: Re: Implement Custom Soundex > From: p...@hoplahup.net > Date: Sun, 23 Oct 2011 10:58:49 +0200 > To: solr-user@lucene.apache.org > > Momo, > > if you have the conversion text to tokens then all you need to do is > implement a custom analyzer, deploy it inside the solr webapp, then plug it > into the schema. > > Is that the part that is hard? > I thought the wiki was helpful there but may some other issue is holding you. > One zoology of such analyzers is at: > http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters > > If that is the issue, here's a one sentence explanation: if you have a new > analyzer you want to declare a new field-type and field with that analyzer; > queries should be going through it as well as indexing. Matching word A with > word B will then happen if word A and B are converted by your analyzer to the > same token (this is how cat and cats match when using the PorterStemmer for > example). > > paul > > > Le 16 oct. 2011 à 14:09, Momo..Lelo .. a écrit : > > > > > Dear Gora, > > > > Thank you for the quick response. > > > > Actually I > > need to do Soundex for Arabic language. The code is already done in Java. > > But I > > couldn't understand how can I implement it as Solr filter. > > > > Regards, > > > > > > > >> From: g...@mimirtech.com > >> Date: Sun, 16 Oct 2011 16:19:48 +0530 > >> Subject: Re: Implement Custom Soundex > >> To: solr-user@lucene.apache.org > >> > >> 2011/10/16 Momo..Lelo .. : > >>> > >>> Dear, > >>> > >>> Does anyone there has an experience of developing a custom Soundex. > >>> > >>> If you have an experience doing this and can offer some help and share > >>> experience I'd really appreciate it. > >> > >> I presume that this is in the context of Solr, and spell-checking. > >> We did this as an exercise for Indian-language words transliterated > >> into English, hooking into the open-source spell-checking library, > >> aspell, which provided us with a soundex-like algorithm (the actual > >> algorithm is quite different, but works better than soundex, at > >> least for our use case). We were quite satisfied with the results, > >> though unfortunately this never went into production. > >> > >> Would be glad to help, though I am going to be really busy the > >> next few days. Please do provide us with more details on your > >> requirements. > >> > >> Regards, > >> Gora > > >
Update document field with solrj
I want to edit a document field in Solr, for example edit the author name, so I use the following code in SolrJ: params.set("literal.author","anaconda") But the author field is multiValued="true" in the schema, and because of that "anaconda" does not replace the previous name; it is added to the end of the author values. Also, if I omit the multiValued attribute or set it to false, a bad request exception happens when re-indexing the file with the new author field. How can I solve this problem and delete or modify the previous document field in SolrJ? Or is there some config I am missing in the schema? thanks -- View this message in context: http://lucene.472066.n3.nabble.com/Update-document-field-with-solrj-tp3445488p3445488.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: questions about autocommit & committing documents
Could someone explain to me the different use cases when both or only one of the autoCommit parameters is filled in? I really need to understand it. For example, with these configurations:

<autoCommit>
  <maxDocs>1</maxDocs>
</autoCommit>

or

<autoCommit>
  <maxTime>1000</maxTime>
</autoCommit>

or

<autoCommit>
  <maxDocs>1</maxDocs>
  <maxTime>1000</maxTime>
</autoCommit>

Thanks to everyone -- View this message in context: http://lucene.472066.n3.nabble.com/questions-about-autocommit-committing-documents-tp1582487p3445607.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Selective Result Grouping
> The current grouping functionality using group.field is basically > all-or-nothing: all documents will be grouped by the field value or none > will. So there would be no way to, for example, collapse just the videos or > images like they do in google. When using the group.field option, values must be the same, otherwise they don't get grouped together. Maybe fuzzy grouping would be nice. Grouping videos and images based on mimetype should be easy, right? Videos have a mimetype that starts with video/ and images have a mimetype that starts with image/. Storing the mime type's type and subtype in separate fields and grouping on the type field would do the job. Of course you need to know the mimetype during indexing, but solutions like Apache Tika can do that for you. -- Kind regards, Martijn van Groningen
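A sketch of what that could look like (field names are illustrative):

  <field name="mime_type"    type="string" indexed="true" stored="true"/>   <!-- e.g. "video", "image" -->
  <field name="mime_subtype" type="string" indexed="true" stored="true"/>   <!-- e.g. "mp4", "png" -->

and then group only on the coarse type at query time:

  ...&q=...&group=true&group.field=mime_type&group.limit=3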
Re: Solr indexing plugin: skip single faulty document?
Some work has been done in this general area, see SOLR-445. That might give you some pointers Best Erick On Mon, Oct 17, 2011 at 11:00 AM, samuele.mattiuzzo wrote: > Hi all, as far as i know, when solr finds a faulty document (inside an xml > containing let say 1000 docs) it skips the whole file and the indexing > process exits with exception (am i correct?) > > I'm using a custom indexing plugin, and i can trap the exception. Instead of > using "default" values if that exception is raised, i would like to skip the > document raising the error (example: sometimes i try to insert a string > inside a "string" field, but solr exits saying it's expecting a multiValued > field... i guess it's because of some ascii chars within the text, something > like \n or sort...) maybe logging it somewhere, and pass to the next one. > We're indexing millions of them, and we don't care much if we loose 10-20% > of them, so the best solution is skip the single faulty doc and continue > with the rest. > > I guess i have to work on the super.processAdd() call, but i don't know > where i can find info about it. Can anybody help me? Is there a book talking > about advanced solr plugin developement i could read? > > Thanks! > > -- > View this message in context: > http://lucene.472066.n3.nabble.com/Solr-indexing-plugin-skip-single-faulty-document-tp3427646p3427646.html > Sent from the Solr - User mailing list archive at Nabble.com. >
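Regarding the super.processAdd() question: the usual pattern is simply to wrap that call in a try/catch inside your custom processor so one bad document doesn't abort the whole batch. A rough sketch of the idea (this is not the SOLR-445 code, and the logging is deliberately minimal):

  import java.io.IOException;
  import org.apache.solr.update.AddUpdateCommand;
  import org.apache.solr.update.processor.UpdateRequestProcessor;

  public class SkipFaultyDocProcessor extends UpdateRequestProcessor {

    public SkipFaultyDocProcessor(UpdateRequestProcessor next) {
      super(next);
    }

    @Override
    public void processAdd(AddUpdateCommand cmd) throws IOException {
      try {
        super.processAdd(cmd);   // hand the document to the rest of the chain as usual
      } catch (Exception e) {
        // log and skip this single document instead of failing the whole batch
        System.err.println("Skipping faulty document: " + e.getMessage());
      }
    }
  }

The processor is then wired into an update chain through a matching UpdateRequestProcessorFactory declared in solrconfig.xml.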
Re: multiple document types in a core
Yes, stored fields are placed verbatim for every doc. But I wonder at the utility of trying to share stored information. The stored info is put in certain files in the index, see: http://lucene.apache.org/java/3_0_2/fileformats.html#file-names and the files that store data are pretty much irrelevant to searching, the data in them is only referenced when assembling the document for return. So by adding this complexity you'll be saving a bit on file transfers when replicating your index, but not much else. Is it worth it? If so, why? Best Erick On Mon, Oct 17, 2011 at 11:07 AM, lee carroll wrote: > Just as a follow up > > it looks like stored fields are stored verbatim for every doc. > > hotel index and store dest attributes > index size: 131M > number of records 49147 > > hotel index only dest attributes > > index size: 111m > number of records 49147 > > > ~400 chars(bytes) of destination data * 49147 (number of hotel docs) = ~19m > > basically everything is being stored > > No difference in time to index (very rough and not scientific :-) ) > > So it does seem an ok strategy to denormalise docs with index fields > but normalise with stored fields ? > Or have i missed some problems with this ? > > cheers lee c > > > > On 16 October 2011 11:54, lee carroll wrote: >> Hi Chris thanks for the response >> >>> It's an inverted index, so *tems* exist once (per segment) and those terms >>> "point" to the documents -- so having the same terms (in the same fields) >>> for multiple types of documents in one index is going to take up less >>> overall space then having distinct collections for each type of document. >> >> I'm not asking about the indexed terms but rather the stored values. >> By having two doc types are we gaining anything by "storing" >> attributes only for that doc type >> >> cheers lee c >> >
Re: use lucene to create index(with synonym) and solr query index
I'm not quite sure what you're asking, but the values returned for documents to the client are the *stored* values, not the indexed values. So your synonyms will never be returned as part of a document. Does that help? Best Erick On Wed, Oct 19, 2011 at 4:23 AM, cmd wrote: > 1.use lucene to create index(with synonym) > 2.config solr open synonym functionality > 3.user solr to query lucene index but the result missing the synonym word > why? and how can i do with each other. thanks! > > -- > View this message in context: > http://lucene.472066.n3.nabble.com/use-lucene-to-create-index-with-synonym-and-solr-query-index-tp3433124p3433124.html > Sent from the Solr - User mailing list archive at Nabble.com. >
Re: Find Documents with field = maxValue
Right, but consider the general case. You could potentially return every document in your index in a single packet with this functionality. I suspect that this is an edge case that you'll have to 1> implement the two-or-more query solution 2> write your own component that investigates the terms in the field in question and accomplishes your task. Best Erick On Wed, Oct 19, 2011 at 2:40 PM, Alireza Salimi wrote: > What I'm looking for is to do everything in single shot in Solr. > I'm not even sure if it's possible or not. > Finding the max value and then running another query is NOT my ideal > solution. > > Thanks everybody > > > On Tue, Oct 18, 2011 at 6:28 PM, Sujit Pal wrote: > >> Hi Alireza, >> >> Would this work? Sort the results by age desc, then loop through the >> results as long as age == age[0]. >> >> -sujit >> >> On Tue, 2011-10-18 at 15:23 -0700, Otis Gospodnetic wrote: >> > Hi, >> > >> > Are you just looking for: >> > >> > age: >> > >> > This will return all documents/records where age field is equal to target >> age. >> > >> > But maybe you want >> > >> > age:[0 TO ] >> > >> > This will include people aged from 0 to target age. >> > >> > Otis >> > >> > Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch >> > Lucene ecosystem search :: http://search-lucene.com/ >> > >> > >> > > >> > >From: Alireza Salimi >> > >To: solr-user@lucene.apache.org >> > >Sent: Tuesday, October 18, 2011 10:15 AM >> > >Subject: Re: Find Documents with field = maxValue >> > > >> > >Hi Ahmet, >> > > >> > >Thanks for your reply, but I want ALL documents with age = max_age. >> > > >> > > >> > >On Tue, Oct 18, 2011 at 9:59 AM, Ahmet Arslan >> wrote: >> > > >> > >> >> > >> >> > >> --- On Tue, 10/18/11, Alireza Salimi >> wrote: >> > >> >> > >> > From: Alireza Salimi >> > >> > Subject: Find Documents with field = maxValue >> > >> > To: solr-user@lucene.apache.org >> > >> > Date: Tuesday, October 18, 2011, 4:10 PM >> > >> > Hi, >> > >> > >> > >> > It might be a naive question. >> > >> > Assume we have a list of Document, each Document contains >> > >> > the information of >> > >> > a person, >> > >> > there is a numeric field named 'age', how can we find those >> > >> > Documents whose >> > >> > *age* field >> > >> > is *max(age) *in one query. >> > >> >> > >> May be http://wiki.apache.org/solr/StatsComponent? >> > >> >> > >> Or sort by age? q=*:*&start=0&rows=1&sort=age desc >> > >> >> > > >> > > >> > > >> > >-- >> > >Alireza Salimi >> > >Java EE Developer >> > > >> > > >> > > >> >> > > > -- > Alireza Salimi > Java EE Developer >
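For the record, the two-or-more query approach discussed in this thread looks roughly like this (the field name age and the max value 73 are just examples):

  # query 1: find the max value (one document back, sorted descending)
  /solr/select?q=*:*&rows=1&fl=age&sort=age+desc
  # query 2: fetch every document carrying that value
  /solr/select?q=age:73&rows=100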
Re: where is solr data import handler looking for my file?
I think you need to back up and state the problem you're trying to solve. Offhand, it looks as though you're trying to do something with DIH that it wasn't intended to do. But that's just a guess since the details of what you're trying to do are so sparse... Best Erick On Wed, Oct 19, 2011 at 10:49 PM, Fred Zimmerman wrote: > Solr dataimport is reporting file not found when it looks for foo.xml. > > Where is it looking for /data? Is this a URL off the apache2/htdocs on the > server, or is it a URL within example/solr/...? > > > <entity > processor="XPathEntityProcessor" > stream="true" > forEach="/mediawiki/page/" > url="/data/foo.xml" > transformer="RegexTransformer,DateFormatTransformer" > /> >
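For what it's worth, a typical file-based setup for XPathEntityProcessor looks roughly like the sketch below; with a FileDataSource the url is resolved on the local filesystem (relative paths against basePath), not against the web server's document root. Names and paths here are illustrative:

  <dataConfig>
    <dataSource type="FileDataSource" encoding="UTF-8" basePath="/home/me/solr/example/solr/data"/>
    <document>
      <entity name="page"
              processor="XPathEntityProcessor"
              stream="true"
              forEach="/mediawiki/page/"
              url="foo.xml"
              transformer="RegexTransformer,DateFormatTransformer">
        <field column="title" xpath="/mediawiki/page/title"/>
      </entity>
    </document>
  </dataConfig>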
Re: Dismax and phrases
Hmmm dismax is, indeed, different. Note that dismax doesn't respect the default operator at all, so don't be mislead there. Could you paste the debug output for both the queries? Perhaps something will jump out at us. Best Erick On Thu, Oct 20, 2011 at 11:08 AM, Hyttinen Lauri wrote: > Thank you Otis for the answer. > > I've played around with the solr admin query interface and I've managed to > confuse myself even more. > If I query without the quotes solr seems to form two parsedqueries > > +((DisjunctionMaxQuery(( -first word stuff- )) DisjunctionMaxQuery(( -second > word stuff- )) > > and then based on the query give out results which have -both- words. > Default operator is OR in schema.xml. > > With quotes the query is different with only one DisjunctionMaxQuery in > parsedquery but the results (of which there are more than double) have pages > in them > which have only one of the words (granted these results are much lower than > the ones with both words) > > I set qs to 0. (and I even played with pf and ps before commenting them out > since they relate to automaticed phrased queries?) > > Best regards, > Lauri > > PS. I am not unhappy with the results so to speak but perplexed and don't > know how to explain this number discrepancy to project members other than > "Dismax is different." > > > On 10/19/2011 04:28 PM, Otis Gospodnetic wrote: >> >> Lauri, >> >> Start with adding&debugQuery=true to your URL calls to Solr and look at >> how the queries are getting rewritten to understand what is going on. What >> you are seeing is actually expected, so if you want your phrase query to be >> a strict phrase query, just use standard request handler, not dismax. >> >> Otis >> >> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch >> Lucene ecosystem search :: http://search-lucene.com/ >> >> >>> >>> From: Hyttinen Lauri >>> To: solr-user@lucene.apache.org >>> Sent: Wednesday, October 19, 2011 5:02 AM >>> Subject: Dismax and phrases >>> >>> Hello, >>> >>> I've inherited a solr-lucene project which I continue to develop. This >>> particular SOLR (1.4.1) uses dismax for the queries but I am getting some >>> results that I do not understand. Mainly when I search for two terms I get >>> some results however when I put quotes around the two terms I get a lot more >>> results which goes against my understanding of what should happen ie. a >>> lesser set of results. Where should I start digging for the answer? >>> solrconfiq.xql or some other place? >>> >>> Best regards, >>> Lauri Hyttinen >>> >>> >>> > > > -- > Lauri Hyttinen > Tietopalvelusuunnittelija > Tilastokeskus > Yksikkö > Käyntiosoite: Työpajankatu 13, 00580 Helsinki > Postiosoite: PL 3 A, 00022 Tilastokeskus > puh. 09 1734 > lauri.hytti...@tilastokeskus.fi > www.tilastokeskus.fi > >
Re: Question about near query order
Just to chime in here... You will get different results for "A B"~2 and "B A"~2. In the simple two-term case, changing the order requires an extra move(s). There's a very good explanation of this in Lucene In Action II. Best Erick On Thu, Oct 20, 2011 at 3:35 PM, Jason, Kim wrote: > Which one is better performance of setting inOrder=false in solrconfig.xml > and quering with "A B"~1 AND "B A"~1 if performance differences? > > -- > View this message in context: > http://lucene.472066.n3.nabble.com/Question-about-near-query-order-tp3427312p3437701.html > Sent from the Solr - User mailing list archive at Nabble.com. >
Re: how to handle large relational data in Solr
In addition to Otis' suggestion, think about using multivalued fields with an increment gap of, say, 100 (assuming your accessories had less than 100 fields). Then you can do proximity searches with a size < 100 (e.g. "red swing"~90) would not match across your multiple entries If this is clear as mud, write back with what you've tried and maybe we can help Best Erick On Thu, Oct 20, 2011 at 7:23 PM, Jonathan Carothers wrote: > Actually, that's the root of my concern. It looks like it product will > average ~20,000 associated accessories, still workable, but starting to look > painful. Coming back the other way, I would guess each accessory would be > associated with 100 products on average. > > Given that there would be searchable fields in both the product and accessory > data, I assume I would have to either split them into separate indexes and > merge the results, or have one document per product/accessory combo so that I > don't get a mix of accessories matching the search term. For example, if a > product had two accessories, one with the description of "Blue Swing" and > another with "Red Ball" and I did a search for "Red Swing" it would rank > about the same as a document that actually had a "Red Swing". > > So it sounds like you are suggesting the external map, in which case is there > a good way to merge the two searches? Basically on search on product > attributes and a second search on the attributes of related accessories? > > many thanks, > Jonathan > > From: Robert Stewart [bstewart...@gmail.com] > Sent: Thursday, October 20, 2011 12:05 PM > To: solr-user@lucene.apache.org > Subject: Re: how to handle large relational data in Solr > > If your "documents" are products, then 100,000 documents is a pretty small > index for solr. Do you know approximately how many accessories are related > to each product on average? If # if relatively small (around 100 or less), > then it should be ok to create product documents with all the related > accessories as fields on the document, something like: > > > PRODUCT_ID > PRODUCT_NAME > accessory one > accessory two > > accessory N > > > > And then you can search for products by accessory, and show accessory facets > over products, etc. > > Even if # of accessories per product is large (1000 or more), you can still > do it this way, but it may be better to store some small accessory ID as > integers instead of larger names, and maybe use some external mapping to > resolve names for search and display. > > Bob > > > On Oct 20, 2011, at 11:08 AM, Jonathan Carothers wrote: > >> Agreed, this will just be a read only view of the existing database for >> search purposes. Sorry for the confusion. >> >> From: Brandon Ramirez [brandon_rami...@elementk.com] >> Sent: Thursday, October 20, 2011 10:50 AM >> To: solr-user@lucene.apache.org >> Subject: RE: how to handle large relational data in Solr >> >> I would not recommend removing your relational database altogether. You >> should treat that as your system of record. By replacing it, you are >> forcing Solr to store the unmodified value for everything even when not >> needed. You also lose normalization. And if you ever need to add some >> data to your system that isn't search-related, you have no choice but to add >> it to your search index. 
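To make the increment-gap idea at the top of this message concrete, a sketch (names are illustrative):

  <fieldType name="text_gap" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>
  <field name="accessory_desc" type="text_gap" indexed="true" stored="true" multiValued="true"/>

With a 100-position gap between successive values of the multivalued field, a phrase query whose slop stays below 100, e.g. accessory_desc:"red swing"~90, can only match inside a single accessory entry, never across "Blue Swing" and "Red Ball".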
>> >> >> Brandon Ramirez | Office: 585.214.5413 | Fax: 585.295.4848 >> Software Engineer II | Element K | www.elementk.com >> >> >> -Original Message- >> From: Jonathan Carothers [mailto:jonathan.caroth...@amentra.com] >> Sent: Thursday, October 20, 2011 10:12 AM >> To: solr-user@lucene.apache.org >> Subject: how to handle large relational data in Solr >> >> All, >> >> We are attempting to convert a fairly large relational database into Solr >> index(es). >> >> There are ~100,000 products with ~1,000,000 accessories that can be related >> to any number of the products. So if I include the search terms and the >> relationships in the same index, we're looking at a pretty huge index. >> >> If we break it out into three indexes, one for the product search, one for >> the accessories search, and one for their relationship, is there a good way >> to merge the results? >> >> Is there a better way to structure the indexes? >> >> We will have a relational database available if it makes sense to do some >> sort of a hybrid approach. >> >> many thanks, >> Jonathan >> > >
Re: OS Cache - Solr
Think about using cores rather than instances if you really must have this kind of separation. Otherwise you might have much better luck combining these into a single index. Best Erick On Fri, Oct 21, 2011 at 7:07 AM, Sujatha Arun wrote: > Yes its same ,we have a base static schema and wherever required we > use dynamic. > > Regards, > Sujatha > > > On Thu, Oct 20, 2011 at 6:26 PM, Jaeger, Jay - DOT > wrote: > >> I wonder. What if, instead of 200 instances, you had one instance, but >> built a uniqueKey up out of whatever you have now plus whatever information >> currently segregates the instances. Then this would be much more >> manageable. >> >> In other words, what is different about each of the 200 instances? Is the >> schema for each essentially the same, as I am guessing? >> >> JRJ >> >> -Original Message- >> From: Sujatha Arun [mailto:suja.a...@gmail.com] >> Sent: Thursday, October 20, 2011 12:21 AM >> To: solr-user@lucene.apache.org >> Cc: Otis Gospodnetic >> Subject: Re: OS Cache - Solr >> >> Yes 200 Individual Solr Instances not solr cores. >> >> We get an avg response time of below 1 sec. >> >> The number of documents is not many most of the isntances ,some of the >> instnaces have about 5 lac documents on average. >> >> Regards >> Sujahta >> >> On Thu, Oct 20, 2011 at 3:35 AM, Jaeger, Jay - DOT > >wrote: >> >> > 200 instances of what? The Solr application with lucene, etc. per usual? >> > Solr cores? ??? >> > >> > Either way, 200 seems to be very very very many: unusually so. Why so >> > many? >> > >> > If you have 200 instances of Solr in a 20 GB JVM, that would only be >> 100MB >> > per Solr instance. >> > >> > If you have 200 instances of Solr all accessing the same physical disk, >> the >> > results are not likely to be satisfactory - the disk head will go nuts >> > trying to handle all of the requests. >> > >> > JRJ >> > >> > -Original Message- >> > From: Sujatha Arun [mailto:suja.a...@gmail.com] >> > Sent: Wednesday, October 19, 2011 12:25 AM >> > To: solr-user@lucene.apache.org; Otis Gospodnetic >> > Subject: Re: OS Cache - Solr >> > >> > Thanks ,Otis, >> > >> > This is our Solr Cache Allocation.We have the same Cache allocation for >> > all >> > our *200+ instances* in the single Server.Is this too high? >> > >> > *Query Result Cache*:LRU Cache(maxSize=16384, initialSize=4096, >> > autowarmCount=1024, ) >> > >> > *Document Cache *:LRU Cache(maxSize=16384, initialSize=16384) >> > >> > >> > *Filter Cache* LRU Cache(maxSize=16384, initialSize=4096, >> > autowarmCount=4096, ) >> > >> > Regards >> > Sujatha >> > >> > On Wed, Oct 19, 2011 at 4:05 AM, Otis Gospodnetic < >> > otis_gospodne...@yahoo.com> wrote: >> > >> > > Maybe your Solr Document cache is big and that's consuming a big part >> of >> > > that JVM heap? >> > > If you want to be able to run with a smaller heap, consider making your >> > > caches smaller. >> > > >> > > Otis >> > > >> > > Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch >> > > Lucene ecosystem search :: http://search-lucene.com/ >> > > >> > > >> > > > >> > > >From: Sujatha Arun >> > > >To: solr-user@lucene.apache.org >> > > >Sent: Tuesday, October 18, 2011 12:53 AM >> > > >Subject: Re: OS Cache - Solr >> > > > >> > > >Hello Jan, >> > > > >> > > >Thanks for your response and clarification. >> > > > >> > > >We are monitoring the JVM cache utilization and we are currently using >> > > about >> > > >18 GB of the 20 GB assigned to JVM. 
Out total index size being abt >> 14GB >> > > > >> > > >Regards >> > > >Sujatha >> > > > >> > > >On Tue, Oct 18, 2011 at 1:19 AM, Jan Høydahl >> > > wrote: >> > > > >> > > >> Hi Sujatha, >> > > >> >> > > >> Are you sure you need 20Gb for Tomcat? Have you profiled using >> > JConsole >> > > or >> > > >> similar? Try with 15Gb and see how it goes. The reason why this is >> > > >> beneficial is that you WANT your OS to have available memory for >> disk >> > > >> caching. If you have 17Gb free after starting Solr, your OS will be >> > able >> > > to >> > > >> cache all index files in memory and you get very high search >> > > performance. >> > > >> With your current settings, there is only 12Gb free for both caching >> > the >> > > >> index and for your MySql activities. Chances are that when you >> backup >> > > >> MySql, the cached part of your Solr index gets flushed from disk >> > caches >> > > and >> > > >> need to be re-cached later. >> > > >> >> > > >> How to interpret memory stats vary between OSes, and seing 163Mb >> free >> > > may >> > > >> simply mean that your OS has used most RAM for various caches and >> > > paging, >> > > >> but will flush it once an application asks for more memory. Have you >> > > seen >> > > >> http://wiki.apache.org/solr/SolrPerformanceFactors ? >> > > >> >> > > >> You should also slim down your index maximally by setting >> stored=false >> > > and >> > > >> indexed=false wherever possible. I would also upgrade to a more >> > curr
Question about dismax and score boost with date
Solr Specification Version: 1.4.0
Solr Implementation Version: 1.4.0 833479 - grantingersoll - 2009-11-06 12:33:40
Lucene Specification Version: 2.9.1
Lucene Implementation Version: 2.9.1 832363 - 2009-11-03 04:37:25

<fieldType name="tdate" class="solr.TrieDateField" precisionStep="6" positionIncrementGap="0"/>
<field name="created" type="tdate" indexed="true" stored="false" omitNorms="true" required="false" omitTermFreqAndPositions="true" />

I am using 'created' as the name of the date field. My dates are being populated as such: 1980-01-01T00:00:00Z

Search handler (solrconfig):

<requestHandler name="dismax" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">dismax</str>
    <str name="echoParams">explicit</str>
    <float name="tie">0.1</float>
    <str name="qf">name0^2 other^1</str>
    <str name="pf">name0^2 other^1</str>
    <int name="ps">3</int>
    <int name="qs">3</int>
    <str name="q.alt">*:*</str>
  </lst>
</requestHandler>

--

Query:

/solr/ftf/dismax/?q=libya
&debugQuery=off
&hl=true
&start=
&rows=10

--

I am trying to factor created into the SCORE (boost). I have tried a million ways to do this, with no success. I know the dates are populating correctly because I can sort by them. Can anyone help me implement date boosting with dismax under this scenario??? -Craig
Re: inconsistent results when faceting on multivalued field
I think the key here is you are a bit confused about what the multiValued thing is all about. The fq clause says, essentially, "restrict all my search results to the documents where 1213206 occurs in sou_codeMetier. That's *all* the fq clause does. Now, by saying facet.field=sou_codeMetier you're asking Solr to count the number of documents that exist for each unique value in that field. A single document can be counted many times. Each "bucket" is a unique value in the field. On the other hand, saying facet.query=sou_codeMetier:[1213206 TO 1213206] you're asking Solr to count all the documents that make it through your query (*:* in this case) with *any* value in the indicated range. Facet queries really have nothing to do with filter queries. That is, facet queries in no way restrict the documents that are returned, they just indicate ways of counting documents into buckets Best Erick On Fri, Oct 21, 2011 at 10:01 AM, Darren Govoni wrote: > My interpretation of your results are that your FQ found 1281 documents > with 1213206 value in sou_codeMetier field. Of those results, 476 also > had 1212104 as a value...and so on. Since ALL the results will have > the field value in your FQ, then I would expect the "other" values to > be equal or less occurring from the result set, which they appear to be. > > > > On 10/21/2011 03:55 AM, Alain Rogister wrote: >> >> Pravesh, >> >> Not exactly. Here is the search I do, in more details (different field >> name, >> but same issue). >> >> I want to get a count for a specific value of the sou_codeMetier field, >> which is multivalued. I expressed this by including a fq clause : >> >> >> /select/?q=*:*&facet=true&facet.field=sou_codeMetier&fq=sou_codeMetier:1213206&rows=0 >> >> The response (excerpt only): >> >> >> >> 1281 >> 476 >> 285 >> 260 >> 208 >> 171 >> 152 >> ... >> >> As you see, I get back both the expected results and extra results I would >> expect to be filtered out by the fq clause. >> >> I can eliminate the extra results with a >> 'f.sou_codeMetier.facet.prefix=1213206' clause. >> >> But I wonder if Solr's behavior is correct and how the fq filtering works >> exactly. >> >> If I replace the facet.field clause with a facet.query clause, like this: >> >> /select/?q=*:*&facet=true&facet.query=sou_codeMetier:[1213206 TO >> 1213206]&rows=0 >> >> The results contain a single item: >> >> >> 1281 >> >> >> The 'fq=sou_codeMetier:1213206' clause isn't necessary here and does not >> affect the results. >> >> Thanks, >> >> Alain >> >> On Fri, Oct 21, 2011 at 9:18 AM, pravesh wrote: >> >>> Could u clarify on below: > > When I make a search on facet.qua_code=1234567 ?? >>> >>> Are u trying to say, when u fire a fresh search for a facet item, like; >>> q=qua_code:1234567?? >>> >>> This this would fetch for documents where qua_code fields contains either >>> the terms 1234567 OR both terms (1234567& 9384738.and others terms). >>> This would be since its a multivalued field and hence if you see the >>> facet, >>> then its shown for both the terms. 
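To put Erick's distinction in request terms, using the field from this thread:

  ...&q=*:*&fq=sou_codeMetier:1213206&facet=true&facet.field=sou_codeMetier
      -> one count per unique value of sou_codeMetier, computed over the documents that pass q and fq
  ...&q=*:*&facet=true&facet.query=sou_codeMetier:1213206
      -> a single count: how many of the q result documents also match sou_codeMetier:1213206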
>>> > If I reword the query as 'facet.query=qua_code:1234567 TO 1234567', I >>> >>> only >>> get the expected counts >>> >>> You will get facet for documents which have term 1234567 only >>> (facet.query >>> would apply to the facets,so as to which facet to be picked/shown) >>> >>> Regds >>> Pravesh >>> >>> >>> >>> -- >>> View this message in context: >>> >>> http://lucene.472066.n3.nabble.com/inconsistent-results-when-faceting-on-multivalued-field-tp3438991p3440128.html >>> Sent from the Solr - User mailing list archive at Nabble.com. >>> > >
Re: SOLRNET combine LocalParams with SolrMultipleCriteriaQuery?
Hmmm, this is the Java forum, you might get a faster respons on the Solr .net users list Especially since I don't find any reference to SolrMultipleCriteriaQuery in the Java 3.x code Best Erick On Fri, Oct 21, 2011 at 1:44 PM, Grüger, Joscha wrote: > Hello, > > does anybody know how to combine SolrMultipleCriteriaQuery and LocalParams > (in SOLRnet)? > > I've tried things like that (don't worry about bad the code, it's just to > test) > > var test = solr.Query(BuildQuery(parameters), new QueryOptions > { > FilterQueries = bq(), > Facet = new FacetParameters > { > Queries = new[] { > new SolrFacetFieldQuery(new LocalParams {{"ex", "dt"}} + > "ju_success") , new SolrFacetFieldQuery(new LocalParams {{"ex", "dt"}} + > "dr_success") > } > } > }); > ... > > public ICollection bq() > { > List i = new List(); > i.Add(new LocalParams { { "tag", "dt" } } + > Query.Field("dr_success").Is("simple")); > List MultiListItems = new List(); > var t = new SolrMultipleCriteriaQuery(i, "OR"); > MultiListItems.Add(t); > return MultiListItems(); > } > > > > What I try to do are multi-select-facets with a "OR" operator. > > Thanks for all the help! > > Grüger > >
Re: Can Solr handle large text files?
Also be aware that by default Solr is configured to only index the first 10,000 lines of text. See maxFieldLength in solrconfig.xml Best Erick On Fri, Oct 21, 2011 at 7:34 PM, Peter Spam wrote: > Thanks for your note, Anand. What was the maximum chunk size for you? Could > you post the relevant portions of your configuration file? > > > Thanks! > Pete > > On Oct 21, 2011, at 4:20 AM, anand.ni...@rbs.com wrote: > >> Hi, >> >> I was also facing the issue of highlighting the large text files. I applied >> the solution proposed here and it worked. But I am getting following error : >> >> >> Basically 'hitGrouped.vm' is not found. I am using solr-3.4.0. Where can I >> get this file from. Its reference is present in browse.vm >> >> >> #if($response.response.get('grouped')) >> #foreach($grouping in $response.response.get('grouped')) >> #parse("hitGrouped.vm") >> #end >> #else >> #foreach($doc in $response.results) >> #parse("hit.vm") >> #end >> #end >> >> >> >> HTTP Status 500 - Can't find resource 'hitGrouped.vm' in classpath or >> 'C:\caprice\workspace\caprice\dist\DEV\solr\.\conf/', >> cwd=C:\glassfish3\glassfish\domains\domain1\config >> java.lang.RuntimeException: Can't find resource 'hitGrouped.vm' in classpath >> or 'C:\caprice\workspace\caprice\dist\DEV\solr\.\conf/', >> cwd=C:\glassfish3\glassfish\domains\domain1\config at >> org.apache.solr.core.SolrResourceLoader.openResource(SolrResourceLoader.java:268) >> at >> org.apache.solr.response.SolrVelocityResourceLoader.getResourceStream(SolrVelocityResourceLoader.java:42) >> at org.apache.velocity.Template.process(Template.java:98) at >> org.apache.velocity.runtime.resource.ResourceManagerImpl.loadResource(ResourceManagerImpl.java:446) >> at >> >> Thanks & Regards, >> Anand >> Anand Nigam >> RBS Global Banking & Markets >> Office: +91 124 492 5506 >> >> >> -Original Message- >> From: karsten-s...@gmx.de [mailto:karsten-s...@gmx.de] >> Sent: 21 October 2011 14:58 >> To: solr-user@lucene.apache.org >> Subject: Re: Can Solr handle large text files? >> >> Hi Peter, >> >> highlighting in large text files can not be fast without dividing the >> original text in small piece. >> So take a look in >> http://xtf.cdlib.org/documentation/under-the-hood/#Chunking >> and in >> http://www.lucidimagination.com/blog/2010/09/16/2446/ >> >> Which means that you should divide your files and use Result Grouping / >> Field Collapsing to list only one hit per original document. >> >> (xtf also would solve your problem "out of the box" but xtf does not use >> solr). >> >> Best regards >> Karsten >> >> Original-Nachricht >>> Datum: Thu, 20 Oct 2011 17:59:04 -0700 >>> Von: Peter Spam >>> An: solr-user@lucene.apache.org >>> Betreff: Can Solr handle large text files? >> >>> I have about 20k text files, some very small, but some up to 300MB, >>> and would like to do text searching with highlighting. >>> >>> Imagine the text is the contents of your syslog. >>> >>> I would like to type in some terms, such as "error" and "mail", and >>> have Solr return the syslog lines with those terms PLUS two lines of >>> context. >>> Pretty much just like Google's highlighting. >>> >>> 1) Can Solr handle this? I had extremely long query times when I >>> tried this with Solr 1.4.1 (yes I was using TermVectors, etc.). I >>> tried breaking the files into 1MB pieces, but searching would be wonky >>> => return the wrong number of documents (ie. if one file had a term 5 >>> times, and that was the only file that had the term, I want 1 result, not 5 >>> results). 
>>> >>> 2) What sort of tokenizer would be best? Here's what I'm using: >>> >>> >> multiValued="false" termVectors="true" termPositions="true" >>> termOffsets="true" /> >>> >>> >>> >>> >>> >>> >> generateWordParts="0" generateNumberParts="0" catenateWords="0" >>> catenateNumbers="0" >>> catenateAll="0" splitOnCaseChange="0"/> >>> >>> >>> >>> >>> Thanks! >>> Pete >> >> *** >> The Royal Bank of Scotland plc. Registered in Scotland No 90312. >> Registered Office: 36 St Andrew Square, Edinburgh EH2 2YB. >> Authorised and regulated by the Financial Services Authority. The >> Royal Bank of Scotland N.V. is authorised and regulated by the >> De Nederlandsche Bank and has its seat at Amsterdam, the >> Netherlands, and is registered in the Commercial Register under >> number 33002587. Registered Office: Gustav Mahlerlaan 350, >> Amsterdam, The Netherlands. The Royal Bank of Scotland N.V. and >> The Royal Bank of Scotland plc are authorised to act as agent for each >> other in certain jurisdictions. >> >> This e-mail message is confidential and for use by the addressee only. >> If the message is received by anyone other than the addressee, please >> return the message to the sender by replying to it and then delete the >> message from your computer. Inter
Re: Date boosting with dismax question
Have you seen this? http://wiki.apache.org/solr/SolrRelevancyFAQ#How_can_I_boost_the_score_of_newer_documents Best Erick On Sat, Oct 22, 2011 at 3:26 AM, Craig Stadler wrote: > Solr Specification Version: 1.4.0 > Solr Implementation Version: 1.4.0 833479 - grantingersoll - 2009-11-06 > 12:33:40 > Lucene Specification Version: 2.9.1 > Lucene Implementation Version: 2.9.1 832363 - 2009-11-03 04:37:25 > > precisionStep="6" positionIncrementGap="0"/> > > stored="false" omitNorms="true" required="false" > omitTermFreqAndPositions="true" /> > > I am using 'created' as the name of the date field. > > My dates are being populated as such : > 1980-01-01T00:00:00Z > > Search handler (solrconfig) : > > > > dismax > explicit > 0.1 > name0^2 other ^1 > name0^2 other ^1 > 3 > 3 > *:* > > > > -- > > Query : > > /solr/ftf/dismax/?q=libya > &debugQuery=off > &hl=true > &start= > &rows=10 > -- > > I am trying to factor in created to the SCORE. (boost) I have tried a > million ways to do this, no success. I know the dates are populating > correctly because I can sort by them. Can anyone help me implement date > boosting with dismax under this scenario??? > > -Craig >
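In other words, with the 'created' field the FAQ recipe amounts to adding a boost function to the dismax handler's defaults, roughly like this (the exact weighting is something to tune):

  <str name="bf">recip(ms(NOW,created),3.16e-11,1,1)</str>

Note that ms() needs the field to be a trie-based date field; since you can already sort on created, that should be the case.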
Re: Date boosting with dismax question
Yes I have and I cannot get it to work. Perhaps something is out of version for my setup? I tried for 3 hours to get ever example I could find to work. - Original Message - From: "Erick Erickson" To: Sent: Sunday, October 23, 2011 5:07 PM Subject: Re: Date boosting with dismax question Have you seen this? http://wiki.apache.org/solr/SolrRelevancyFAQ#How_can_I_boost_the_score_of_newer_documents Best Erick On Sat, Oct 22, 2011 at 3:26 AM, Craig Stadler wrote: Solr Specification Version: 1.4.0 Solr Implementation Version: 1.4.0 833479 - grantingersoll - 2009-11-06 12:33:40 Lucene Specification Version: 2.9.1 Lucene Implementation Version: 2.9.1 832363 - 2009-11-03 04:37:25 I am using 'created' as the name of the date field. My dates are being populated as such : 1980-01-01T00:00:00Z Search handler (solrconfig) : dismax explicit 0.1 name0^2 other ^1 name0^2 other ^1 3 3 *:* -- Query : /solr/ftf/dismax/?q=libya &debugQuery=off &hl=true &start= &rows=10 -- I am trying to factor in created to the SCORE. (boost) I have tried a million ways to do this, no success. I know the dates are populating correctly because I can sort by them. Can anyone help me implement date boosting with dismax under this scenario??? -Craig
Re: Update document field with solrj
You cannot update a single field in a document in Solr, you need to replace the entire document. multiValued is irrelevant to this problem.. Or did I misunderstand your problem? Best Erick On Sun, Oct 23, 2011 at 1:32 PM, hadi wrote: > I want to edit document filed in solr,for example edit the author name,so i > use the following code in solrj: > > params.set("literal.author","anaconda") > > but the author multivalued="true" in schema and because of that "anaconde" > is not replace with it's previous name and add to the end of the author > name, > also if i omit the multivalued field or set it to false the bad request > exception happen in re-indexing file with new author field,how can i solve > this problem and delete or modify the previous document field in solrj? or > does it any config i miss in schema? thanks > > > -- > View this message in context: > http://lucene.472066.n3.nabble.com/Update-document-field-with-solrj-tp3445488p3445488.html > Sent from the Solr - User mailing list archive at Nabble.com. >
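In SolrJ terms that means re-adding the complete document under the same unique key. A minimal sketch (field values, core URL and the assumption that "id" is the unique key are all made up for illustration):

  import org.apache.solr.client.solrj.SolrServer;
  import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
  import org.apache.solr.common.SolrInputDocument;

  public class ReplaceDocExample {
    public static void main(String[] args) throws Exception {
      SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("id", "doc-42");        // same unique key => the old document is replaced
      doc.addField("author", "anaconda");  // the new value(s) for the multivalued author field
      doc.addField("text", "full text of the document goes here again");  // every other field must be re-supplied too

      server.add(doc);
      server.commit();
    }
  }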
Re: questions about autocommit & committing documents
A full commit of all pending documents is performed whenever the first trigger is reached. So, maxdocs = 1000. Max time=1 minute. Index a packet with 999 docs. Index another packet with 50 documents immediately after. One commit of 1049 documents happens Index a packet of 999 docs. Do nothing for a minute. One commit of 999 docs happens because of maxtime... But I have to ask, "why do you care"? What high level problem are you trying to handle? Best Erick On Sun, Oct 23, 2011 at 3:03 PM, darul wrote: > May someone explain me different use case when both or only one AutoCommit > parameters is filled ? > > I really need to understand it. > > For example with these configurations : > > > 1 > > > or > > > 1000 > > > or > > > 1 > 1000 > > > Thanks to everyone > > -- > View this message in context: > http://lucene.472066.n3.nabble.com/questions-about-autocommit-committing-documents-tp1582487p3445607.html > Sent from the Solr - User mailing list archive at Nabble.com. >
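For reference, both triggers live together in solrconfig.xml, e.g. with the values from the example above:

  <updateHandler class="solr.DirectUpdateHandler2">
    <autoCommit>
      <maxDocs>1000</maxDocs>
      <maxTime>60000</maxTime> <!-- milliseconds, i.e. one minute -->
    </autoCommit>
  </updateHandler>

Leaving one of the two out simply means only the remaining trigger can fire.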
Re: Date boosting with dismax question
Define "not working". Show what you're getting and what you expect to find. Show your data. Note that the example given boosts on quite coarse dates, it *tends* to make documents published in a particular *year* score higher. You might review: http://wiki.apache.org/solr/UsingMailingLists Best Erick On Sun, Oct 23, 2011 at 11:08 PM, Craig Stadler wrote: > Yes I have and I cannot get it to work. Perhaps something is out of version > for my setup? > I tried for 3 hours to get ever example I could find to work. > > - Original Message - From: "Erick Erickson" > > To: > Sent: Sunday, October 23, 2011 5:07 PM > Subject: Re: Date boosting with dismax question > > > Have you seen this? > > http://wiki.apache.org/solr/SolrRelevancyFAQ#How_can_I_boost_the_score_of_newer_documents > > Best > Erick > > > On Sat, Oct 22, 2011 at 3:26 AM, Craig Stadler > wrote: >> >> Solr Specification Version: 1.4.0 >> Solr Implementation Version: 1.4.0 833479 - grantingersoll - 2009-11-06 >> 12:33:40 >> Lucene Specification Version: 2.9.1 >> Lucene Implementation Version: 2.9.1 832363 - 2009-11-03 04:37:25 >> >> > precisionStep="6" positionIncrementGap="0"/> >> >> > stored="false" omitNorms="true" required="false" >> omitTermFreqAndPositions="true" /> >> >> I am using 'created' as the name of the date field. >> >> My dates are being populated as such : >> 1980-01-01T00:00:00Z >> >> Search handler (solrconfig) : >> >> >> >> dismax >> explicit >> 0.1 >> name0^2 other ^1 >> name0^2 other ^1 >> 3 >> 3 >> *:* >> >> >> >> -- >> >> Query : >> >> /solr/ftf/dismax/?q=libya >> &debugQuery=off >> &hl=true >> &start= >> &rows=10 >> -- >> >> I am trying to factor in created to the SCORE. (boost) I have tried a >> million ways to do this, no success. I know the dates are populating >> correctly because I can sort by them. Can anyone help me implement date >> boosting with dismax under this scenario??? >> >> -Craig >> > >
Re: where is solr data import handler looking for my file?
Figured it out. See step 12 in http://business.zimzaz.com/wordpress/2011/10/how-to-clone-wikipedia-mirror-and-index-wikipedia-with-solr/. Thanks! On Sun, Oct 23, 2011 at 1:31 PM, Erick Erickson wrote: > I think you need to back up and state the problem you're trying to > solve. Offhand, it looks as though you're trying to do something > with DIH that it wasn't intended to do. But that's just a guess > since the details of what you're trying to do are so sparse... > > Best > Erick > > On Wed, Oct 19, 2011 at 10:49 PM, Fred Zimmerman > wrote: > > Solr dataimport is reporting file not found when it looks for foo.xml. > > > > Where is it looking for /data? is this an url off the apache2/htdocs on > the > > server, or is it an URL within example/solr/...? > > > > > > >processor="XPathEntityProcessor" > >stream="true" > >forEach="/mediawiki/page/" > >url="/data/foo.xml" > >transformer="RegexTransformer,DateFormatTransformer" > >> > > >
schema.xml bloat?
Hi, it seems from my limited experience thus far that as new data types are added, schema.xml will tend to become bloated with many different field and fieldtype definitions. Is this a problem in real life, and if so, what strategies are used to address it? FredZ
Re: schema.xml bloat?
On Oct 23, 2011, at 19:34 , Fred Zimmerman wrote: > it seems from my limited experience thus far that as new data types are > added, schema.xml will tend to become bloated with many different field and > fieldtype definitions. Is this a problem in real life, and if so, what > strategies are used to address it? ... by keeping your schema lean and clean, only with what YOU need in it. Granted, I'd personally keep all the built-in Solr primitive field types defined even if I didn't use them, but there aren't very many and don't really clutter things up. Defined fields should ONLY be what you need for your application, and generally that should be a tractable (and necessary) reasonably sized set. Erik
questions on query format
Hi, I've spent quite some time reading up on the query format and can't seem to solve this problem:

1. If I send Solr the following query: q={!lucene}profile_description:* I get what I would expect.

2. If I send Solr the following query: q=*:* I get nothing, just:
<result name="response" numFound="0" start="0" maxScore="0.0"/>
<lst name="highlighting"/>

Would appreciate some insight into what is going on. Thanks.
Re: schema.xml bloat?
So, basically, yes, it is a real problem and there is no designed solution? e.g. optional sub-schema files that can be turned off and on? On Sun, Oct 23, 2011 at 6:38 PM, Erik Hatcher wrote: > > On Oct 23, 2011, at 19:34 , Fred Zimmerman wrote: > > it seems from my limited experience thus far that as new data types are > > added, schema.xml will tend to become bloated with many different field > and > > fieldtype definitions. Is this a problem in real life, and if so, what > > strategies are used to address it? > > ... by keeping your schema lean and clean, only with what YOU need in it. > Granted, I'd personally keep all the built-in Solr primitive field types > defined even if I didn't use them, but there aren't very many and don't > really clutter things up. > > Defined fields should ONLY be what you need for your application, and > generally that should be a tractable (and necessary) reasonably sized set. > >Erik >
Re: schema.xml bloat?
On Oct 23, 2011, at 20:23 , Fred Zimmerman wrote: > So, basically, yes, it is a real problem and there is no designed solution? Hmmm problem? Not terribly so, is it? Certainly I'm more for a de-XMLification of configuration myself though. And we probably should bake-in all the basic field types so they aren't explicitly declared (but could still be overridden if desired). > e.g. optional sub-schema files that can be turned off and on? Hmmm... you can use XInclude stuff. Not sure that gives you the "optional" part of things exactly, but there is also the ${sys.prop.name[:default_value]} syntax that can be used in the configuration to pull off conditional types of tricks in some cases. Right, so no designed solution to the problem, I suppose. It's what it is at this point. I'm curious to hear more elaboration on the specifics of how this is a problem though. Certainly there is much room for improvement in most things. Erik > > On Sun, Oct 23, 2011 at 6:38 PM, Erik Hatcher wrote: > >> >> On Oct 23, 2011, at 19:34 , Fred Zimmerman wrote: >>> it seems from my limited experience thus far that as new data types are >>> added, schema.xml will tend to become bloated with many different field >> and >>> fieldtype definitions. Is this a problem in real life, and if so, what >>> strategies are used to address it? >> >> ... by keeping your schema lean and clean, only with what YOU need in it. >> Granted, I'd personally keep all the built-in Solr primitive field types >> defined even if I didn't use them, but there aren't very many and don't >> really clutter things up. >> >> Defined fields should ONLY be what you need for your application, and >> generally that should be a tractable (and necessary) reasonably sized set. >> >> Erik >>
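A sketch of the XInclude idea (whether schema.xml itself honours XInclude depends on your Solr version; solrconfig.xml does):

  <config xmlns:xi="http://www.w3.org/2001/XInclude">
    <xi:include href="common-settings.xml"/>
    ...
  </config>

and of the property-substitution trick, here with a hypothetical system property and a default value:

  <str name="masterUrl">${replication.master.url:http://localhost:8983/solr/replication}</str>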
data-import problem
Hi, I am trying to configure Solr on an AWS Ubuntu instance. I have MySQL on a different server, so I created an SSH tunnel for MySQL on port 3309. I downloaded the MySQL JDBC driver and copied it to the lib folder.

I edited example/solr/conf/solrconfig.xml to register the DataImportHandler with data-config.xml as its config file, and created example/solr/conf/data-config.xml.

I started the server: java -Djetty.port=80 -jar start.jar

When I tried to import data: http:///solr/dataimport?command=fullimport

I am getting the following response: 0 5 data-config.xml fullimport idle This response format is experimental. It is likely to change in the future.

Can someone help me with this? Also, where can I find the logs?

Thanks and Regards, Radha Krishna.
Re: questions on query format
> 2. If I send Solr the following query:
> q=*:*
>
> I get nothing, just:
> <result name="response" numFound="0" start="0" maxScore="0.0"/>
> <lst name="highlighting"/>
>
> Would appreciate some insight into what is going on.

If you are using dismax as the query parser, then *:* won't function as a match-all-docs query. To retrieve all docs with dismax, use the q.alt=*:* parameter. Also, adding debugQuery=on will display information about the parsed query.
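For example (rows and handler path are illustrative):

  /solr/select?defType=dismax&q.alt=*:*&rows=10&debugQuery=on

q.alt is only consulted when q is absent or empty, so drop the q parameter entirely when you want the match-all behaviour.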
Re: Dismax and phrases
On 10/23/2011 09:34 PM, Erick Erickson wrote: Hmmm dismax is, indeed, different. Note that dismax doesn't respect the default operator at all, so don't be mislead there. Could you paste the debug output for both the queries? Perhaps something will jump out at us. Best Erick Thank you Erick. I've tried to paste the query results here. First one is the query with ""'s around the terms and returns 6888 results. I've hid the explain parts of most of the results (and timing) just to keep the email reasonably short. If you need to see them let me know. + designates hidden "subtree". Best regards, Lauri 0 91 on standard 2.2 10 *,score on 0 "asuntojen hinnat" dismax + asuntojenhinnat "asuntojen hinnat" "asuntojen hinnat" +DisjunctionMaxQuery((table.title_t:"asuntojen hinnat"^2.0 | title_t:"asuntojen hinnat"^2.0 | ingress_t:"asuntojen hinnat" | (text_fi:asunto text_fi:hinta) | (table.description_fi:asunto table.description_fi:hinta) | table.description_t:"asuntojen hinnat" | graphic.title_t:"asuntojen hinnat"^2.0 | ((graphic.title_fi:asunto graphic.title_fi:hinta)^2.0) | ((table.title_fi:asunto table.title_fi:hinta)^2.0) | table.contents_t:"asuntojen hinnat" | text_t:"asuntojen hinnat" | (ingress_fi:asunto ingress_fi:hinta) | (table.contents_fi:asunto table.contents_fi:hinta) | ((title_fi:asunto title_fi:hinta)^2.0))~0.01) () type:tie^6.0 type:kuv^2.0 type:tau^2.0 FunctionQuery((1.0/(3.16E-11*float(ms(const(1319437912691),date(date.modified_dt)))+1.0))^100.0) +(table.title_t:"asuntojen hinnat"^2.0 | title_t:"asuntojen hinnat"^2.0 | ingress_t:"asuntojen hinnat" | (text_fi:asunto text_fi:hinta) | (table.description_fi:asunto table.description_fi:hinta) | table.description_t:"asuntojen hinnat" | graphic.title_t:"asuntojen hinnat"^2.0 | ((graphic.title_fi:asunto graphic.title_fi:hinta)^2.0) | ((table.title_fi:asunto table.title_fi:hinta)^2.0) | table.contents_t:"asuntojen hinnat" | text_t:"asuntojen hinnat" | (ingress_fi:asunto ingress_fi:hinta) | (table.contents_fi:asunto table.contents_fi:hinta) | ((title_fi:asunto title_fi:hinta)^2.0))~0.01 () type:tie^6.0 type:kuv^2.0 type:tau^2.0 (1.0/(3.16E-11*float(ms(const(1319437912691),date(date.modified_dt)))+1.0))^100.0 name="/media/nss/DATA2/data/wwwprod/til/ashi/2011/07/ashi_2011_07_2011-08-26_tie_001_fi.html"> 3.1653805 = (MATCH) sum of: 1.9299976 = (MATCH) max plus 0.01 times others of: 1.9211313 = weight(title_t:"asuntojen hinnat"^2.0 in 5891), product of: 0.26658234 = queryWeight(title_t:"asuntojen hinnat"^2.0), product of: 2.0 = boost 14.413042 = idf(title_t: asuntojen=250 hinnat=329) 0.009247955 = queryNorm 7.206521 = fieldWeight(title_t:"asuntojen hinnat" in 5891), product of: 1.0 = tf(phraseFreq=1.0) 14.413042 = idf(title_t: asuntojen=250 hinnat=329) 0.5 = fieldNorm(field=title_t, doc=5891) 0.03292808 = (MATCH) sum of: 0.016520109 = (MATCH) weight(text_fi:asunto in 5891), product of: 0.044221584 = queryWeight(text_fi:asunto), product of: 4.781769 = idf(docFreq=3251, maxDocs=142742) 0.009247955 = queryNorm 0.3735757 = (MATCH) fieldWeight(text_fi:asunto in 5891), product of: 1.0 = tf(termFreq(text_fi:asunto)=1) 4.781769 = idf(docFreq=3251, maxDocs=142742) 0.078125 = fieldNorm(field=text_fi, doc=5891) 0.016407972 = (MATCH) weight(text_fi:hinta in 5891), product of: 0.03705935 = queryWeight(text_fi:hinta), product of: 4.0073023 = idf(docFreq=7054, maxDocs=142742) 0.009247955 = queryNorm 0.44274852 = (MATCH) fieldWeight(text_fi:hinta in 5891), product of: 1.4142135 = tf(termFreq(text_fi:hinta)=2) 4.0073023 = idf(docFreq=7054, maxDocs=142742) 0.078125 = 
fieldNorm(field=text_fi, doc=5891) 0.34379265 = (MATCH) sum of: 0.19207533 = (MATCH) weight(graphic.title_fi:asunto in 5891), product of: 0.10662244 = queryWeight(graphic.title_fi:asunto), product of: 5.76465 = idf(docFreq=1216, maxDocs=142742) 0.01849591 = queryNorm 1.8014531 = (MATCH) fieldWeight(graphic.title_fi:asunto in 5891), product of: 1.0 = tf(termFreq(graphic.title_fi:asunto)=1) 5.76465 = idf(docFreq=1216, maxDocs=142742) 0.3125 = fieldNorm(field=graphic.title_fi, doc=5891) 0.15171732 = (MATCH) weight(graphic.title_fi:hinta in 5891), product of: 0.09476117 = queryWeight(graphic.title_fi:hinta), product of: 5.1233582 = idf(docFreq=2310, maxDocs=142742) 0.01849591 = queryNorm 1.6010494 = (MATCH) fieldWeight(graphic.title_fi:hinta in 5891), product of: 1.0 = tf(termFreq(graphic.title_fi:hinta)=1) 5.1233582 = idf(docFreq=2310, maxDocs=142742) 0.3125 = fieldNorm(field=graphic.title_fi, doc=5891) 0.5099132 = (MATCH) sum of: 0.302103 = (MATCH) weight(title_fi:asunto in 5891), product of:
Re: Want to support "did you mean xxx" but is Chinese
Hi Li Li, Thanks for your detailed explanation. Basically I have a similar implementation to yours. I just want to know if there is a better and more complete solution. I'll keep trying and will share any improvements with you and the community. Any ideas or advice are welcome. Floyd 2011/10/21 Li Li : > we have implemented one supporting "did you mean" and prefix suggestion > for Chinese. But we based our work on solr 1.4 and we made many > modifications, so it will take time to integrate it into current solr/lucene. > > Here is our solution. Glad to see any advice. > > 1. offline word and phrase discovery. > we discover new words and new phrases by mining query logs > > 2. online matching algorithm > for each word, e.g., 贝多芬 > we convert it to pinyin bei duo fen, then we index it using > n-grams, which means gram3:bei gram3:eid ... > to get the "did you mean" result, we convert the query 背朵分 into n-grams; > it's a boolean OR query, so there are many results (the words whose pinyin is > similar to the query will be ranked top). > Then we rerank the top 500 results with a fine-grained algorithm. > we use edit distance to align query and result, and we also take the > characters into consideration. e.g. for query 十度, the matches are 十渡 and 是度; their > pinyins are exactly the same, but 十渡 is better than 是度 because 十 occurred in > both query and match > you also need to consider the hotness (popularity) of different > words/phrases, which can be known from query logs > > Another issue is converting Chinese into pinyin, because some > characters have more than one pinyin. > e.g. 长沙 长大: 长's pinyin is chang in 长沙, so you should segment the query and > words/phrases first. Word segmentation is a basic problem in Chinese IR > > > 2011/10/21 Floyd Wu > >> Does anybody know how to implement this idea in SOLR? Please kindly >> point me in a direction. >> >> For example, a user enters the keyword "贝多芬" in Chinese (this is >> Beethoven in Chinese) >> but keys in a wrong combination of characters, "背多分" (this has the same >> pronunciation as the previous keyword "贝多芬"). >> >> The token "贝多芬" actually exists in the Solr index. How do I hit documents >> where "贝多芬" exists when "背多分" is entered? >> >> This is a basic function of commercial search engines, especially in >> Chinese processing. I wonder how to implement this in Solr and where >> the starting point is. >> >> Floyd >> >