Re: How to extend IndexSchema and SchemaField
Hi Chris,

I have opened an issue (SOLR-2146 [1]) following that discussion.

[1] https://issues.apache.org/jira/browse/SOLR-2146

cheers
--
Renaud Delbru

On 14/09/10 01:06, Chris Hostetter wrote:

: Yes, I have thought of that, or even extending field type. But this does not
: work for my use case, since I can have multiple fields of a same type
: (therefore with the same field type, and same analyzer), but each one of them
: needs specific information. Therefore, I think the only "nice" way to achieve
: this is to have the possibility to add attributes to any field definition.

Right, at the moment custom FieldType classes can specify whatever attributes they want to use in the fieldType declaration -- but it's not possible to specify arbitrary attributes that can be used in the field declaration.

By all means, please open an issue requesting this as a feature.

I don't know that anyone explicitly set out to impose this limitation, but one of the reasons it likely exists is because SchemaField is not something that is intended to be customized -- while FieldType objects are constructed once at startup, SchemaField objects are frequently created on the fly when dealing with dynamicFields, so initialization complexity is kept to a minimum.

That said -- this definitely seems like the type of use case that we should try to find *some* solution for -- even if it just means having Solr automatically create hidden FieldType instances for you on startup based on attributes specified in the field declaration that the corresponding FieldType class understands.

-Hoss

--
http://lucenerevolution.org/ ... October 7-8, Boston
http://bit.ly/stump-hoss ... Stump The Chump!
How to enable solr MoreLikeThis
Hello all,

I am trying to follow the documentation given here http://wiki.apache.org/solr/MoreLikeThisHandler to enable MoreLikeThis in my application. However, when I execute any URL such as the one below

/solr/mlt?stream.body=social media in india&mlt.fl=content&mlt.mintf=0

I get a 404 error with the following message:

type: Status report
message: /solr/mlt
description: The requested resource (/solr/mlt) is not available.

I have made all the changes in solrconfig.xml to enable the mlt component. My Solr version is 1.4.0 833479. When I run regular /select/ queries, I get results as expected.

Any help on this would be greatly appreciated. I am not sure what else I need to do to make this work.

kind regards,
Titash
Re: How to enable solr MoreLikeThis
--- On Sat, 10/9/10, Titash Neogi wrote:

> From: Titash Neogi
> Subject: How to enable solr MoreLikeThis
> To: "solr-user@lucene.apache.org"
> Date: Saturday, October 9, 2010, 3:28 PM
>
> Hello all,
>
> I am trying to follow the documentation given here
> http://wiki.apache.org/solr/MoreLikeThisHandler to
> enable MoreLikeThis in my application. However, when I
> execute any URL such as the one below
>
> /solr/mlt?stream.body=social media in india&mlt.fl=content&mlt.mintf=0
>
> I get a 404 error with the following message
>
> *type* Status report
> *message* _/solr/mlt_
> *description* _The requested resource (/solr/mlt) is not available._
>
> I have made all change in solrconfig.xml to enable the mlt
> component. My solr version is 1.4.0 833479. When I run
> regular /select/ queries, I get results as expected.
> Any help on this would be appreciated greatly. I am not
> sure what else I need to do to make this work.

The /solr/mlt? syntax refers to a RequestHandler. You need to register a request handler named mlt in solrconfig.xml.
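A side note on the request itself: the stream.body value in the question contains spaces, so it will need URL-encoding once the handler responds. A minimal sketch, assuming a Solr root at http://localhost:8983/solr (the host, port and the helper name build_mlt_url are illustrative, not from the thread); the parameter names are the ones used in the original message. This only builds the URL, it does not fix the 404, which needs the handler registration described above.

```python
from urllib.parse import urlencode


def build_mlt_url(base_url, text, field="content", mintf=0):
    """Build a MoreLikeThis request URL with URL-encoded parameters.

    base_url is a hypothetical Solr root, e.g. http://localhost:8983/solr.
    """
    params = [
        ("stream.body", text),  # free text to find similar documents for
        ("mlt.fl", field),      # field used for the similarity lookup
        ("mlt.mintf", mintf),   # minimum term frequency
    ]
    # urlencode turns spaces into '+' and escapes reserved characters.
    return base_url.rstrip("/") + "/mlt?" + urlencode(params)


url = build_mlt_url("http://localhost:8983/solr", "social media in india")
print(url)
```

Building the query string from a list of pairs keeps the parameter order stable and avoids hand-escaping the text.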
Re: Speeding up solr indexing
Related. Can't be larger than -Xmx. :) Or even equal to -Xmx, because other things need to live in the heap. There is no exact function, so be more on the conservative side in order to avoid OOME. Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ - Original Message > From: Dennis Gearon > To: solr-user@lucene.apache.org > Sent: Sat, October 9, 2010 12:58:18 AM > Subject: Re: Speeding up solr indexing > > How does that have to work with Java's memory? > > In lockstep, a certain percentage, not related, what, or at all? > > > Dennis Gearon > > Signature Warning > > It is always a good idea to learn from your own mistakes. It is usually a >better idea to learn from others’ mistakes, so you do not have to make them >yourself. from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036' > > EARTH has a Right To Life, > otherwise we all die. > > > --- On Fri, 10/8/10, Otis Gospodnetic wrote: > > > From: Otis Gospodnetic > > Subject: Re: Speeding up solr indexing > > To: solr-user@lucene.apache.org > > Date: Friday, October 8, 2010, 9:13 PM > > Hi, > > > > Assuming your DB/network/something else is not the > > bottleneck, increase your > > ramBufferSizeMB (in solrconfig). > > > > Otis > > > > Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch > > Lucene ecosystem search :: http://search-lucene.com/ > > > > > > > > - Original Message > > > From: sivaprasad > > > To: solr-user@lucene.apache.org > > > Sent: Fri, October 8, 2010 2:59:45 PM > > > Subject: Speeding up solr indexing > > > > > > > > > Hi, > > > I am indexing the data using DIH.Data coming from > > mysql.Each document > > > contains 30 fields.Some of the fields are multi > > valued.When i am trying to > > > index 10 million records it taking more time to > > index. > > > > > > Any body has suggestions to speed up indexing > > process?Any suggestions on > > > solr admin level configurations? 
> > > > > > > > > Thanks, > > > JS > > > -- > > > View this message in context: > > >>http://lucene.472066.n3.nabble.com/Speeding-up-solr-indexing-tp1667054p1667054.html > > > > > > > Sent from the Solr - User mailing list archive > > at Nabble.com. > > > > > >
Re: dynamic "stop" words?
That might work, although depending on your use case it might be hard to have a good controlled vocabulary of city names (hotel metropole bruxelles, hotel metropole brussels, hotel metropole brussel, etc.). Also, 'hotel paris bruxelles' stinks...

Given your example:

> Doc 1
> name => "Holiday Inn"
> city => "Denver"
>
> Doc 2
> name => "Holiday Inn, Denver"
> city => "Denver"
>
> q=name:(Holiday Inn, Denver)

Turning it upside down, perhaps an alternative would be to query on:

q=name:Holiday Inn+city:Denver

and configure the field 'name' in such a way that doc1 and doc2 score the same. I believe that must be possible; I'm just not sure how to configure it exactly at the moment.

Of course, it depends on your scenario whether you have enough knowledge on the client side to transform:

q=name:(Holiday Inn, Denver)

to

q=name:Holiday Inn+city:Denver

Hth,
Geert-Jan

2010/10/9 Otis Gospodnetic

> Matt,
>
> The first thing that came to my mind is that this might be interesting to try
> with a dictionary (of city names) if this example is not a made-up one.
>
> Otis
>
> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> Lucene ecosystem search :: http://search-lucene.com/
>
> - Original Message
> > From: Matt Mitchell
> > To: solr-user@lucene.apache.org
> > Sent: Fri, October 8, 2010 11:22:36 AM
> > Subject: dynamic "stop" words?
> >
> > Is it possible to have certain query terms not effect score, if that
> > same query term is present in a field? For example, I have an index of
> > hotels. Each hotel has a name and city. If the name of a hotel has the
> > name of the city in it's "name" field, I want to completely ignore
> > that and not have it influence score.
> >
> > Example:
> >
> > Doc 1
> > name => "Holiday Inn"
> > city => "Denver"
> >
> > Doc 2
> > name => "Holiday Inn, Denver"
> > city => "Denver"
> >
> > q=name:(Holiday Inn, Denver)
> >
> > I'd like those docs to have the same score in the response.
I don't > > want Doc2 to have a higher score, just because it has all of the query > > terms. > > > > Is this possible without using stop words? I hope this makes sense! > > > > Thanks, > > Matt > > >
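The client-side transformation Geert-Jan describes could be sketched roughly as follows. This is only an illustration under stated assumptions: known_cities is a hypothetical controlled vocabulary of lower-cased, single-word city names, the function names are made up, and multi-word cities (or the spelling variants mentioned above) are not handled.

```python
import re


def split_city_terms(user_query, known_cities):
    """Split a free-text query into name terms and an optional city term.

    known_cities is a hypothetical set of lower-cased single-word city names.
    """
    name_terms, city = [], None
    for term in re.findall(r"\w+", user_query):
        if city is None and term.lower() in known_cities:
            city = term  # first recognised city goes to the city field
        else:
            name_terms.append(term)
    return name_terms, city


def build_query(user_query, known_cities):
    """Rewrite 'Holiday Inn, Denver' into separate name and city clauses."""
    name_terms, city = split_city_terms(user_query, known_cities)
    clauses = []
    if name_terms:
        clauses.append("name:(" + " ".join(name_terms) + ")")
    if city:
        clauses.append("city:" + city)
    return " ".join(clauses)


print(build_query("Holiday Inn, Denver", {"denver", "boston"}))
```

With the city moved out of the name clause, both example documents are matched on the same name terms; whether they then score identically still depends on how the name field is configured.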
Re: Speeding up solr indexing
Hi,

Please find the configurations below.

Machine configuration (Solr running here):
RAM - 4 GB
HardDisk - 180GB
OS - Red Hat Linux version 5
Processor - 2x Intel Core 2 Duo CPU @2.66GHz

Machine configuration (MySQL server is running here):
RAM - 4 GB
HardDisk - 180GB
OS - Red Hat Linux version 5
Processor - 2x Intel Core 2 Duo CPU @2.66GHz

MySQL server details:
MySQL version - MySQL 5.0.22

Solr configuration details:
false 20 100 2147483647 1 1000 1 single false 100 20 2147483647 1 false 10 1 6

Solr document details:
21 fields are indexed and stored.
3 fields are indexed only.
3 fields are stored only.
3 fields are indexed, stored and multivalued.
2 fields are indexed and multivalued.

I am also copying some of the indexed fields. Of these, 2 fields are multivalued and have thousands of values.

In the db-config file the main table contains 0.6 million records.

When I tested with the same records, indexing took 1 hr 30 min. In that case the table for one of the multivalued fields had no records. After putting data into this table, each main table record has thousands of records in it, and this field is indexed and stored. Indexing is now taking more than 24 hrs.

Solr is running on Tomcat 6.0.26, jdk1.6.0_17 and Solr 1.4.1.

I am using the JVM's default settings.

Why is this taking so much time? Does anybody have suggestions about where I am going wrong?

Thanks,
JS
--
View this message in context: http://lucene.472066.n3.nabble.com/Speeding-up-solr-indexing-tp1667054p1670737.html
Sent from the Solr - User mailing list archive at Nabble.com.
xml-aware highlighting
I have a requirement to highlight search results, and to display documents with matching terms highlighted in the context of the original XML document structure. It seems like this must be a very common use case, but I am having trouble finding a way to accomplish what we need to do using solr and/or lucene. Using the standard highlighting support in solr, we have been able to retrieve KWIC text fragments for search results, which is great. But what we would ideally like to do is to apply similar highlighting logic while preserving the original document structure. 1) When the user selects a matching document, we render it as HTML with paragraphs, headers, text styles such as italics, and so on, so we need to highlight either the rendered HTML or the original XML and then process that. We need to find the text fragments that matched the original query and highlight those. And this has to use the same logic used by solr/lucene to do the searching, so that the tokenization and analysis is applied properly, and query semantics are respected: if the original query was a phrase query, only phrases should match, and so on. 2) In addition, we also want to be able to display KWIC phrases that are rendered with type styles based on the original XML; this requires some XML tree surgery in order to pull out a fragment of a structured document while preserving enough xml structure to render type styles, which we can do, but it also requires a mapping of matching tokens back into the original document. I am hoping this is a solved problem, but if not, I'd also be interested in pointers to the best places to start an implementation. I think the problem at base is to maintain a map relating positions of matching terms in the indexed and stored field in lucene to corresponding positions in an original XML document. Ideally the original positions could be stored directly in term vectors, but they could also be translated at render/highlight time using an additional lookup. 
I see code in org.apache.lucene.search.highlight in solr and also something in lucene/contrib/highlighter. Is that the state of the art now, or is there anywhere else I should be looking as well? Thanks for any pointers -Mike Sokolov
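To make the "map relating positions" idea concrete, here is a minimal sketch that strips tags while recording, for every character of the extracted text, its offset in the original XML. It is not Lucene or Solr code: the regex-based stripping and both function names are assumptions for illustration, and entities, comments and CDATA sections are not handled.

```python
import re

TAG = re.compile(r"<[^>]*>")  # naive tag matcher, good enough for a sketch


def strip_tags_with_offsets(xml):
    """Return (text, offsets) where offsets[i] is the index in xml of text[i]."""
    chars, offsets = [], []
    pos = 0
    for match in TAG.finditer(xml):
        # Copy the text between the previous tag and this one.
        for i in range(pos, match.start()):
            chars.append(xml[i])
            offsets.append(i)
        pos = match.end()
    # Copy any trailing text after the last tag.
    for i in range(pos, len(xml)):
        chars.append(xml[i])
        offsets.append(i)
    return "".join(chars), offsets


def map_span_to_original(offsets, start, end):
    """Map a [start, end) span of the stripped text to a [start, end) span of the XML."""
    return offsets[start], offsets[end - 1] + 1


xml = "<p>Hello <i>big</i> world</p>"
text, offsets = strip_tags_with_offsets(xml)
start = text.index("big")
orig_start, orig_end = map_span_to_original(offsets, start, start + len("big"))
print(text, xml[orig_start:orig_end])
```

A match that crosses an element boundary maps to an original span that contains markup, so real highlighting would need to split such spans per text node before wrapping them.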
Re: xml-aware highlighting
OK - I read a bit more and it appears an appropriate analysis pipeline (which would extract text from XML using SAX, say) is all that's required, and existing highlighting ought to be able to accomplish what I'm after. So I guess the only question I have now before writing code is where is the existing implementation :) - anyone? -Mike On 10/9/2010 12:51 PM, Michael Sokolov wrote: I have a requirement to highlight search results, and to display documents with matching terms highlighted in the context of the original XML document structure. It seems like this must be a very common use case, but I am having trouble finding a way to accomplish what we need to do using solr and/or lucene. Using the standard highlighting support in solr, we have been able to retrieve KWIC text fragments for search results, which is great. But what we would ideally like to do is to apply similar highlighting logic while preserving the original document structure. 1) When the user selects a matching document, we render it as HTML with paragraphs, headers, text styles such as italics, and so on, so we need to highlight either the rendered HTML or the original XML and then process that. We need to find the text fragments that matched the original query and highlight those. And this has to use the same logic used by solr/lucene to do the searching, so that the tokenization and analysis is applied properly, and query semantics are respected: if the original query was a phrase query, only phrases should match, and so on. 2) In addition, we also want to be able to display KWIC phrases that are rendered with type styles based on the original XML; this requires some XML tree surgery in order to pull out a fragment of a structured document while preserving enough xml structure to render type styles, which we can do, but it also requires a mapping of matching tokens back into the original document. 
I am hoping this is a solved problem, but if not, I'd also be interested in pointers to the best places to start an implementation. I think the problem at base is to maintain a map relating positions of matching terms in the indexed and stored field in lucene to corresponding positions in an original XML document. Ideally the original positions could be stored directly in term vectors, but they could also be translated at render/highlight time using an additional lookup. I see code in org.apache.lucene.search.highlight in solr and also something in lucene/contrib/highlighter. Is that the state of the art now, or is there anywhere else I should be looking as well? Thanks for any pointers -Mike Sokolov
Re: Speeding up solr indexing
Looking at it, and not knowing how much memory the other processes on your box use (nor how much memory you have set aside for Java), I would start with DOUBLING your ram. Make sure that you have enough Java memory. You will know if it has some effect by using the 2:1 size ratio. 100mb for all that data is pretty small, I think.

Use the scientific method: change only one parameter at a time and check results.

It's always one of four things (in different order depending on task, but listed alphabetically here):
--
Memory (process assigned and/or actual physical memory)
Processor
Network Bandwidth
Hard Drive Bandwidth
(Sometimes you can add motherboard I/O paths also. As of this date, AMD has many more I/O paths in their consumer line of processors.)

In order of ease of experimenting with (easiest to hardest):
---
Application/process assigned memory
Physical memory
Network Bandwidth
HardDrive Bandwidth
    Screaming fast SCSI 15K rpm drives
    RAID arrays, casual
    RAID arrays, professional
    External DRAM drive 64 gig max/RAID them for more
Processor(s)
    Put maximum speed/cache size motherboard will take.
    Otherwise, USUALLY requires changing motherboard/HOSTING setup
I/O channels
    USUALLY requires changing motherboard/HOSTING setup

Dennis Gearon

Signature Warning

It is always a good idea to learn from your own mistakes. It is usually a better idea to learn from others’ mistakes, so you do not have to make them yourself. from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036'

EARTH has a Right To Life,
otherwise we all die.

--- On Sat, 10/9/10, sivaprasad wrote:

> From: sivaprasad
> Subject: Re: Speeding up solr indexing
> To: solr-user@lucene.apache.org
> Date: Saturday, October 9, 2010, 8:09 AM
>
> Hi,
> Please find the configurations below.
> > Machine configurations(Solr running here): > > RAM - 4 GB > HardDisk - 180GB > Os - Red Hat linux version 5 > Processor-2x Intel Core 2 Duo CPU @2.66GHz > > > > Machine configurations(Mysql server is running here): > RAM - 4 GB > HardDisk - 180GB > Os - Red Hat linux version 5 > Processor-2x Intel Core 2 Duo CPU @2.66GHz > > My sql Server deatils: > My sql version - Mysql 5.0.22 > > Solr configuration details: > > > > > false > > 20 > > > > > > 100 > > 2147483647 > > 1 > > 1000 > > 1 > > > > > > > > > single > > > > > > false > > 100 > 20 > > > > > 2147483647 > > 1 > > false > > > > class="solr.DirectUpdateHandler2"> > > 10 > > 1 > 6 > > > > > > > Solr document details: > > 21 fields are indexed and stored > 3 fileds are indexed only. > 3 fileds are stored only. > 3 fileds are indexed,stored and multi valued > 2 fileds indexed and multi valued > > And i am copying some of the indexed fileds.In this 2 > fileds are multivalued > and has thousands of values. > > In db-config-file the main table contains 0.6 million > records. > > When i tested for the same records, the index has taken 1hr > 30 min.In this > case one of the multivalued filed table doesn't have > records.After putting > data into this table,for each main table record , this > table has thousands > of records and this filed is indexed and stored.It is > taking more than 24 > hrs . > > Solr is running on tomcat 6.0.26, jdk1.6.0_17 and solr > 1.4.1 > > I am using JVM's default settings. > > Why this is taking this much time?Any body has suggestions, > where i am going > wrong. > > Thanks, > JS > -- > View this message in context: > http://lucene.472066.n3.nabble.com/Speeding-up-solr-indexing-tp1667054p1670737.html > Sent from the Solr - User mailing list archive at > Nabble.com. >
Re: multi level faceting
Hi,

there are two relatively similar solutions for this problem. I will describe one of them:

* create a multivalued string field called 'category'
* you have a category tree, so make sure a document gets not only the leaf category but all categories (name or id) up to the root
* now facet over category with '-1' as limit
* What if you want to display only the categories of one level (e.g. if you don't want the other levels at the same time, or if there are too many)? Then index the category field with the level as a prefix, ala "<level>_category", and use facet.prefix to filter the category list
* clicking on a category entry should result in a filter query ala fq=category:"selectedCategoryWithLevel". The slightly tricky part is that your UI or middle tier now has to parse the level, e.g. 2, and then append 2+1=3 to the query: facet.prefix=3_
* if you filter by level then one question remains:
  Q: how can you display the path from the selected category up to the root category?
  A: Either get the category parents via the DB (which is easy if you store the category ids in Solr) or get the parents from the parameter list, which is a bit more complicated (in this case I think it is best to store the category names in Solr).

(The second approach is: instead of using facet.prefix you can use dynamic fields ala category__s)

Is this explanation missing something, or unclear?

Kind Regards,
Peter.

> Hi,
>
> there is a solution without the patch. Here it should be explained:
> http://www.lucidimagination.com/blog/2010/08/11/stumped-with-solr-chris-hostetter-of-lucene-pmc-at-lucene-revolution/
>
> If not, I will do on 9.10.2010 ;-)
>
> Regards,
> Peter.
>
>> I've a similar problem with a project I'm working on now.
I am holding out >> for either SOLR-64 or SOLR-792 being a bit more mature before I need the >> functionality but if not I was thinking I could do multi-level faceting by >> indexing the data as a "String" like this: >> >> id: 1 >> SHOE: Sneakers|Men|Size 7 >> >> id: 2 >> SHOE: Sneakers|Men|Size 8 >> >> id: 3 >> SHOE: Sneakers|Women|Size 6 >> >> etc >> >> and then in the UI, show just up to the first delimiter (you'll have to sum >> the counts in the UI too). Once the user clicks on "Sneakers", you would >> then add fq=SHOE:Sneakers|* to the query and then show the values up to the >> 2nd delimiter, etc. >> >> Alternatively, if you didn't want to use a wildcard query, you could index >> each level separately like this: >> >> id: 1 >> SHOE1: Sneakers >> SHOE2: Sneakers|Men >> SHOE3: Sneakers|Men|Size 7 >> >> Then after the user clicks on the 1st level, fq on SHOE1 and show SHOE2, >> etc. This wouldn't work so well if you had more than a few levels in your >> hierarchy. >> >> I haven't actually tried this and like I said I'm hoping I could just use a >> patch (really I hope 3.x gets released GA with the functionality but I won't >> hold my breath...) But I do think this would work in a pinch if need be. >> >> James Dyer >> E-Commerce Systems >> Ingram Content Group >> (615) 213-4311 >> >> >> -Original Message- >> From: Nguyen, Vincent (CDC/OD/OADS) (CTR) [mailto:v...@cdc.gov] >> Sent: Tuesday, October 05, 2010 8:22 AM >> To: solr-user@lucene.apache.org >> Subject: RE: multi level faceting >> >> Just to clarify, the effect I was look for was this. 
>> >> Sneakers >>Men (22) >>Women (43) >> >> AFTER a user filters by one of those, they would be presented with a NEW >> facet field such as >> >> Sneakers >>Men >> Size 7 (10) >> Size 8 (11) >> Size 9 (23) >> >> Vincent Vu Nguyen >> >> -Original Message- >> From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com] >> Sent: Monday, October 04, 2010 11:44 PM >> To: solr-user@lucene.apache.org >> Subject: Re: multi level faceting >> >> Hi, >> >> I *think* this is not what Vincent was after. If I read the suggestions >> >> correctly, you are saying to use &fq=x&fq=y -- multiple fqs. >> But I think Vincent is wondering how to end up with something that will >> let him >> create a UI with multi-level facets (with a single request), e.g. >> >> Footwear (100) >> Sneakers (20) >> Men (1) >> Women (19) >> >> Dancing shoes (10) >> Men (0) >> Women (10) >> ... >> >> If this is what Vincent was after, I'd love to hear suggestions myself. >> :) >> >> Otis >> >> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch >> Lucene ecosystem search :: http://search-lucene.com/ >> >> >> >> - Original Message >> >> >>> From: Jason Brown >>> To: solr-user@lucene.apache.org >>> Sent: Mon, October 4, 2010 11:34:56 AM >>> Subject: RE: multi level faceting >>> >>> Yes, by adding fq back into the main query you will get results >>> >>> >> increasingly >> >> >>> filtered each time. >>> >>> You may run into an issue if you are displaying facet counts, as the >>> >>> >> facet >> >> >>> part of the query will also obey the increasingly filtered fq, and so >>>
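The facet.prefix approach Peter outlines at the top of this message can be sketched as below. The sketch assumes levels are numbered from 1 at the root and that each token has the form level, underscore, name; the function names and that numbering are illustrative choices, while the category field, the -1 limit, the fq form and facet.prefix come from the message.

```python
def level_tokens(path):
    """Expand a root-to-leaf category path into level-prefixed tokens.

    Every document would be indexed with all of these in the multivalued
    'category' field (levels assumed to start at 1 for the root).
    """
    return [f"{level}_{name}" for level, name in enumerate(path, start=1)]


def next_level_params(selected_token=None):
    """Build facet parameters that list only the level below the selection."""
    params = [("facet", "on"), ("facet.field", "category"), ("facet.limit", "-1")]
    if selected_token is None:
        # Nothing selected yet: show the top level only.
        params.append(("facet.prefix", "1_"))
    else:
        # Parse the level from the selected token and ask for level + 1.
        level = int(selected_token.split("_", 1)[0])
        params.append(("fq", f'category:"{selected_token}"'))
        params.append(("facet.prefix", f"{level + 1}_"))
    return params


print(level_tokens(["Footwear", "Sneakers", "Men"]))
print(next_level_params("2_Sneakers"))
```

Because the filter query restricts the result set to the selected category, the counts returned for the next level's prefix only cover documents under that category.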
"OR" facet queries?
I want to enable users to select multiple facet values for a specific facet field. For example, if "color" is a facet field, I'd like to let users select "red" OR "blue".

Please note, I've set the default query operator to AND because I want "q=hello+world" to mean that "hello" and "world" are AND'ed together.

1) What is the syntax for doing that? Can I implement it by putting "OR" within the fq clause? E.g.

&facet=on&facet.field=color&facet.field=size
&fq=color:(red OR blue)
&fq=size:(M OR L)

2) Is there a performance penalty associated with using "OR" on the facet values like that? If so, how much of a penalty?

Thanks
Re: xml-aware highlighting
> OK - I read a bit more and it > appears an appropriate analysis pipeline (which would > extract text from XML using SAX, say) is all that's > required, and existing highlighting ought to be able to > accomplish what I'm after. So I guess the only > question I have now before writing code is where is the > existing implementation :) - anyone? http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripCharFilterFactory may remove xml tags too.
Re: "OR" facet queries?
> I want to enable users to select
> multiple facet values for a specific facet fields. For
> example, if "color" is a facet field, I'd like to let users
> to select "red" OR "blue".
>
> Please note, I've set
>
> because I want "q=hello+world" means "hello" and "world"
> are AND'ed together.
>
> 1) What is the syntax of doing that? Can I implement that
> by putting "OR" within the fq clause?
> E.g.
> &facet=on&facet.field=color&facet.field=size
> &fq=color:(red OR blue)
> &fq=size:(M OR L)

Yes, you can do that with filter queries. You may find this interesting:
http://wiki.apache.org/solr/SimpleFacetParameters#Multi-Select_Faceting_and_LocalParams
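As a rough illustration of both points, the sketch below builds one OR'ed filter query per field and, following the multi-select pattern on the linked wiki page, tags each filter and excludes that tag on the matching facet so the unselected values keep their counts. The helper names and the choice to reuse the field name as the tag are assumptions made for this example.

```python
from urllib.parse import urlencode


def or_filter(field, values, tag=None):
    """Build an fq value that ORs several values of one field.

    When tag is given, the filter is tagged so a facet can exclude it.
    """
    quoted = " OR ".join('"' + v.replace('"', '\\"') + '"' for v in values)
    prefix = "{!tag=" + tag + "}" if tag else ""
    return prefix + field + ":(" + quoted + ")"


def build_params(query, selections):
    """selections maps a facet field to the list of values the user ticked."""
    params = [("q", query), ("facet", "on")]
    for field, values in selections.items():
        if values:
            # Tag the filter with the field name and exclude it on the facet.
            params.append(("fq", or_filter(field, values, tag=field)))
            params.append(("facet.field", "{!ex=" + field + "}" + field))
        else:
            params.append(("facet.field", field))
    return params


params = build_params("hello world", {"color": ["red", "blue"], "size": []})
print(urlencode(params))
```

Keeping each field in its own fq means the fields are AND'ed with each other while the values inside one field are OR'ed, which matches the example in the question.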
Re: xml-aware highlighting
Yes - that looks right; I was thrown a bit by the name - Thanks! On 10/9/2010 5:23 PM, Ahmet Arslan wrote: OK - I read a bit more and it appears an appropriate analysis pipeline (which would extract text from XML using SAX, say) is all that's required, and existing highlighting ought to be able to accomplish what I'm after. So I guess the only question I have now before writing code is where is the existing implementation :) - anyone? http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripCharFilterFactory may remove xml tags too.
Re: Sorting on arbitrary 'custom' fields
I'm confused. What do you mean that a user can "set any number of arbitrarily named fields on a document"? It sounds like you are talking about a user adding arbitrarily many entries to a multi-valued field. Or is it some kind of key:value pairs in a field in your schema?

Under any circumstances, sorting on a multi-valued field is...er... hard. What does sorting mean there? Sort by the first value entered? The second? The 15th? This is indeterminate behavior.

What is the over-arching problem you're addressing? I wonder if this is an XY problem. See: http://people.apache.org/~hossman/#xyproblem

Best
Erick

On Fri, Oct 8, 2010 at 8:15 PM, Simon Wistow wrote:

> On Fri, Oct 08, 2010 at 04:56:38PM -0700, kenf_nc said:
> >
> > What behavior are you trying to see? You are allowed to sort on fields that
> > are potentially empty, they just sort to the top or bottom depending on your
> > sort order. Now, if you Query on the fields that could be empty, you won't
> > see the result, but if your document is valid for the query, you can sort on
> > whatever field you want whether the document has that field or not.
>
> A user can set any number of arbitarily named fields on a document. We'd
> like to be able to sort by those fields.
>
> The problem is that users can set multiple arbitary fields and we may
> have thousands of them - it would be impractical for us to have these as
> actual fields in the schema.
>
> If I could sort on only the matching values of a multi valued field then
> this would be easy - I'd just collapse down key / value pairs to
> _ and then search for user_field:_*