solr index reusable with nutch?
Hi all, is it possible to directly use the solr index in nutch? My client is creating a portal search based on nutch. My project is also part of this portal, and ATM I prefer to go with solr instead of nutch since it is much better for my use case. Now the question is whether the portal search engine could use the solr index for my part of the portal. Can somebody point me to related documentation? TIA salu2 -- thorsten "Together we stand, divided we fall!" Hey you (Pink Floyd)
Re: solr index reusable with nutch?
Hi, Solr should be able to search any Lucene index, not just those created by Solr itself, as long as you configure it properly via schema.xml. Thus, you should be able to use Solr to search an index created by Nutch. Haven't tried it. It would be nice if you could contribute the configuration for doing this. Otis
'New' "Date Math" parsing code in Solr
(I have this nasty habit of committing cool things to Solr that should be announced on solr-user, and then deciding I'll wait until they are in a nightly snapshot before I send an email about them -- and then forgetting that I never sent the mail).

A while back I added some functionality to the DateField class which extends its parsing abilities so that it can recognize strings like "NOW", "NOW+1DAY", "NOW+1DAY-3HOURS", and even "NOW/MONTH+3MONTHS" which means round down to the start of the current month, then add three months.

You can play with this syntax and see what exactly it does with various inputs by looking at the "parsedquery" debug info for any date field. If you are running the example config from Solr, something like this will work...

http://localhost:8983/solr/select?version=2.1&q=field_dt%3A%5BNOW+TO+NOW%2FDAY%2B1MONTH%5D&start=0&rows=0&debugQuery=on

This syntax was added not only to make it easier to run quick data inspection queries like "startDate:[NOW TO *]" but also so that *relative* date based queries can be included directly in the default query options for request handlers configured in solrconfig.xml. For example, if you only want to let people search articles less than a month old, you could put...

pubDate:[NOW/DAY-1MONTH]

...in your requestHandler config, and that cached filter query will be reusable for 24 hours.

This syntax is supported anywhere Solr parses a DateField, so it can even be used in values when sending messages.

More info can be found in the javadocs...

http://incubator.apache.org/solr/docs/api/org/apache/solr/schema/DateField.html
http://incubator.apache.org/solr/docs/api/org/apache/solr/util/DateMathParser.html

-Hoss
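To make the rounding and offset semantics above concrete, here is a minimal Python sketch of a toy evaluator. It is not the DateMathParser code; it only models the behaviour described in this message. The function names, the set of supported units, and the choice to clamp the day when adding months are all assumptions made for illustration.

```python
import calendar
import re
from datetime import datetime, timedelta, timezone

# Units this sketch understands (singular or plural), largest first. Assumed set.
UNITS = ["YEAR", "MONTH", "DAY", "HOUR", "MINUTE", "SECOND"]
_SMALLER = ["month", "day", "hour", "minute", "second"]
_FLOOR = {"month": 1, "day": 1, "hour": 0, "minute": 0, "second": 0}
_TOKEN = re.compile(r"/(?P<round>[A-Z]+)|(?P<sign>[+-])(?P<n>\d+)(?P<unit>[A-Z]+)")


def _unit(name):
    """Normalise 'MONTHS' to 'MONTH' and reject unknown units."""
    if name.endswith("S") and name[:-1] in UNITS:
        name = name[:-1]
    if name not in UNITS:
        raise ValueError("unknown unit: " + name)
    return name


def round_down(dt, unit):
    """Reset every field smaller than `unit` to its lowest value."""
    reset = {name: _FLOOR[name] for name in _SMALLER[UNITS.index(unit):]}
    return dt.replace(microsecond=0, **reset)


def add(dt, amount, unit):
    """Add a signed amount of `unit` to dt."""
    if unit in ("YEAR", "MONTH"):
        months = dt.year * 12 + (dt.month - 1) + amount * (12 if unit == "YEAR" else 1)
        year, month0 = divmod(months, 12)
        # Assumption: clamp the day so Jan 31 + 1 MONTH stays inside February.
        last_day = calendar.monthrange(year, month0 + 1)[1]
        return dt.replace(year=year, month=month0 + 1, day=min(dt.day, last_day))
    seconds = {"DAY": 86400, "HOUR": 3600, "MINUTE": 60, "SECOND": 1}[unit]
    return dt + timedelta(seconds=amount * seconds)


def eval_date_math(expr, now):
    """Evaluate strings such as 'NOW/MONTH+3MONTHS' left to right against `now`."""
    if not expr.startswith("NOW"):
        raise ValueError("expression must start with NOW")
    rest, pos, result = expr[3:], 0, now
    while pos < len(rest):
        m = _TOKEN.match(rest, pos)
        if m is None:
            raise ValueError("cannot parse: " + rest[pos:])
        if m.group("round"):
            result = round_down(result, _unit(m.group("round")))
        else:
            sign = 1 if m.group("sign") == "+" else -1
            result = add(result, sign * int(m.group("n")), _unit(m.group("unit")))
        pos = m.end()
    return result


if __name__ == "__main__":
    fixed_now = datetime(2006, 12, 13, 8, 26, 51, tzinfo=timezone.utc)
    for text in ["NOW", "NOW+1DAY", "NOW+1DAY-3HOURS", "NOW/MONTH+3MONTHS"]:
        print(text, "->", eval_date_math(text, fixed_now).isoformat())
```

Operations apply strictly left to right, which is why "NOW/DAY-1MONTH" yields midnight of the same day one month earlier and stays constant for the whole day.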
Re: automatic index time field?
: Is there a way to automatically set a field when a document is indexed?
: Specifically, I'd like to have a date field updated to the current time when
: a document is indexed.

Your message reminded me that i never announced the new "Date Math" parsing code, which does let you say something like... NOW ...in your calls, but there is currently no way to have "default" values for fields in your schema ... it's on the wishlist, but no one is currently pursuing it as far as i know.

: I have a bunch of stuff stored in SQL, my plan is to:
: * note the current time

...the gist of your plan is sound, but to eliminate possible headaches from clock sync issues, instead of getting the "current time" from somewhere, i would query your index for all the docs (of the type you are interested in) sorted by date desc, and then note the date of the newest doc and later delete all docs with dates up to and including that one.

: My options are:
: 1) Send the index time along with the document.
: 2) extend UpdateHandler (DirectUpdateHandler2) to do this automatically
:
: 1) is the easiest but requires that everyone sending data sends a valid
: "index_time" field.
: 2) more complicated, but then we know everything has a valid "index_time"
: field.

As i said, you could just put "NOW" in all of your docs, but if you are interested in pursuing option#2, the most general purpose and reusable approach might be to add an optional default="value" attribute to the declarations in the schema.xml (relevant classes are SchemaField and IndexSchema) and then modify the DocumentBuilder.getDoc method to check for any default values of fields the Document doesn't already have values for and add them .. then your timestamp field becomes... ..but you can also have other default fields... ...etc.

-Hoss
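The default-value idea sketched in option #2 can be modelled in a few lines of Python. This is only an illustration of the proposed behaviour, not Solr code: the `SchemaField` dataclass below is a Python stand-in for the Java class of the same name, `apply_defaults` is a hypothetical helper playing the role of the suggested DocumentBuilder.getDoc change, and the `popularity` field is invented for the example.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class SchemaField:
    """Stand-in for a schema field declaration (not the real Solr class)."""
    name: str
    default: Optional[str] = None  # models the proposed default="value" attribute


def apply_defaults(doc: Dict[str, str], fields: Iterable[SchemaField]) -> Dict[str, str]:
    """Return a copy of `doc` with defaults added for fields it does not supply."""
    completed = dict(doc)
    for field in fields:
        if field.default is not None and field.name not in completed:
            completed[field.name] = field.default
    return completed


# Hypothetical schema: "index_time" comes from the thread, "popularity" is made up.
SCHEMA = [
    SchemaField("id"),
    SchemaField("index_time", default="NOW"),
    SchemaField("popularity", default="0"),
]

if __name__ == "__main__":
    print(apply_defaults({"id": "doc1"}, SCHEMA))
    print(apply_defaults({"id": "doc2", "popularity": "7"}, SCHEMA))
```

Values the client sends always win; a default is only used when the field is absent, so senders that already supply "index_time" are unaffected.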
Re: Strange Sorting results on a Text Field
Despite considerations of stemming and such for "text" type fields, is it the case that sorting will work if we have a single-valued "text" type field?

--tracey

On 9/11/06, Tom Weber <[EMAIL PROTECTED]> wrote:

Thanks also for the "multiValued" explanation, this is useful for my current application. But then, if I use this field and I ask for sorting, how will the sorting be done, alphanumeric on the first entry for this field? Until now, I entered more than one entry by separating them with a space in the same field, like text1 text2 text3.

Sorting is currently only supported when there is at most one value (or token) per document. This is a lucene restriction.

-Yonik
Re: automatic index time field?
thanks for the advice. I implemented option #2, followed the directions on: http://wiki.apache.org/solr/HowToContribute and made: http://issues.apache.org/jira/browse/SOLR-82

The only change I might make is to have the schema record whether it has any fields with default values, so that DocumentBuilder.getDoc() does not cycle through all fields if there aren't any.

Thanks
ryan

On 12/13/06, Chris Hostetter <[EMAIL PROTECTED]> wrote:
> As i said, you could just put "NOW" in all of your docs, but if you are interested in pursuing option#2, the most general purpose and reusable approach might be to add an optional default="value" attribute to the declarations in the schema.xml (relevant classes are SchemaField and IndexSchema) and then modify the DocumentBuilder.getDoc method to check for any default values of fields the Document doesn't already have values for and add them .. then your timestamp field becomes... ..but you can also have other default fields... ...etc.
>
> -Hoss
Case sensitivity on hostnames and email addresses
I've run into some unexpected case sensitivity on searches, at least unexpected by me. If you index a text field containing this sentence: A sentence containing CamelCase words by [EMAIL PROTECTED] is found at StudlyCaps.org The document will be found by searching for "camelcase" but not for "[EMAIL PROTECTED]" or "studlycaps.org". This happens with the Standard or the DisMax query handler. A bit of a problem for me, because I'm indexing a bunch of business magazines, and domain names are frequently capitalized, often in CamelCase. Is this maybe a bug? Or a WAD? -- Wade Leftwich Ithaca, NY
Re: Case sensitivity on hostnames and email addresses
When indexing (and searching), make sure you are using an Analyzer that lower-cases (or upper-cases) tokens. These are from Lucene, so Solr has them, too:

./src/java/org/apache/lucene/analysis/LowerCaseTokenizer.java
./src/java/org/apache/lucene/analysis/LowerCaseFilter.java

Otis
Re: Case sensitivity on hostnames and email addresses
Also, avoid stemming URLs. I used a stemmer that turned my "best.com" URL into "good.com". The Lucene StandardAnalyzer works pretty hard to avoid that.

--wunder

On 12/13/06 9:33 PM, "Otis Gospodnetic" <[EMAIL PROTECTED]> wrote:

> When indexing (and searching), make sure you are using an Analyzer that
> lower-cases (or upper-cases) tokens.
> These are from Lucene, so Solr has them, too:
> ./src/java/org/apache/lucene/analysis/LowerCaseTokenizer.java
> ./src/java/org/apache/lucene/analysis/LowerCaseFilter.java
>
> Otis
>
> ----- Original Message -----
> From: Wade Leftwich <[EMAIL PROTECTED]>
> To: solr-user@lucene.apache.org
> Sent: Wednesday, December 13, 2006 11:32:11 PM
> Subject: Case sensitivity on hostnames and email addresses
>
> I've run into some unexpected case sensitivity on searches, at least
> unexpected by me.
>
> If you index a text field containing this sentence:
>
> A sentence containing CamelCase words by [EMAIL PROTECTED] is found
> at StudlyCaps.org
>
> The document will be found by searching for "camelcase" but not for
> "[EMAIL PROTECTED]" or "studlycaps.org".
>
> This happens with the Standard or the DisMax query handler.
>
> A bit of a problem for me, because I'm indexing a bunch of business
> magazines, and domain names are frequently capitalized, often in CamelCase.
>
> Is this maybe a bug? Or a WAD?
>
> -- Wade Leftwich
> Ithaca, NY
Re: Case sensitivity on hostnames and email addresses
On 12/13/06, Wade Leftwich <[EMAIL PROTECTED]> wrote:
> I've run into some unexpected case sensitivity on searches, at least
> unexpected by me.
>
> If you index a text field containing this sentence:
>
> A sentence containing CamelCase words by [EMAIL PROTECTED] is found
> at StudlyCaps.org
>
> The document will be found by searching for "camelcase" but not for
> "[EMAIL PROTECTED]" or "studlycaps.org".
>
> This happens with the Standard or the DisMax query handler.
>
> A bit of a problem for me, because I'm indexing a bunch of business
> magazines, and domain names are frequently capitalized, often in CamelCase.

It's your text analysis configuration. The WordDelimiterFilter is doing this... it's so "CamelCase" can be found searching for "camelcase", "camel-case" or "camel case". It does this by detecting all the word parts and then indexing them separately as well as all catenated. So "CamelCase" is indexed as both "camelcase" and "camel case". When searching, the WordDelimiterFilter is configured to split only, so "camelcase", "camel-case", and "camel case" will all match.

When it hits something like [EMAIL PROTECTED], it would index it as "upanddownmysitecom" and "up and down mysite com". On the search side, a search of "[EMAIL PROTECTED]" is broken into "upanddown mysite com", which doesn't match anything indexed.

There are a number of options, not limited to
- create a new fieldtype and throw out the WordDelimiterFilter... the current "text" field type is for demonstration purposes only anyway. Solr, like Lucene, is meant to be customized.
- If you want to keep the camel-case flexibility, but not across "." and "-", then try using a letter tokenizer to throw away the non-letter characters first.
- create a specific filter for email or website addresses if no combination of existing filters does what you want.

Play around with the analysis tool on the admin page, it will help you understand what's going on.

-Yonik
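The index-side versus query-side asymmetry described above can be reproduced with a small Python model. This is a rough sketch, not the WordDelimiterFilter implementation: it handles one whitespace-free token at a time, ignores token positions and phrase matching, and the address `UpAndDown@mysite.com` is a hypothetical stand-in reconstructed from the token lists in this message (the real address is redacted in the archive).

```python
import re


def split_parts(token):
    """Split on non-alphanumerics and on lower-to-upper case changes, then lower-case."""
    parts = []
    for chunk in re.split(r"[^A-Za-z0-9]+", token):
        # "CamelCase" -> ["Camel", "Case"]; an empty chunk yields nothing.
        parts.extend(re.findall(r"[A-Z]+[a-z0-9]*|[a-z0-9]+", chunk))
    return [part.lower() for part in parts]


def index_tokens(token):
    """Index side: every word part, plus all parts catenated into one token."""
    parts = split_parts(token)
    tokens = set(parts)
    if len(parts) > 1:
        tokens.add("".join(parts))
    return tokens


def query_tokens(token):
    """Query side: split only, no catenation."""
    return split_parts(token)


def matches(indexed_token, query):
    """Simplified match: every query token must exist among the indexed tokens."""
    indexed = index_tokens(indexed_token)
    return all(tok in indexed for tok in query_tokens(query))


if __name__ == "__main__":
    print(sorted(index_tokens("CamelCase")))
    print(matches("CamelCase", "camel-case"))
    print(sorted(index_tokens("UpAndDown@mysite.com")))
    print(query_tokens("upanddown@mysite.com"))
    print(matches("UpAndDown@mysite.com", "upanddown@mysite.com"))
```

The lower-cased query produces the token "upanddown", which was never indexed: the index holds only the individual parts and the full catenation. That is the mismatch the analysis page on the admin screen will show.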
Re: Case sensitivity on hostnames and email addresses
Oh, and yet another way to get around it (with its own trade-offs) is to use something like fieldtype textTight in the example schema.xml, which catenates all word parts in both the index analyzer and query analyzer.

This would index as "upanddownmysitecom" and allow the following queries to match: "[EMAIL PROTECTED]", "[EMAIL PROTECTED]/com", "[EMAIL PROTECTED]"

The downside is that it would *not* allow "upanddown" or "UpAndDown" to match.

-Yonik

On 12/14/06, Yonik Seeley <[EMAIL PROTECTED]> wrote:
> There are a number of options, not limited to
> - create a new fieldtype and throw out the WordDelimiterFilter... the current "text" field type is for demonstration purposes only anyway. Solr, like Lucene, is meant to be customized.
> - If you want to keep the camel-case flexibility, but not across "." and "-", then try using a letter tokenizer to throw away the non-letter tokenizers first.
> - create a specific filter for email or website addresses if no combination of existing filters do what you want.
>
> Play around with the analysis tool on the admin page, it will help you understand what's going on.
>
> -Yonik
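The catenate-everything trade-off of a textTight-style field can be illustrated with a short Python sketch. It is a simplified model rather than the actual analyzer chain: `catenate_all` and `tight_match` are hypothetical helpers, and the addresses are made-up stand-ins for the redacted ones in the thread.

```python
import re


def catenate_all(token):
    """Drop every non-alphanumeric character and lower-case what remains."""
    return re.sub(r"[^A-Za-z0-9]+", "", token).lower()


def tight_match(indexed_token, query):
    """Index and query sides both catenate, so only the full string can match."""
    return catenate_all(indexed_token) == catenate_all(query)


if __name__ == "__main__":
    indexed = "UpAndDown@mysite.com"  # hypothetical address
    for query in ["upanddown@mysite.com", "up-and-down@mysite/com", "upanddown"]:
        print(query, "->", tight_match(indexed, query))
```

Punctuation and case differences no longer matter, but a query for only part of the address has nothing to match against, which is the downside noted above.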
Re: Strange Sorting results on a Text Field
: Despite considerations of stemming and such for "text"
: type fields, is it the case that
: if we have a single value "text" type field,
: will sorting work, though?

correct ... KeywordTokenizer with Filters of your choice should produce a sortable string of whatever form you desire.

-Hoss
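As a rough analogy in Python, treating the whole field value as one token and then applying filters gives exactly one sort key per document. The sketch below assumes a trim step followed by a lower-case step as the "Filters of your choice"; `sort_key` is a hypothetical helper and the sample titles are invented.

```python
def sort_key(value):
    """Keep the whole value as a single token, then trim and lower-case it."""
    return value.strip().lower()


if __name__ == "__main__":
    titles = ["banana split", "Apple Pie", "  cherry tart"]  # made-up field values
    print(sorted(titles))                 # raw string order
    print(sorted(titles, key=sort_key))   # order by the normalised single token
```

A tokenized "text" field would instead produce several tokens per value, so there would be no single string to sort a document by.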
Re: solr index reusable with nutch?
On Wed, 2006-12-13 at 07:45 -0800, Otis Gospodnetic wrote:
> Hi,
>
> Solr should be able to search any Lucene index,

ok, good to know. :) So can I guess that the same is true for nutch? Meaning the index solr is creating could be used by a nutch searcher.

> not just those created by Solr itself, as long as you configure it properly
> via schema.xml.

http://wiki.apache.org/solr/SchemaXml?highlight=%28schema%29

> Thus, you should be able to use Solr to search an index created by Nutch.

In my use case I need the reverse. Nutch searches the index created by my solr application. The application is just one component in the portal and the portal will provide a "global" search engine which should use the index from solr.

> Haven't tried it. It would be nice if you could contribute the
> configuration for doing this.

As I figure it out I will keep you informed. Thanks for the feedback.

salu2

> Otis
>
> ----- Original Message -----
> From: Thorsten Scherler <[EMAIL PROTECTED]>
> To: solr-user@lucene.apache.org
> Sent: Wednesday, December 13, 2006 8:26:51 AM
> Subject: solr index reusable with nutch?
>
> Hi all,
>
> is it possible to directly use the solr index in nutch?
>
> My client is creating a portal search based on nutch. In this portal
> there is as well my project and ATM I prefer to go with solr instead of
> nutch since it is much better for my use case.
>
> Now the question is whether the portal search engine could use the solr
> index for my part of the portal.
>
> Can somebody point me to related documentation?
>
> TIA
>
> salu2