Re: Multi-language indexing and searching
This sounds OK. I can create a field name mapping structure that rewrites the requests and responses so that my client doesn't need to be aware of the different fields.

Thanks for these directions,
Daniel

On 8/6/07 21:32, "Chris Hostetter" <[EMAIL PROTECTED]> wrote:

> : Can't I have the same index, using one single core, same field names being
> : processed by language specific components based on a field/parameter?
>
> yes, but you don't really need the complexity you describe below ... you
> don't need separate request handlers per language, just separate fields
> per language. assuming you care about 3 concepts: title, author, body ..
> in a single language index those might correspond to three fields, in your
> index they correspond to 3*N fields where N is the number of languages you
> want to support...
>
>    title_french
>    title_english
>    title_german
>    ...
>    author_french
>    author_english
>    ...
>
> documents which are in english only get values for the english fields,
> documents in french etc... ... unless perhaps you want to support
> "translations" of the documents in which case you can have values
> in fields for multiple languages, it's up to you. When a user wants to query
> in french, you take their input and query against the body_french field
> and display the title_french field, etc...
>
> -Hoss

http://www.bbc.co.uk/
This e-mail (and any attachments) is confidential and may contain personal views which are not the views of the BBC unless specifically stated.
If you have received it in error, please delete it from your system.
Do not use, copy or disclose the information in any way nor act in reliance on it and notify the sender immediately.
Please note that the BBC monitors e-mails sent or received.
Further communication will signify your consent to this.
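A rough sketch of the field name mapping idea from this message, assuming the `<concept>_<language>` naming used in the thread (title_french, body_french, and so on). The function names and the language-neutral `id` field are hypothetical, chosen only for illustration; this is not part of Solr.

```python
# Logical field names the client knows about (the three concepts named in the thread).
LOGICAL_FIELDS = ("title", "author", "body")


def to_index_field(field, lang):
    """Map a logical field name to its per-language field, e.g. title -> title_french."""
    if field not in LOGICAL_FIELDS:
        # Assumption: any other field (such as an id) is language-neutral.
        return field
    return f"{field}_{lang}"


def to_logical_doc(doc, lang):
    """Remove the language suffix from the field names of a result document."""
    suffix = "_" + lang
    out = {}
    for name, value in doc.items():
        base = name[: -len(suffix)] if name.endswith(suffix) else None
        if base in LOGICAL_FIELDS:
            out[base] = value
        else:
            out[name] = value
    return out


def build_query(user_input, lang):
    """Build select parameters that search body_<lang> and return title_<lang>."""
    return {
        "q": f"{to_index_field('body', lang)}:({user_input})",
        # "id" is a hypothetical unique key field, not taken from the thread.
        "fl": ",".join(["id", to_index_field("title", lang)]),
    }
```

With a layer like this, the client keeps asking for "title" and "body", and only the mapping code needs to know which language-specific fields exist in the index.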
Re: How does HTMLStripWhitespaceTokenizerFactory work?
Ok. Is it possible to get back the content without the html tags?

On 08/06/07, Yonik Seeley <[EMAIL PROTECTED]> wrote:
> On 6/8/07, Thierry Collogne <[EMAIL PROTECTED]> wrote:
> > I am trying to use the solr.HTMLStripWhitespaceTokenizerFactory analyzer
> > with no luck.
> [...]
> > Is this normal? Shouldn't the html code and the white spaces be removed from
> > the field?
>
> For indexing purposes, yes. The stored field you get back will be
> unchanged though. If you want to see what will be indexed, try the
> analysis debugger in the admin pages.
>
> -Yonik
Re: Multi-language indexing and searching
Hi Henri,

Thanks again, your considerations will surely help with my decision. Now I'll do my homework and check document volume and growth, expected index sizes and query load.

Regards,
Daniel Alheiros

On 9/6/07 10:53, "Henrib" <[EMAIL PROTECTED]> wrote:

> Hi Daniel,
> Trying to recap: you are indexing documents that can be in different
> languages. On the query side, users will only search in one language at a
> time & get results in that language.
>
> Setting aside the webapp deployment problem, the alternative is thus:
> option1: 1 schema with all fields of all languages pre-defined
> option2: 1 schema per lang with the same field names (but a different type).
>
> You indicate that your documents do have a field carrying the language. Is
> the Solr document format the authoring format of the documents you index or
> do they require some pre-processing to extract those fields? For instance,
> are the source documents in HTML and pre-processed using some XPath/magic to
> generate the fields?
> In that case, using option1, the pre-processing transformation needs to know
> which fields to generate according to the language. Option2 needs you to
> know which core you need to target based on the lang. And it goes the same
> way for querying; for option1, you need a query with different fields for
> each language, option2 requires you to target the correct core.
> In the other case, i.e. if the Solr document format is the source format,
> indexing requires some script (curl or else) to send them to Solr; having
> the script determine which core to target doesn't seem (from afar) a hard task
> (grep/awk to the rescue :-)).
>
> On the maintenance side, if you were to change the schema, need to reindex
> one lang or add a lang, option1 seems to have a 'wider' impact, the
> functional grain being coarser. Besides, if your collections are huge or
> grow fast, it might be nice to have an easy way to partition the workload on
> different machines which seems easier with option2, directing indexing &
> queries to a site based on the lang.
>
> On the webapp deployment side, option1 is a breeze, option2 requires
> multiple web-apps (forgetting the solr-215 patch that is unlikely to be reviewed
> and accepted soon since its functional value is not shared).
>
> Hope this helps in your choice, regards,
> Henri
>
> Daniel Alheiros wrote:
>>
>> Hi Henri.
>>
>> Thanks for your reply.
>> I've just looked at the patch you referred to, but doing this I will lose the
>> out of the box Solr installation... I'll have to create my own Solr
>> application responsible for creating the multiple cores and I'll have to
>> change my indexing process to something able to notify content for a
>> specific core.
>>
>> Can't I have the same index, using one single core, same field names being
>> processed by language specific components based on a field/parameter?
>>
>> I will try to draw what I'm thinking, please forgive me if I'm not using the
>> correct terms but I'm not an IR expert.
>>
>> Thinking in a workflow:
>> Indexing:
>>     Multilanguage indexer receives some documents
>>     for each document, verify the "language" field
>>         if language = "English" then process using the EnglishIndexer
>>         else if language = "Chinese" then process using the ChineseIndexer
>>         else if ...
>>
>> Querying:
>>     Multilanguage Request Handler receives a request
>>         if parameter language = "English" then process using the English Request Handler
>>         else if parameter language = "Chinese" then process using the Chinese Request Handler
>>         else if ...
>>
>> I can see that in the schema field definitions, we have some language
>> dependent parameters... It can be a problem, as I would like to have the
>> same fields for all requests...
>>
>> Sorry to bother, but before I split all my data this way I would like to be
>> sure that it's the best approach for me.
>>
>> Regards,
>> Daniel
>>
>>
>> On 8/6/07 15:15, "Henrib" <[EMAIL PROTECTED]> wrote:
>>
>>> Hi Daniel,
>>> If it is functionally 'ok' to search in only one lang at a time, you could
>>> try having one index per lang. Each per-lang index would have one schema
>>> where you would describe field types (the lang part coming through
>>> stemming/snowball analyzers, per-lang stopwords & al) and the same field
>>> name could be used in each of them.
>>> You could either deploy that solution through multiple web-apps (one per
>>> lang) (or try the patch for issue Solr-215).
>>> Regards,
>>> Henri
>>>
>>>
>>> Daniel Alheiros wrote:
>>>> Hi, I'm just starting to use Solr and so far, it has been a very interesting
>>>> learning process. I wasn't a Lucene user, so I'm learning a lot about both.
>>>> My problem is: I have to index and search content in several languages.
>>>> My scenario is a bit different fr
Re: How can I use dates to boost my results?
Hi Nick.

It was exactly what I was looking for.

Thanks,
Daniel

On 9/6/07 13:12, "Nick Jenkin" <[EMAIL PROTECTED]> wrote:

> Hi Daniel
> You can use a boosting function,
>
> In the dismax request handler insert the following:
>
> recip(rord(created),1,1000,1000)
>
> Obviously you will need to modify the values a bit, more info here:
> http://wiki.apache.org/solr/FunctionQuery
>
> -Nick
>
> On 6/9/07, Daniel Alheiros <[EMAIL PROTECTED]> wrote:
>> Hi
>>
>> For my search use, the document freshness is a relevant aspect that should
>> be considered to boost results.
>>
>> I have a field in my index like this:
>>
>>
>>
>> How can I make good use of this to boost my results?
>>
>> I'm using the DisMaxRequestHandler to boost other textual fields based on
>> the query, but it would improve the results quality a lot if the date were
>> considered to define the score.
>>
>>
>> Best Regards,
>> Daniel
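To get a feel for the numbers before tuning them, here is a small worked example of the function Nick suggests. It assumes the definition given on the FunctionQuery wiki page linked above, where recip(x,m,a,b) evaluates to a/(m*x+b), and that rord(created) is the reverse ordinal of the date, so the newest document has rord 1. The Python names are illustrative only and this is a sketch of the arithmetic, not of how Solr computes the final score.

```python
def recip(x, m, a, b):
    """Reciprocal function as described on the FunctionQuery wiki: a / (m*x + b)."""
    return a / (m * x + b)


def freshness_boost(rord):
    """Value of recip(rord(created),1,1000,1000) for a given reverse ordinal."""
    return recip(rord, 1, 1000, 1000)


if __name__ == "__main__":
    # Print the boost for the newest document and for progressively older ones.
    for rord in (1, 10, 100, 1000, 10000):
        print(rord, round(freshness_boost(rord), 4))
```

Under these assumptions the boost stays close to 1 for the most recent documents, drops to one half at the document whose reverse ordinal is 1000, and keeps decaying slowly after that. Raising the last two constants flattens the curve; lowering them makes freshness matter more.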
Re: How does HTMLStripWhitespaceTokenizerFactory work?
On 11-Jun-07, at 3:54 AM, Thierry Collogne wrote:

> Ok. Is it possible to get back the content without the html tags?

Well, it isn't stored anywhere in Solr. It's best to think of lucene/solr as two systems: the indexer applies a tokenization transformation to the data and creates an inverted index; the storage system keeps track of the data you give it _before_ analysis/tokenization.

If there is analysis you'd like to do that also applies to the stored status of the doc, it's probably easier to apply it before passing the data to Solr.

-MIke

> On 08/06/07, Yonik Seeley <[EMAIL PROTECTED]> wrote:
> > On 6/8/07, Thierry Collogne <[EMAIL PROTECTED]> wrote:
> > > I am trying to use the solr.HTMLStripWhitespaceTokenizerFactory analyzer
> > > with no luck.
> > [...]
> > > Is this normal? Shouldn't the html code and the white spaces be removed from
> > > the field?
> >
> > For indexing purposes, yes. The stored field you get back will be
> > unchanged though. If you want to see what will be indexed, try the
> > analysis debugger in the admin pages.
> >
> > -Yonik
Re: How does HTMLStripWhitespaceTokenizerFactory work?
: Ok. Is it possible to get back the content without the html tags?

Solr never does anything to modify the "stored" value of a field, so you'd really need to send Solr the value after stripping the HTML to get this to work. Internally, the HTMLStripWhitespaceTokenizerFactory does the HTML stripping as part of the tokenization process, so there is never a single markup-free value for the field in Solr.

-Hoss
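Since the stored value has to be cleaned on the client side, a minimal sketch of stripping the markup before the document is sent to Solr might look like the following. It uses only Python's standard html.parser module; the names `strip_html` and `_TextExtractor` are made up for this example. It is not equivalent to what HTMLStripWhitespaceTokenizerFactory does internally.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML fragment and ignores the tags."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into plain characters.
        super().__init__(convert_charrefs=True)
        self._chunks = []

    def handle_data(self, data):
        self._chunks.append(data)

    def text(self):
        # Join the chunks with a space and collapse runs of whitespace.
        return " ".join(" ".join(self._chunks).split())


def strip_html(markup):
    """Return the text content of an HTML string with the tags removed."""
    parser = _TextExtractor()
    parser.feed(markup)
    parser.close()
    return parser.text()


if __name__ == "__main__":
    print(strip_html("<p>Hello <b>world</b></p>"))
```

Two limitations worth knowing: joining text nodes with a space splits words that have a tag in the middle, and the content of script or style elements is kept as text. Both may need handling for real pages.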
Re: problem with schema.xml
I am having a similar(?) problem with 1.2 upgraded from an earlier incubator release. We upgraded by building the new war with ant and replacing jetty's webapps/solr.war -- changes to schema.xml are not taking effect when I exchange solr/conf/schema.xml for an updated one with a new field name="foobar" and restart by sending the solr java process a TERM and starting afresh...

-- Jonathan

On Fri, Jun 08, 2007 at 03:17:30PM -0400, [EMAIL PROTECTED] wrote:
> Hi Ryan,
>
> I have my .war file located outside the webapps folder (I am using multiple
> Solr instances with a config as suggested on the wiki:
> http://wiki.apache.org/solr/SolrTomcat).
>
> Nevertheless, I touched the .war file, the config file, the directory under
> webapps, but nothing seems to be working.
>
> Any other suggestions? Is someone else experiencing the same problem?
> thanks,
> mirko
>
>
> Quoting Ryan McKinley <[EMAIL PROTECTED]>:
>
> > I don't use tomcat, so I can't be particularly useful. The behavior you
> > describe does not happen with resin or jetty...
> >
> > My guess is that tomcat is caching the error state. Since fixing the
> > problem is outside the webapp directory, it does not think it has
> > changed so it stays in a broken state.
> >
> > if you "touch" the .war file, does it restart ok?
> >
> > but i'm just guessing...
LIUS/Fulltext indexing
Anyone have experience working with LIUS ( http://sourceforge.net/projects/lius/)? I can't seem to find any real documentation on it, even though it seems 'active' @ sourceforge. I need a way to index various types of fulltext, and LIUS seems very promising at first glance. What do you guys think? Is there a similar implementation you recommend, even something that might provide the simple text extraction functionality for the various types? I figure, I would need to do that anyways, and massage the text into Solr-type docs. Vish
question about sorting
Hi, My sorting fields include both TextField type and StrField type. Because TextField uses TokenizerFactory, they can't be sorted. I have to copy each TextField to a StrField and sort on those StrFields. Does anyone know if there is a better way to do that? Thanks Xuesong
Re: question about sorting
On 6/11/07, Xuesong Luo <[EMAIL PROTECTED]> wrote:
> My sorting fields include both TextField type and StrField type. Because
> TextField uses TokenizerFactory, they can't be sorted. I have to copy
> each TextField to a StrField and sort on those StrFields. Does anyone
> know if there is a better way to do that?

What information does this TextField carry? Sorting works on indexed field values, and thus needs to be single-valued per document.

-Yonik
Re: LIUS/Fulltext indexing
On 6/11/07, Vish D. <[EMAIL PROTECTED]> wrote:
> Anyone have experience working with LIUS
> (http://sourceforge.net/projects/lius/)? I can't seem to find any real
> documentation on it, even though it seems 'active' @ sourceforge. I need a
> way to index various types of fulltext, and LIUS seems very promising at
> first glance. What do you guys think? Is there a similar implementation you
> recommend, even something that might provide the simple text extraction
> functionality for the various types? I figure, I would need to do that
> anyways, and massage the text into Solr-type docs.

I think Tika will be the way forward (some of the code for Tika is coming from LIUS)

-Yonik
RE: question about sorting
For example, first name, department, job title etc.

Thanks
Xuesong

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Yonik Seeley
Sent: Monday, June 11, 2007 6:35 PM
To: solr-user@lucene.apache.org
Subject: Re: question about sorting

On 6/11/07, Xuesong Luo <[EMAIL PROTECTED]> wrote:
> My sorting fields include both TextField type and StrField type. Because
> TextField uses TokenizerFactory, they can't be sorted. I have to copy
> each TextField to a StrField and sort on those StrFields. Does anyone
> know if there is a better way to do that?

What information does this TextField carry? Sorting works on indexed field values, and thus needs to be single-valued per document.

-Yonik
fq with standard request handler
Hi,

I'm trying to use the 'fq' param (see http://wiki.apache.org/solr/CommonQueryParameters ) with the standard request handler, using a field that is defined as an integer (values 1 or 0), is indexed, and is stored.

For some reason, these two return no hits, even though I do have MyIntField with values 0 and 1 in the index:

http://localhost:8080/solr/select?q=birds&fq=MyIntField:0
http://localhost:8080/solr/select?q=birds&fq=MyIntField:1

So I tried these, just to see if that makes any difference:

http://localhost:8080/solr/select?q=birds%20AND%20MyIntField:0
http://localhost:8080/solr/select?q=birds%20AND%20MyIntField:1
http://localhost:8080/solr/select?q=birds%20AND%20MyIntField:[* TO *]

No go - no hits. Am I doing something obviously wrong? I'm using a Solr nightly from maybe a month ago. I don't recall seeing any bugs with the 'fq' param.

Thanks,
Otis
Re: fq with standard request handler
On 11-Jun-07, at 7:22 PM, Otis Gospodnetic wrote:

> I'm trying to use 'fq' param (see
> http://wiki.apache.org/solr/CommonQueryParameters ) with the standard
> request handler, using a field that is defined as an integer (values 1 or
> 0), is indexed, and is stored.
>
> For some reason, these two return no hits, even though I do have
> MyIntField with values 0 and 1 in the index:
>
> http://localhost:8080/solr/select?q=birds&fq=MyIntField:0
> http://localhost:8080/solr/select?q=birds&fq=MyIntField:1
>
> So I tried these, just to see if that makes any difference:
>
> http://localhost:8080/solr/select?q=birds%20AND%20MyIntField:0
> http://localhost:8080/solr/select?q=birds%20AND%20MyIntField:1
> http://localhost:8080/solr/select?q=birds%20AND%20MyIntField:[* TO *]
>
> No go - no hits. Am I doing something obviously wrong? I'm using a Solr
> nightly from maybe a month ago. I don't recall seeing any bugs with the
> 'fq' param.

er... since the second batch of queries returned no hits, does that not indicate that the problem _isn't_ with fq? You practically stripped it down to raw lucene territory here.

-MIke
Re: fq with standard request handler
: er... since the second batch of queries returned no hits, does that
: not indicate that the problem _isn't_ with fq? You practically
: stripped it down to raw lucene territory here.

yeah, i'm with mike ... if q=birds AND MyIntField:0 returns no hits, it doesn't surprise me that q=birds&fq=MyIntField:0 returns no hits.

what does q=MyIntField:0 return?

have you tried debugQuery=true & explainOther=(some query that matches a doc you expect to get from the main query) ?

-Hoss
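The checks suggested here can be turned into request URLs along these lines. This is only a sketch: the host, port and field name are taken from Otis's examples, the helper name `debug_url` is made up, and `id:42` stands for whatever query matches a document you expect to see (the `id` field is an assumption). The debugQuery and explainOther parameters are the ones named in the message above.

```python
from urllib.parse import urlencode

# Base URL taken from the examples earlier in the thread.
BASE = "http://localhost:8080/solr/select"


def debug_url(q, fq=None, explain_other=None):
    """Build a select URL with debugQuery=true and optional fq / explainOther."""
    params = [("q", q)]
    if fq:
        params.append(("fq", fq))
    params.append(("debugQuery", "true"))
    if explain_other:
        params.append(("explainOther", explain_other))
    # urlencode percent-encodes characters such as ':' and spaces.
    return BASE + "?" + urlencode(params)


if __name__ == "__main__":
    # 1. Does the integer field match anything on its own?
    print(debug_url("MyIntField:0"))
    # 2. Why is an expected document missing from the filtered query?
    print(debug_url("birds", fq="MyIntField:0", explain_other="id:42"))
```

Running the field-only query first separates "the field has no such indexed values" from "the filter is being applied incorrectly", which is the distinction the replies in this thread are trying to draw.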
Re: problem with schema.xml
: replacing jetty's webapps/solr.war -- changes to schema.xml are not
: taking place by the method of exchanging solr/conf/schema.xml for an
: updated one with a new field name="foobar" and restarting by sending
: the solr java process a TERM and starting afresh...

you're terminating the java process and starting a new one and it's not seeing the new file? That's, ... hmm, i really have no idea what that is.

Are you sure the first process is completely shutting down? If the old process is still listening on the port the new process won't start up cleanly. Other than that, i can't think of any reason why a new java process wouldn't see the new file.

-Hoss
Re: fq with standard request handler
On 6/11/07, Otis Gospodnetic <[EMAIL PROTECTED]> wrote:
> Hi,
>
> I'm trying to use 'fq' param (see
> http://wiki.apache.org/solr/CommonQueryParameters ) with the standard
> request handler, using a field that is defined as an integer (values 1 or
> 0), is indexed, and is stored.
>
> For some reason, these two return no hits, even though I do have
> MyIntField with values 0 and 1 in the index:
>
> http://localhost:8080/solr/select?q=birds&fq=MyIntField:0
> http://localhost:8080/solr/select?q=birds&fq=MyIntField:1
>
> So I tried these, just to see if that makes any difference:
>
> http://localhost:8080/solr/select?q=birds%20AND%20MyIntField:0
> http://localhost:8080/solr/select?q=birds%20AND%20MyIntField:1
> http://localhost:8080/solr/select?q=birds%20AND%20MyIntField:[* TO *]
>
> No go - no hits. Am I doing something obviously wrong? I'm using a Solr
> nightly from maybe a month ago. I don't recall seeing any bugs with the
> 'fq' param.

Further out in left field, perhaps a type mismatch with MyIntField? Did you index the data with Solr as well? Perhaps try faceting on MyIntField to see what the indexed values are...

-Yonik
Re: How does HTMLStripWhitespaceTokenizerFactory work?
Ok. Thanks for the clarification. We will do the stripping before the indexing.

On 11/06/07, Chris Hostetter <[EMAIL PROTECTED]> wrote:
> : Ok. Is it possible to get back the content without the html tags?
>
> Solr never does anything to modify the "stored" value of a field, so you'd
> really need to send Solr the value after stripping the HTML to get this to
> work. Internally, the HTMLStripWhitespaceTokenizerFactory does the HTML
> stripping as part of the tokenization process, so there is never a single
> markup-free value for the field in Solr.
>
> -Hoss
Re: LIUS/Fulltext indexing
On 6/12/07, Yonik Seeley <[EMAIL PROTECTED]> wrote:
> ... I think Tika will be the way forward (some of the code for Tika is
> coming from LIUS)...

Work has indeed started to incorporate the Lius code into Tika, see https://issues.apache.org/jira/browse/TIKA-7 and http://incubator.apache.org/projects/tika.html

-Bertrand