Re: Multi-language indexing and searching

2007-06-11 Thread Daniel Alheiros
This sounds OK.

I can create a field name mapping structure to change the requests /
responses in a way my client doesn't need to be aware of different fields.

Thanks for these directions,
Daniel


On 8/6/07 21:32, "Chris Hostetter" <[EMAIL PROTECTED]> wrote:

> 
> : Can't I have the same index, using one single core, same field names being
> : processed by language specific components based on a field/parameter?
> 
> yes, but you don't really need the complexity you describe below ... you
> don't need separate request handlers per language, just separate fields
> per language.  assuming you care about 3 concepts: title, author, body ..
> in a single language index those might correspond to three fields, in your
> index they correspond to 3*N fields where N is the number of languages you
> want to support...
> 
>    title_french
>    title_english
>    title_german
>    ...
>    author_french
>    author_english
>    ...
> 
> documents which are in english only get values for the english fields,
> documents in french only for the french fields, etc... unless perhaps you
> want to support "translations" of the documents, in which case you can have
> values in fields for multiple languages, it's up to you.  When a user wants
> to query in french, you take their input and query against the body_french
> field and display the title_french field, etc...
> 
> -Hoss
> 
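The per-language field scheme Hoss describes, together with the field name mapping mentioned at the top of this message, could be sketched on the client side roughly as follows. This is only a minimal Python sketch under stated assumptions: the function names (`physical_field`, `to_index_doc`, `from_result_doc`) and the `id` pass-through field are hypothetical; only the `<concept>_<language>` naming comes from the thread.

```python
# Hypothetical client-side mapping between logical field names
# (title, author, body) and per-language Solr fields (title_french, ...).
LOGICAL_FIELDS = ("title", "author", "body")


def physical_field(logical, language):
    """Return the per-language field name, e.g. ('title', 'french') -> 'title_french'."""
    return f"{logical}_{language}"


def to_index_doc(doc, language):
    """Rename logical fields to per-language fields before sending a document to Solr."""
    out = {}
    for key, value in doc.items():
        if key in LOGICAL_FIELDS:
            out[physical_field(key, language)] = value
        else:
            out[key] = value  # language-neutral fields (e.g. an id) pass through
    return out


def from_result_doc(doc, language):
    """Map per-language fields in a result back to logical names for the client."""
    suffix = f"_{language}"
    out = {}
    for key, value in doc.items():
        if key.endswith(suffix) and key[: -len(suffix)] in LOGICAL_FIELDS:
            out[key[: -len(suffix)]] = value
        else:
            out[key] = value
    return out
```

With a layer like this, the client keeps working with `title`, `author` and `body`, and only the mapping layer knows that a French query has to run against `body_french`.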


http://www.bbc.co.uk/
This e-mail (and any attachments) is confidential and may contain personal 
views which are not the views of the BBC unless specifically stated.
If you have received it in error, please delete it from your system.
Do not use, copy or disclose the information in any way nor act in reliance on 
it and notify the sender immediately.
Please note that the BBC monitors e-mails sent or received.
Further communication will signify your consent to this.



Re: How does HTMLStripWhitespaceTokenizerFactory work?

2007-06-11 Thread Thierry Collogne

Ok. Is it possible to get back the content without the html tags?

On 08/06/07, Yonik Seeley <[EMAIL PROTECTED]> wrote:


On 6/8/07, Thierry Collogne <[EMAIL PROTECTED]> wrote:
> I am trying to use the solr.HTMLStripWhitespaceTokenizerFactory analyzer
> with no luck.
[...]
> Is this normal? Shouldn't the html code and the white spaces be removed from
> the field?

For indexing purposes, yes.  The stored field you get back will be
unchanged though.
If you want to see what will be indexed, try the analysis debugger in
the admin pages.

-Yonik



Re: Multi-language indexing and searching

2007-06-11 Thread Daniel Alheiros
Hi Henri,

Thanks again, your considerations will surely help with my decision.
Now I'll do my homework to check document volume / growth - expected index
sizes and query load.

Regards,
Daniel Alheiros


On 9/6/07 10:53, "Henrib" <[EMAIL PROTECTED]> wrote:

> 
> Hi Daniel,
> Trying to recap: you are indexing documents that can be in different
> language. On the query side, users will only search in one language at a
> time & get results in that language.
> 
> Setting aside the webapp deployment problem, the alternative is thus:
> option1: 1 schema with all fields of all languages pre-defined
> option2: 1 schema per lang with the same field names (but a different type).
> 
> You indicate that your documents do have a field carrying the language. Is
> the Solr document format the authoring format of the documents you index or
> do they require some pre-processing to extract those fields? For instance,
> are the source documents in HTML and pre-processed using some XPath/magic to
> generate the fields?
> In that case, using option1, the pre-processing transformation needs to know
> which fields to generate according to the language. Option2 needs you to
> know which core you need to target based on the lang. And it goes the same
> way for querying; for option1, you need a query with different fields for
> each language, option2 requires to target the correct core.
> In the other case, ie if the Solr document format is the source format,
> indexing requires some script (curl or else) to send them to Solr; having
> the script determine which core to target doesn't seem (from far) a hard task
> (grep/awk  to the rescue :-)).
> 
> On the maintenance side, if you were to change the schema, need to reindex
> one lang or add a lang, option1 seems to have a 'wider' impact, the
> functional grain being coarser. Besides, if your collections are huge or
> grow fast, it might be nice to have an easy way to partition the workload on
> different machines which seems easier with option2, directing indexing &
> queries to a site based on the lang.
> 
> On the webapp deployment side, option1 is a breeze, option2 requires
> multiple web-apps (forgetting the solr-215 patch, which is unlikely to be
> reviewed and accepted soon since its functional value is not shared).
> 
> Hope this helps in your choice, regards,
> Henri
> 
> 
> 
> 
> 
> 
> 
> Daniel Alheiros wrote:
>> 
>> Hi Henri.
>> 
>> Thanks for your reply.
>> I've just looked at the patch you referred to, but doing this I will lose the
>> out of the box Solr installation... I'll have to create my own Solr
>> application responsible for creating the multiple cores and I'll have to
>> change my indexing process to something able to notify content for a
>> specific core.
>> 
>> Can't I have the same index, using one single core, same field names being
>> processed by language specific components based on a field/parameter?
>> 
>> I will try to draw what I'm thinking, please forgive me if I'm not using
>> the correct terms but I'm not an IR expert.
>> 
>> Thinking in a workflow:
>> Indexing:
>>     Multilanguage indexer receives some documents
>>     for each document, verify the "language" field
>>         if language = "English" then process using the EnglishIndexer
>>         else if language = "Chinese" then process using the ChineseIndexer
>>         else if ...
>> 
>> Querying:
>>     Multilanguage Request Handler receives a request
>>         if parameter language = "English" then process using the English Request Handler
>>         else if parameter language = "Chinese" then process using the Chinese Request Handler
>>         else if ...
>> I can see that in the schema field definitions, we have some language
>> dependent parameters... It can be a problem, as I would like to have the
>> same fields for all requests...
>> 
>> Sorry to bother, but before I split all my data this way I would like to
>> be sure that it's the best approach for me.
>> 
>> Regards,
>> Daniel
>> 
>> 
>> On 8/6/07 15:15, "Henrib" <[EMAIL PROTECTED]> wrote:
>> 
>>> 
>>> Hi Daniel,
>>> If it is functionally 'ok' to search in only one lang at a time, you could
>>> try having one index per lang. Each per-lang index would have one schema
>>> where you would describe field types (the lang part coming through
>>> stemming/snowball analyzers, per-lang stopwords & al) and the same field
>>> name could be used in each of them.
>>> You could either deploy that solution through multiple web-apps (one per
>>> lang) (or try the patch for issue Solr-215).
>>> Regards,
>>> Henri
>>> 
>>> 
>>> Daniel Alheiros wrote:
 
 Hi, 
 
 I'm just starting to use Solr and so far, it has been a very interesting
 learning process. I wasn't a Lucene user, so I'm learning a lot about
 both.
 
 My problem is:
 I have to index and search content in several languages.
 
 My scenario is a bit different fr

Re: How can I use dates to boost my results?

2007-06-11 Thread Daniel Alheiros
Hi Nick.

It was exactly what I was looking for.

Thanks,
Daniel


On 9/6/07 13:12, "Nick Jenkin" <[EMAIL PROTECTED]> wrote:

> Hi Daniel
> You can use a boosting function,
> 
> In the dismax request handler insert the following:
> 
> 
> recip(rord(created),1,1000,1000)
> 
> 
> Obviously you will need to modify the values a bit, more info here:
> http://wiki.apache.org/solr/FunctionQuery
> 
> -Nick
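To see why the constants need tuning, it may help to evaluate the function by hand. In Solr's function query syntax, `recip(x,m,a,b)` computes `a / (m*x + b)`, and `rord(field)` is the reverse ordinal of the field's indexed value, so the newest `created` value gets 1. The Python sketch below only imitates that arithmetic; the sample dates and the toy `rord` helper are assumptions for illustration. In dismax such a function is normally supplied through the `bf` (boost functions) parameter; the exact configuration snippet Nick posted did not survive in the archive.

```python
def recip(x, m, a, b):
    """Reciprocal function as documented for Solr function queries: a / (m*x + b)."""
    return a / (m * x + b)


def rord(values):
    """Toy stand-in for rord(field): the largest value gets ordinal 1, the next 2, ..."""
    ordered = sorted(set(values), reverse=True)
    return {value: rank for rank, value in enumerate(ordered, start=1)}


# Hypothetical 'created' values; ISO dates sort correctly as strings.
created = ["2007-06-11", "2007-06-01", "2006-01-15"]
ranks = rord(created)
for value in created:
    print(value, ranks[value], recip(ranks[value], 1, 1000, 1000))
```

With `recip(rord(created),1,1000,1000)` the newest document gets 1000/1001 and a document whose date has reverse ordinal 1000 gets exactly 0.5, so the boost falls off slowly. Smaller values of `a` and `b` make the decay steeper for an index with few distinct dates.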
> 
> On 6/9/07, Daniel Alheiros <[EMAIL PROTECTED]> wrote:
>> Hi
>> 
>> For my search use, the document freshness is a relevant aspect that should
>> be considered to boost results.
>> 
>> I have a field in my index like this:
>> 
>> 
>> 
>> How can I make a good use of this to boost my results?
>> 
>> I'm using the DisMaxRequestHandler to boost other textual fields based on
>> the query, but it would improve the results quality a lot if the date were
>> considered to define the score.
>> 
>> 
>> Best Regards,
>> Daniel
>> 
>> 
>> 
>> 





Re: How does HTMLStripWhitespaceTokenizerFactory work?

2007-06-11 Thread Mike Klaas

On 11-Jun-07, at 3:54 AM, Thierry Collogne wrote:


Ok. Is it possible to get back the content without the html tags?



Well, it isn't stored anywhere in Solr.  It's best to think of lucene/solr
as two systems: the indexer applies a tokenization transformation to the
data and creates an inverted index; the storage system keeps track of the
data you give it _before_ analysis/tokenization.  If there is analysis you'd
like to do that also applies to the stored status of the doc, it's probably
easier to apply it before passing the data to Solr.


-MIke
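Stripping the markup before the document is sent to Solr, as suggested above, could look roughly like this on the client. This is a minimal sketch using only Python's standard `html.parser`; the names `strip_html` and `_TextExtractor` are hypothetical, and it makes no attempt to mimic what HTMLStripWhitespaceTokenizerFactory does internally.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes and ignores the tags themselves."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(markup):
    """Return the text content of markup with runs of whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(markup)
    parser.close()
    # Simplification: text nodes are joined with a space, so a tag in the
    # middle of a word would split that word in two.
    return " ".join(" ".join(parser.parts).split())
```

The cleaned string can then go into the stored field, so the value Solr returns is already free of tags. Note that this sketch keeps the contents of `script` and `style` elements; real input may need those filtered out as well.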


On 08/06/07, Yonik Seeley <[EMAIL PROTECTED]> wrote:


On 6/8/07, Thierry Collogne <[EMAIL PROTECTED]> wrote:
> I am trying to use the solr.HTMLStripWhitespaceTokenizerFactory analyzer
> with no luck.
[...]
> Is this normal? Shouldn't the html code and the white spaces be removed from
> the field?

For indexing purposes, yes.  The stored field you get back will be
unchanged though.
If you want to see what will be indexed, try the analysis debugger in
the admin pages.

-Yonik





Re: How does HTMLStripWhitespaceTokenizerFactory work?

2007-06-11 Thread Chris Hostetter

: Ok. Is it possible to get back the content without the html tags?

Solr never does anything to modify the "stored" value of a field, so you'd
really need to send Solr the value after stripping the HTML to get this to
work.

Internally, the HTMLStripWhitespaceTokenizerFactory does the HTML
stripping as part of the tokenization process, so there is never a
single markup free value for the field in Solr.





-Hoss



Re: problem with schema.xml

2007-06-11 Thread Jonathan Traylor
I am having a similar(?) problem with 1.2 upgraded from an earlier
incubator release. We upgraded by building the new war with ant and
replacing jetty's webapps/solr.war -- changes to schema.xml are not
taking place by the method of exchanging solr/conf/schema.xml for an
updated one with a new field name="foobar" and restarting by sending
the solr java process a TERM and starting afresh... 

--
Jonathan

On Fri, Jun 08, 2007 at 03:17:30PM -0400, [EMAIL PROTECTED] wrote:
> Hi Ryan,
> 
> I have my .war file located outside the webapps folder (I am using multiple
> Solr instances with a config as suggested on the wiki:
> http://wiki.apache.org/solr/SolrTomcat).
> 
> Nevertheless, I touched the .war file, the config file, the directory under
> webapps, but nothing seems to be working.
> 
> Any other suggestions?  Is someone else experiencing the same problem?
> thanks,
> mirko
> 
> 
> Quoting Ryan McKinley <[EMAIL PROTECTED]>:
> 
> > I don't use tomcat, so I can't be particularly useful.  The behavior you
> > describe does not happen with resin or jetty...
> >
> > My guess is that tomcat is caching the error state.  Since fixing the
> > problem is outside the webapp directory, it does not think it has
> > changed so it stays in a broken state.
> >
> > if you "touch" the .war file, does it restart ok?
> >
> > but i'm just guessing...
> >
> >


LIUS/Fulltext indexing

2007-06-11 Thread Vish D.

Anyone have experience working with LIUS (
http://sourceforge.net/projects/lius/)? I can't seem to find any real
documentation on it, even though it seems 'active' @ sourceforge. I need a
way to index various types of fulltext, and LIUS seems very promising at
first glance. What do you guys think? Is there a similar implementation you
recommend, even something that might provide the simple text extraction
functionality for the various types? I figure, I would need to do that
anyways, and massage the text into Solr-type docs.

Vish


question about sorting

2007-06-11 Thread Xuesong Luo
Hi,
My sorting fields include both TextField type and StrField type. Because
TextField uses TokenizerFactory, they can't be sorted. I have to copy
each TextField to a StrField and sort on those StrFields. Does anyone
know if there is a better way to do that?

Thanks
Xuesong



Re: question about sorting

2007-06-11 Thread Yonik Seeley

On 6/11/07, Xuesong Luo <[EMAIL PROTECTED]> wrote:

My sorting fields include both TextField type and StrField type. Because
TextField uses TokenizerFactory, they can't be sorted. I have to copy
each TextField to a StrField and sort on those StrFields. Does anyone
know if there is a better way to do that?


What information does this TextField carry?
Sorting works on indexed field values, and thus needs to be
single-valued per document.

-Yonik
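The single-valued requirement can be illustrated outside Solr. In the Python sketch below, the whitespace tokenizer and the sample job titles are assumptions for illustration only, not Solr's actual TextField analysis chain.

```python
def whitespace_tokenize(value):
    """Toy analyzer: lowercase and split on whitespace."""
    return value.lower().split()


job_titles = {"doc1": "Senior Software Engineer", "doc2": "Account Manager"}

# Tokenized field: several indexed terms per document, so there is no
# single value to use as a sort key.
tokenized = {doc_id: whitespace_tokenize(v) for doc_id, v in job_titles.items()}

# Untokenized copy (what a StrField copy provides): exactly one indexed
# value per document, which sorts unambiguously.
untokenized = dict(job_titles)
sorted_ids = sorted(untokenized, key=untokenized.get)
```

Here `tokenized["doc1"]` holds three terms while `untokenized["doc1"]` holds one value, which is why sorting is done on the untokenized copy.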


Re: LIUS/Fulltext indexing

2007-06-11 Thread Yonik Seeley

On 6/11/07, Vish D. <[EMAIL PROTECTED]> wrote:

Anyone have experience working with LIUS (
http://sourceforge.net/projects/lius/)? I can't seem to find any real
documentation on it, even though it seems 'active' @ sourceforge. I need a
way to index various types of fulltext, and LIUS seems very promising at
first glance. What do you guys think? Is there a similar implementation you
recommend, even something that might provide the simple text extraction
functionality for the various types? I figure, I would need to do that
anyways, and massage the text into Solr-type docs.


I think Tika will be the way forward (some of the code for Tika is
coming from LIUS)

-Yonik


RE: question about sorting

2007-06-11 Thread Xuesong Luo
For example, first name, department, job title etc.

Thanks
Xuesong

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Yonik
Seeley
Sent: Monday, June 11, 2007 6:35 PM
To: solr-user@lucene.apache.org
Subject: Re: question about sorting

On 6/11/07, Xuesong Luo <[EMAIL PROTECTED]> wrote:
> My sorting fields include both TextField type and StrField type. Because
> TextField uses TokenizerFactory, they can't be sorted. I have to copy
> each TextField to a StrField and sort on those StrFields. Does anyone
> know if there is a better way to do that?

What information does this TextField carry?
Sorting works on indexed field values, and thus needs to be
single-valued per document.

-Yonik




fq with standard request handler

2007-06-11 Thread Otis Gospodnetic
Hi,

I'm trying to use 'fq' param (see 
http://wiki.apache.org/solr/CommonQueryParameters ) with the standard request 
handler, using a field that is defined as an integer (values 1 or 0), is 
indexed, and is stored.  For some reason, these two return no hits, even though 
I do have MyIntField with values 0 and 1 in the index:

  http://localhost:8080/solr/select?q=birds&fq=MyIntField:0
  http://localhost:8080/solr/select?q=birds&fq=MyIntField:1


So I tried these, just to see if that makes any difference:
  http://localhost:8080/solr/select?q=birds%20AND%20MyIntField:0
  http://localhost:8080/solr/select?q=birds%20AND%20MyIntField:1
  http://localhost:8080/solr/select?q=birds%20AND%20MyIntField:[* TO *]

No go - no hits.  Am I doing something obviously wrong?  I'm using a Solr 
nightly from maybe a month ago.  I don't recall seeing any bugs with the 'fq' 
param.

Thanks,
Otis




Re: fq with standard request handler

2007-06-11 Thread Mike Klaas

On 11-Jun-07, at 7:22 PM, Otis Gospodnetic wrote:

I'm trying to use 'fq' param (see
http://wiki.apache.org/solr/CommonQueryParameters ) with the standard request
handler, using a field that is defined as an integer (values 1 or 0), is
indexed, and is stored.  For some reason, these two return no hits, even though
I do have MyIntField with values 0 and 1 in the index:


  http://localhost:8080/solr/select?q=birds&fq=MyIntField:0
  http://localhost:8080/solr/select?q=birds&fq=MyIntField:1


So I tried these, just to see if that makes any difference:
  http://localhost:8080/solr/select?q=birds%20AND%20MyIntField:0
  http://localhost:8080/solr/select?q=birds%20AND%20MyIntField:1
  http://localhost:8080/solr/select?q=birds%20AND%20MyIntField:[* TO *]


No go - no hits.  Am I doing something obviously wrong?  I'm using  
a Solr nightly from maybe a month ago.  I don't recall seeing any  
bugs with the 'fq' param.


er... since the second batch of queries returned no hits, does that  
not indicate that the problem _isn't_ with fq?  You practically  
stripped it down to raw lucene territory here.


-MIke


Re: fq with standard request handler

2007-06-11 Thread Chris Hostetter

: er... since the second batch of queries returned no hits, does that
: not indicate that the problem _isn't_ with fq?  You practically
: stripped it down to raw lucene territory here.

yeah, i'm with mike ... if q=birds AND MyIntField:0 returns no hits, it
doesn't surprise me that q=birds&fq=MyIntField:0 returns no hits.  what
does q=MyIntField:0 return?

have you tried debugQuery=true & explainOther=(some query that matches a
doc you expect to get from the main query)  ?


-Hoss
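The two checks suggested above can be turned into request URLs with proper percent-encoding, for example with the Python sketch below. The host, port and field name come from the thread; the helper `select_url` and the placeholder `id:SOME_DOC_ID` are hypothetical and need to be replaced with a query that matches a document you expect to see.

```python
from urllib.parse import urlencode

BASE_URL = "http://localhost:8080/solr/select"  # host and port taken from the thread


def select_url(**params):
    """Build a /select URL with percent-encoded query parameters."""
    return BASE_URL + "?" + urlencode(params)


# 1. Does the field match anything on its own?
field_only = select_url(q="MyIntField:0")

# 2. Ask for debugging output, including an explanation for a document
#    that was expected to match. "id:SOME_DOC_ID" is a placeholder.
debug = select_url(
    q="birds AND MyIntField:0",
    debugQuery="true",
    explainOther="id:SOME_DOC_ID",
)

print(field_only)
print(debug)
```

If the first URL already returns nothing, the issue lies with how MyIntField was indexed rather than with `fq` or the boolean query.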



Re: problem with schema.xml

2007-06-11 Thread Chris Hostetter
: replacing jetty's webapps/solr.war -- changes to schema.xml are not
: taking place by the method of exchanging solr/conf/schema.xml for an
: updated one with a new field name="foobar" and restarting by sending
: the solr java process a TERM and starting afresh...

you're terminating the java process and starting a new one and it's not
seeing the new file?  That's, ... hmm, i really have no idea what that is.

Are you sure the first process is completely shutting down?  If the old
process is still listening on the port the new process won't start up
cleanly.  Other than that, i can't think of any reason why a new java
process wouldn't see the new file.


-Hoss
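One quick way to test the "old process still listening" theory is to probe the port after sending TERM and before starting the new process. The Python sketch below is a generic TCP check, not anything Solr or Jetty specific; the function name `port_in_use` is hypothetical, and the port to probe is whatever your servlet container is configured to use.

```python
import socket


def port_in_use(host, port, timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an error code otherwise.
        return sock.connect_ex((host, port)) == 0
```

If `port_in_use("localhost", <your port>)` is still true after the old process was told to stop, the old JVM has not fully shut down and the restart did not really happen.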



Re: fq with standard request handler

2007-06-11 Thread Yonik Seeley

On 6/11/07, Otis Gospodnetic <[EMAIL PROTECTED]> wrote:

Hi,

I'm trying to use 'fq' param (see 
http://wiki.apache.org/solr/CommonQueryParameters ) with the standard request 
handler, using a field that is defined as an integer (values 1 or 0), is 
indexed, and is stored.  For some reason, these two return no hits, even though 
I do have MyIntField with values 0 and 1 in the index:

  http://localhost:8080/solr/select?q=birds&fq=MyIntField:0
  http://localhost:8080/solr/select?q=birds&fq=MyIntField:1


So I tried these, just to see if that makes any difference:
  http://localhost:8080/solr/select?q=birds%20AND%20MyIntField:0
  http://localhost:8080/solr/select?q=birds%20AND%20MyIntField:1
  http://localhost:8080/solr/select?q=birds%20AND%20MyIntField:[* TO *]

No go - no hits.  Am I doing something obviously wrong?  I'm using a Solr 
nightly from maybe a month ago.  I don't recall seeing any bugs with the 'fq' 
param.


Further out in left field, perhaps a type mismatch with MyIntField?
Did you index the data with Solr as well?
Perhaps try faceting on MyIntField to see what the indexed values are...

-Yonik
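A faceting request of the kind suggested here could be built as follows. This Python sketch assumes the simple facet parameters `facet` and `facet.field` are available for the request handler in use, and reuses the host, port and field name from the thread; `rows=0` is only there to suppress the document list.

```python
from urllib.parse import urlencode

# Facet on MyIntField for the documents matching "birds" to see which
# indexed values actually exist for that field.
params = {
    "q": "birds",
    "rows": 0,
    "facet": "true",
    "facet.field": "MyIntField",
}
facet_url = "http://localhost:8080/solr/select?" + urlencode(params)
print(facet_url)
```

If the facet counts show terms other than plain `0` and `1`, or no terms at all, that would point to the type mismatch mentioned above.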


Re: How does HTMLStripWhitespaceTokenizerFactory work?

2007-06-11 Thread Thierry Collogne

Ok. Thanks for the clarification. We will do the stripping before the
indexing.

On 11/06/07, Chris Hostetter <[EMAIL PROTECTED]> wrote:



: Ok. Is it possible to get back the content without the html tags?

Solr never does anything to modify the "stored" value of a field, so you'd
really need to send Solr the value after stripping the HTML to get this to
work.

Internally, the HTMLStripWhitespaceTokenizerFactory does the HTML
stripping as part of the tokenization process, so there is never a
single markup free value for the field in Solr.





-Hoss




Re: LIUS/Fulltext indexing

2007-06-11 Thread Bertrand Delacretaz

On 6/12/07, Yonik Seeley <[EMAIL PROTECTED]> wrote:


... I think Tika will be the way forward (some of the code for Tika is
coming from LIUS)...


Work has indeed started to incorporate the Lius code into Tika, see
https://issues.apache.org/jira/browse/TIKA-7 and
http://incubator.apache.org/projects/tika.html

-Bertrand