Re: Multi-language indexing and searching

2007-06-09 Thread Henrib

Hi Daniel,
Trying to recap: you are indexing documents that can be in different
languages. On the query side, users will only search in one language at a
time & get results in that language.

Setting aside the webapp deployment problem, the alternatives are thus:
option1: 1 schema with all fields of all languages pre-defined
option2: 1 schema per lang with the same field names (but a different type).

You indicate that your documents do have a field carrying the language. Is
the Solr document format the authoring format of the documents you index or
do they require some pre-processing to extract those fields? For instance,
are the source documents in HTML and pre-processed using some XPath/magic to
generate the fields?
In that case, using option1, the pre-processing transformation needs to know
which fields to generate according to the language. Option2 needs you to
know which core you need to target based on the lang. And it goes the same
way for querying; for option1, you need a query with different fields for
each language, while option2 requires targeting the correct core.
In the other case, i.e. if the Solr document format is the source format,
indexing requires some script (curl or else) to send them to Solr; having
the script determine which core to target doesn't seem (from afar) a hard task
(grep/awk to the rescue :-)).
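
As a rough illustration of the two options (a minimal sketch in Python, not
code from the thread; the language codes, the field-name suffix convention and
the core URLs are all hypothetical), the language field can drive either the
field name (option1) or the target core (option2):

```python
# All names below are illustrative assumptions, not taken from the thread.
LANG_CODES = {"English": "en", "Chinese": "zh"}

# Option 2: one Solr instance (core/web-app) per language; URLs are hypothetical.
CORE_URLS = {
    "en": "http://localhost:8983/solr-en",
    "zh": "http://localhost:8983/solr-zh",
}


def option1_field(base_name, language):
    """Option 1: a single schema with one pre-defined field per language."""
    return f"{base_name}_{LANG_CODES[language]}"


def option2_update_url(language):
    """Option 2: same field names everywhere; pick the core by language."""
    return CORE_URLS[LANG_CODES[language]] + "/update"


if __name__ == "__main__":
    print(option1_field("title", "Chinese"))
    print(option2_update_url("English"))
```

Either way the language has to be known at indexing and at query time; the
difference is whether it selects a field or an endpoint.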

On the maintenance side, if you needed to change the schema, reindex
one lang or add a lang, option1 seems to have a 'wider' impact, the
functional grain being coarser. Besides, if your collections are huge or
grow fast, it might be nice to have an easy way to partition the workload
across different machines, which seems easier with option2: direct indexing &
queries to a site based on the lang.

On the webapp deployment side, option1 is a breeze, while option2 requires
multiple web-apps (forgetting the SOLR-215 patch, which is unlikely to be
reviewed and accepted soon since its functional value is not shared).

Hope this helps in your choice, regards,
Henri







Daniel Alheiros wrote:
> 
> Hi Henri.
> 
> Thanks for your reply.
> I've just looked at the patch you referred to, but doing this I will lose the
> out of the box Solr installation... I'll have to create my own Solr
> application responsible for creating the multiple cores and I'll have to
> change my indexing process to something able to notify content for a
> specific core.
> 
> Can't I have the same index, using one single core, same field names being
> processed by language specific components based on a field/parameter?
> 
> I will try to draw what I'm thinking, please forgive me if I'm not using
> the correct terms but I'm not an IR expert.
> 
> Thinking in a workflow:
> Indexing:
>     Multilanguage indexer receives some documents
>     for each document, verify the "language" field
>         if language = "English" then process using the EnglishIndexer
>         else if language = "Chinese" then process using the ChineseIndexer
>         else if ...
> 
> Querying:
>     Multilanguage Request Handler receives a request
>         if parameter language = "English" then process using the English Request Handler
>         else if parameter language = "Chinese" then process using the Chinese Request Handler
>         else if ...
> 
> I can see that in the schema field definitions, we have some language
> dependent parameters... It can be a problem, as I would like to have the
> same fields for all requests...
> 
> Sorry to bother, but before I split all my data this way I would like to
> be sure that it's the best approach for me.
> 
> Regards,
> Daniel
> 
> 
> On 8/6/07 15:15, "Henrib" <[EMAIL PROTECTED]> wrote:
> 
>> 
>> Hi Daniel,
>> If it is functionally 'ok' to search in only one lang at a time, you
>> could try having one index per lang. Each per-lang index would have one
>> schema where you would describe field types (the lang part coming through
>> stemming/snowball analyzers, per-lang stopwords et al.) and the same field
>> name could be used in each of them.
>> You could either deploy that solution through multiple web-apps (one per
>> lang) or try the patch for issue SOLR-215.
>> Regards,
>> Henri
>> 
>> 
>> Daniel Alheiros wrote:
>>> 
>>> Hi, 
>>> 
>>> I'm just starting to use Solr and so far, it has been a very interesting
>>> learning process. I wasn't a Lucene user, so I'm learning a lot about
>>> both.
>>> 
>>> My problem is:
>>> I have to index and search content in several languages.
>>> 
>>> My scenario is a bit different from others that I've already read about
>>> in this forum, as my client is the same for searching any language, and
>>> it could be accomplished using a field to define the language.
>>> 
>>> My questions are more focused on how to keep the benefits of all the
>>> protwords, stopwords and synonyms in a multilanguage situation
>>> 
>>> Should I create new Analyzers that can deal with the "language" field of
>>> the document? What

Re: How can I use dates to boost my results?

2007-06-09 Thread Nick Jenkin

Hi Daniel
You can use a boosting function.

In the dismax request handler, insert the following:


   recip(rord(created),1,1000,1000)


Obviously you will need to modify the values a bit; more info here:
http://wiki.apache.org/solr/FunctionQuery
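
To get a feel for the numbers (a small Python sketch, not Solr code): the
FunctionQuery page defines recip(x,m,a,b) as a/(m*x+b), and rord as the
reverse ordinal of a field value, so the newest "created" value gets the
smallest ordinal. The rord helper below is a simplified stand-in that ranks
the distinct values of a list, and the sample dates are made up.

```python
def recip(x, m, a, b):
    """Reciprocal function as described for Solr function queries: a / (m*x + b)."""
    return a / (m * x + b)


def rord(values, value):
    """Simplified stand-in for rord(): 1 for the largest distinct value, 2 for the next, ..."""
    distinct_desc = sorted(set(values), reverse=True)
    return distinct_desc.index(value) + 1


if __name__ == "__main__":
    created = ["2007-06-01", "2007-06-05", "2007-06-09"]  # made-up sample dates
    for day in created:
        boost = recip(rord(created, day), 1, 1000, 1000)
        print(day, round(boost, 6))
```

With m=1 and a=b=1000 the boost decays slowly: a document has to be about
1000 positions behind the newest one before its boost drops to 0.5.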

-Nick

On 6/9/07, Daniel Alheiros <[EMAIL PROTECTED]> wrote:

Hi

For my search use, the document freshness is a relevant aspect that should
be considered to boost results.

I have a field in my index like this:



How can I make good use of this to boost my results?

I'm using the DisMaxRequestHandler to boost other textual fields based on
the query, but it would improve the quality of the results a lot if the date
were considered when computing the score.


Best Regards,
Daniel






Re: Wildcards / Binary searches

2007-06-09 Thread Frédéric Glorieux

Hi Chris,

The skills on this list are really very stimulating. I'm sad but I will 
probably not be able to contribute. Solr may not be the chosen 
technology of the project I'm working on, because of server 
administration issues (java). I know that there is no performance 
argument against it (lucene is incredible, and solr is nicely close to it), 
but that's real life. So I will not find time for the idea below.


> : project, definitively not a good practice for portability of indexes. A
> : duplicate field with an analyser to produce a sortable ASCII version
> : would be better.
>
> exactly ... I think conceptually the methodology for solving the problem
> is very similar to the way the SpellChecker contrib works: use a very
> custom index designed for the application (not just look at the terms in
> the main corpus) and custom logic for using that index.

Could it be a useful request handler? Given a field with a 
displayable stored value and a sortable indexed one, you need the 
analyser to parse the user entry, build a term from it, and very 
quickly get a pointer into the internal lucene index, exactly at the best 
place, for w, wo, wor or word. From the iterator you can display a 
suggestion list; it's also possible to get one or more of the docs directly 
attached, for example to display a count. It seems interesting for 
things like a topic or an author of a doc?
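
The "pointer at the best place" idea can be sketched outside Lucene (a
minimal Python illustration under the assumption that the indexed terms are
available as a sorted list; suggest and the sample terms are hypothetical
names, not a Lucene or Solr API): binary-search to the first term that is
not smaller than the user's prefix, then iterate while terms still match.

```python
from bisect import bisect_left


def suggest(sorted_terms, prefix, limit=10):
    """Seek to the first term >= prefix, then walk forward while terms match."""
    start = bisect_left(sorted_terms, prefix)
    matches = []
    for term in sorted_terms[start:]:
        if not term.startswith(prefix) or len(matches) >= limit:
            break
        matches.append(term)
    return matches


if __name__ == "__main__":
    terms = sorted(["apple", "water", "word", "work", "world", "zebra"])
    for typed in ("w", "wo", "wor", "word"):
        print(typed, suggest(terms, typed))
```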



: Do you mean something like below ?
: w wo wor word

yeah, but there are some Tokenizers that make this trivial
(EdgeNGramTokenizer i think is the name)
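
For readers unfamiliar with the term, an edge n-gram expansion simply emits
every leading substring of a token. A minimal Python sketch of the idea (not
the Lucene tokenizer itself; edge_ngrams is a hypothetical helper name):

```python
def edge_ngrams(term, min_len=1):
    """Return the leading substrings of term, from min_len characters up to the full term."""
    return [term[:n] for n in range(min_len, len(term) + 1)]


if __name__ == "__main__":
    print(" ".join(edge_ngrams("word")))
```

Indexing these prefixes as ordinary terms lets a plain term query on "wor"
match documents containing "word", at the cost of a larger index.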





--
Frédéric Glorieux
École nationale des chartes
direction des nouvelles technologies et de l'informatique


Re: To make sure XML is UTF-8

2007-06-09 Thread Tiong Jeffrey

This is what the whole process looks like:

1. I have a web page that I want to index. So I first copy that web page,
break it down into different sections, and store them in MySQL in different
columns.
2. I then wrote a small PHP script that draws all the values from all the
fields in MySQL and writes them into an XML file.
3. I then use Solr to index this XML file, and the error that appears
halfway through indexing is: "FATAL: Connection error (is Solr running at
http://localhost/solr/update
?): java.io.IOException: Server returned HTTP Response code: 500 for URL:
http://local/solr/update"
4. Although the error doesn't say that it is a UTF-8 encoding error in the
XML, I did a bit of research and looked at the XML file that I have, and it
isn't valid UTF-8.

I have been trying this for a couple of hours, but still to no avail. I would
like to find out:
1. How do I know what encoding the web page that I copied into MySQL uses?
2. At what point of this whole process should I convert it to UTF-8? I tried
changing the collation in MySQL for all the columns from latin1-swedish to
UTF-8, but it still doesn't work.

Thanks

On 6/9/07, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:


> Though this is not directly related to Solr: I have an XML output from a
> MySQL database, but during indexing the XML output is not working. The
> problem is that part of the XML output is not in UTF-8 encoding. How can I
> convert it to UTF-8, and how do I know what kind of encoding it uses in the
> first place (the data I export from the MySQL database)? Thanks!

How do you generate XML output? "Output" itself is usually a raw byte
array, it uses "Transport" and "Encoding". If you save it in a file
system and forget about "transport-layer-encoding" you will get some
new problems...

> during indexing the XML output is not working
- what exactly happens, which kind of error messages?





How to use SolrQuery PHP

2007-06-09 Thread Tiong Jeffrey

Hi all,

I am trying to send a query to Solr from my PHP script and retrieve the
results. I found this script on the wiki http://wiki.apache.org/solr/SolPHP

I tried to use it, but I guess I didn't use it correctly: no results
appeared (blank output). Below is my simple code using the SolrQuery class.

   $query = new SolrQuery;

   $query->limit = 10;
   $query->group_id = 1;

   // the search term is passed as a quoted string rather than a bare word
   $results = $query->runQuery('water');
   echo $results;

Can anyone tell me what is wrong with this? And is there any other way to
do this? Thanks!


Re: To make sure XML is UTF-8

2007-06-09 Thread Ken Krugler

> This is what the whole process looks like:
>
> 1. I have a web page that I want to index. So I first copy that web page,
> break it down into different sections, and store them in MySQL in different
> columns.
> 2. I then wrote a small PHP script that draws all the values from all the
> fields in MySQL and writes them into an XML file.
> 3. I then use Solr to index this XML file, and the error that appears
> halfway through indexing is: "FATAL: Connection error (is Solr running at
> http://localhost/solr/update
> ?): java.io.IOException: Server returned HTTP Response code: 500 for URL:
> http://local/solr/update"
> 4. Although the error doesn't say that it is a UTF-8 encoding error in the
> XML, I did a bit of research and looked at the XML file that I have, and it
> isn't valid UTF-8.
>
> I have been trying this for a couple of hours, but still to no avail. I would
> like to find out:
> 1. How do I know what encoding the web page that I copied into MySQL uses?


The charset can be in the response header, and/or the meta tags for 
the page. See 
http://krugle.com/kse/files/svn/svn.apache.org/lucene/nutch/trunk/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HtmlParser.java 
for code used by Nutch for this.


Or it could be missing from both. Or it could be wrong for either/both.

The issue of determining the right charset for an arbitrary web page 
isn't an easy one. If you have some way of doing analysis in advance 
such that you know for sure it's always X, that's going to simplify 
things for you.
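
A much-simplified version of that lookup order (header first, then meta tag)
could look like the Python sketch below. It is not the Nutch code linked
above; the function name and regular expressions are illustrative
assumptions, and it returns None when nothing is declared rather than
guessing.

```python
import re

# Matches charset=... in a Content-Type header value.
HEADER_RE = re.compile(r"charset\s*=\s*[\"']?([\w.:-]+)", re.IGNORECASE)
# Matches charset=... inside a <meta ...> tag in the raw page bytes.
META_RE = re.compile(rb"<meta[^>]+charset\s*=\s*[\"']?([\w.:-]+)", re.IGNORECASE)


def declared_charset(content_type, body):
    """Return the charset declared by the header or a meta tag, else None."""
    if content_type:
        match = HEADER_RE.search(content_type)
        if match:
            return match.group(1).lower()
    match = META_RE.search(body[:4096])  # meta tags sit near the top of the page
    if match:
        return match.group(1).decode("ascii").lower()
    return None


if __name__ == "__main__":
    page = b'<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"></head></html>'
    print(declared_charset("text/html", page))
```

A declared charset is only a claim; as noted above, it can be missing or
wrong, so the result still deserves a sanity check before it is trusted.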



> 2. At what point of this whole process should I convert it to UTF-8?


As soon as possible - which means right when you're processing the page.


> I tried changing the collation in MySQL for all the columns from
> latin1-swedish to UTF-8, but it still doesn't work.


Collation settings in the DB change how the DB interprets the data, 
but they don't change the data itself.
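
The difference between relabelling bytes and converting them can be shown in
a few lines of Python (a sketch, assuming the stored bytes really are
Latin-1; transcode_to_utf8 is a hypothetical helper name):

```python
def transcode_to_utf8(raw, source_charset):
    """Decode bytes using their real charset, then re-encode them as UTF-8."""
    return raw.decode(source_charset).encode("utf-8")


if __name__ == "__main__":
    latin1_bytes = "café".encode("latin-1")  # the 'é' is the single byte 0xE9

    # Merely declaring the same bytes to be UTF-8 does not make them valid UTF-8.
    try:
        latin1_bytes.decode("utf-8")
    except UnicodeDecodeError as exc:
        print("relabelling fails:", exc.reason)

    # Converting actually rewrites the bytes: 0xE9 becomes 0xC3 0xA9.
    print(transcode_to_utf8(latin1_bytes, "latin-1"))
```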


-- Ken




--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"If you can't find it, you can't fix it"


Re: How to use SolrQuery PHP

2007-06-09 Thread Nick Jenkin

Hi Tiong
My suggestion would be to write your own using the SolrQuery script as a guide.

But did you change define('SOLR_META_QUERY', '127.0.0.1:8080');
so that it points to your Solr server? (Which is most likely
define('SOLR_META_QUERY', 'localhost:8983');)

-Nick
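
For anyone writing their own client, the request itself is just an HTTP GET
with a few parameters. A minimal Python sketch of building such a URL
(assumptions: the /solr/select path, the localhost:8983 host mentioned above,
and the q, rows and wt parameters; select_url is a hypothetical helper, not
part of SolPHP):

```python
from urllib.parse import urlencode


def select_url(host, query, rows=10):
    """Build a Solr select URL for a keyword query, asking for a JSON response."""
    params = urlencode({"q": query, "rows": rows, "wt": "json"})
    return f"http://{host}/solr/select?{params}"


if __name__ == "__main__":
    # Fetching this URL (e.g. with urllib.request.urlopen) needs a running server.
    print(select_url("localhost:8983", "water"))
```

Opening the resulting URL in a browser is a quick way to check that the
server answers at all before debugging the client script.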


Re: To make sure XML is UTF-8

2007-06-09 Thread Nick Jenkin

> 2. I then wrote a small PHP script that draws all the values from all the
> fields in MySQL and writes them into an XML file


You might find the utf8_encode & utf8_decode php functions useful,
http://nz2.php.net/utf8_encode
http://nz2.php.net/utf8_decode

$utf8string = utf8_encode($row['column']);
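
One caveat: PHP's utf8_encode assumes its input is ISO-8859-1, so applying it
to text that is already UTF-8 double-encodes it. The Python sketch below
illustrates the effect and one possible guard (ensure_utf8 is a hypothetical
helper; the Latin-1 fallback is an assumption about the data):

```python
def ensure_utf8(raw, fallback="latin-1"):
    """Leave valid UTF-8 untouched; otherwise convert from the fallback charset."""
    try:
        raw.decode("utf-8")
        return raw
    except UnicodeDecodeError:
        return raw.decode(fallback).encode("utf-8")


if __name__ == "__main__":
    already_utf8 = "café".encode("utf-8")

    # Blindly treating UTF-8 bytes as Latin-1 and encoding again yields mojibake.
    double_encoded = already_utf8.decode("latin-1").encode("utf-8")
    print(double_encoded.decode("utf-8"))

    # The guarded version leaves the already-correct bytes alone.
    print(ensure_utf8(already_utf8).decode("utf-8"))
```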

-Nick



Re: To make sure XML is UTF-8

2007-06-09 Thread Chris Hostetter
: way during indexing is - "FATAL: Connection error (is Solr running at
: http://localhost/solr/update
: ?): java.io.IOException: Server returned HTTP Response code: 500 for URL:
: http://local/solr/update";
: 4.Although the error code doesnt specify is XML utf-8 code error, but I did
: a bit research, and look at the XML file that i have, it doesn't fulfill the
: utf-8 encoding

I *strongly* encourage you to look at the body of the response and/or the
error log of your Servlet container and find out *exactly* what the cause
of the error is ... you could spend a lot of time working on this and
discover it's not your real problem.
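
One way to see the response body from a script is sketched below in Python
(an illustration only: post_update and error_summary are hypothetical names,
and the content type is an assumption). The point is that an HTTP 500 still
carries a body, and reading it usually names the real cause.

```python
import urllib.error
import urllib.request


def error_summary(err):
    """Turn an HTTPError into a one-line summary that includes the response body."""
    body = err.read().decode("utf-8", errors="replace")
    return f"HTTP {err.code}: {body.strip()}"


def post_update(url, xml_bytes):
    """POST an XML update and return the server's reply, or the error body on failure."""
    request = urllib.request.Request(
        url,
        data=xml_bytes,
        headers={"Content-Type": "text/xml; charset=utf-8"},
    )
    try:
        with urllib.request.urlopen(request) as response:
            return response.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as err:
        return error_summary(err)
```

Calling post_update with the update URL and the bytes of the XML file prints
whatever explanation the servlet container put in the 500 response.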



-Hoss


Re: Wildcards / Binary searches

2007-06-09 Thread Chris Hostetter

: It could be a useful request handler ? Giving a field, with a

perhaps, but as i said -- i think it requires more than just a special
request handler, you want a special index as well.

FYI: there is an ongoing thread on this general topic on the java-user
list, i didn't have the time/energy to follow it but the concepts
discussed there might prove interesting for you (most of the people
involved have spent a lot more time on problems like this than i have)...

http://www.nabble.com/How-to-implement-AJAX-search%7ELucene-Search-part--tf3887286.html




-Hoss



Re: How to use SolrQuery PHP

2007-06-09 Thread Tiong Jeffrey

Yes, I did that. Maybe I should study the code in detail! Thanks!



Re: To make sure XML is UTF-8

2007-06-09 Thread Tiong Jeffrey

Yes, you are right! After I changed it to UTF-8 the error is still there... I
looked at the log, and this is what appears:

127.0.0.1 -  -  [10/06/2007:03:52:06 +] "POST /solr/update HTTP/1.1" 500 4022

I tried to search but couldn't understand what error this is. Does anybody
have any idea?
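
For context, that line looks like an access-log entry, which records only
the request, the status code (500) and the size of the response in bytes
(4022); the cause of the error is in the response body or the container's
error log. A small Python sketch that pulls those fields out (the regular
expression is an assumption based on the single line shown above):

```python
import re

# Request line in quotes, followed by the status code and the response size.
ACCESS_RE = re.compile(
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" (?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_request(line):
    """Extract method, path, status and response size from one access-log line."""
    match = ACCESS_RE.search(line)
    if not match:
        return None
    size = match.group("size")
    return {
        "method": match.group("method"),
        "path": match.group("path"),
        "status": int(match.group("status")),
        "size": int(size) if size.isdigit() else None,
    }


if __name__ == "__main__":
    entry = '127.0.0.1 -  -  [10/06/2007:03:52:06 +] "POST /solr/update HTTP/1.1" 500 4022'
    print(parse_request(entry))
```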

Thanks!!!



Re: Wildcards / Binary searches

2007-06-09 Thread Frédéric Glorieux

Chris Hostetter wrote:

> : It could be a useful request handler ? Giving a field, with a
>
> perhaps, but as i said -- i think it requires more than just a special
> request handler, you want a special index as well.
>
> FYI: there is an ongoing thread on this general topic on the java-user
> list, i didn't have the time/energy to follow it but the concepts
> discussed there might prove interesting for you (most of the people
> involved have spent a lot more time on problems like this than i have)...
>
> http://www.nabble.com/How-to-implement-AJAX-search%7ELucene-Search-part--tf3887286.html


Interesting, here is my idea : "WildcardTermEnum (NOT query)"




--
Frédéric Glorieux
École nationale des chartes
direction des nouvelles technologies et de l'informatique