On Jan 17, 2007, at 3:07 AM, Erik Hatcher wrote:

Why are you assigning all fields to a "string" type? That indexes each field as-is, with no tokenization at all. How are you using that field from the front-end? I'd think you'd want to copyField everything into a "text" field.

The short answer is there is no good reason for this. I guess I just hadn't thought too hard yet about the difference between string and text. This particular project is a gazetteer, so we're mostly indexing proper names (e.g. "China" and "中国"), which are mostly one-word and so don't need much tokenization anyway. But of course this isn't true for all our fields, and even some proper names (e.g. "lha sa") might benefit from tokenization.
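
For reference (mostly so I remember it when I go fix our schema), the change Erik is suggesting might look roughly like this in schema.xml -- this is just a sketch, and the field names are placeholders for our actual gazetteer fields, not what we have now:

  <!-- untokenized copy, useful for exact match / faceting -->
  <field name="placename" type="string" indexed="true" stored="true"/>

  <!-- tokenized catch-all field that the front-end actually searches -->
  <field name="text" type="text" indexed="true" stored="false" multiValued="true"/>

  <!-- copy everything into the tokenized field -->
  <copyField source="*" dest="text"/>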

I've been planning to separately index all our Chinese text with the ChineseAnalyzer (à la pages 142-145 in Lucene in Action), and Ed Garrett (who I think is also on this list... hi, Ed!) at U Michigan is working on a Tibetan analyzer that I also want to use; I just haven't gotten that far yet.
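
In case it's useful to anyone else, my rough (and so far untested) plan is to declare a separate field type that points at the contrib analyzer, since schema.xml lets you name a Lucene Analyzer class directly -- again, the field names here are just illustrative, and the lucene-analyzers contrib jar would need to be on Solr's classpath:

  <!-- Chinese text, analyzed with the Lucene contrib ChineseAnalyzer -->
  <fieldType name="text_zh" class="solr.TextField">
    <analyzer class="org.apache.lucene.analysis.cn.ChineseAnalyzer"/>
  </fieldType>

  <field name="placename_zh" type="text_zh" indexed="true" stored="true"/>

Presumably the Tibetan analyzer could be wired in the same way once it exists.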

So now I'm all motivated to go rewrite this thing so that it processes each language properly. Maybe I'll write something up for the wiki when I'm done.

Thanks again, Erik.

Bess


Elizabeth (Bess) Sadler
Head, Technical and Metadata Services
Digital Scholarship Services
Box 400129
Alderman Library
University of Virginia
Charlottesville, VA 22904

[EMAIL PROTECTED]
(434) 243-2305
