On Jan 17, 2007, at 3:07 AM, Erik Hatcher wrote:

Why are you assigning all fields to a "string" type? That indexes each field as-is, with no tokenization at all. How are you using that field from the front-end? I'd think you'd want to copyField everything into a "text" field.

The short answer is there is no good reason for this. I guess I just hadn't thought too hard yet about the difference between string and text. This particular project is a gazetteer, so we're mostly indexing proper names (e.g. "China" and "中国"), which are mostly one-word and so don't need much tokenization anyway. But of course this isn't true for all our fields, and even some proper names (e.g. "lha sa") might benefit from tokenization.
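
For reference (mostly so I remember it when I go fix our schema), the change Erik is suggesting might look roughly like this in schema.xml -- this is just a sketch, and the field names are placeholders for our actual gazetteer fields, not what we have now:

  <!-- untokenized copy, useful for exact match / faceting -->
  <field name="placename" type="string" indexed="true" stored="true"/>

  <!-- tokenized catch-all field that the front-end actually searches -->
  <field name="text" type="text" indexed="true" stored="false" multiValued="true"/>

  <!-- copy everything into the tokenized field -->
  <copyField source="*" dest="text"/>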

I've been planning to separately index all our Chinese text with the ChineseAnalyzer (à la pages 142-145 in Lucene in Action), and Ed Garrett (who I think is also on this list... hi, Ed!) at U Michigan is working on a Tibetan analyzer that I also want to use; I just haven't gotten that far yet.
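
In case it's useful to anyone else, my rough (and so far untested) plan is to declare a separate field type that points at the contrib analyzer, since schema.xml lets you name a Lucene Analyzer class directly -- again, the field names here are just illustrative, and the lucene-analyzers contrib jar would need to be on Solr's classpath:

  <!-- Chinese text, analyzed with the Lucene contrib ChineseAnalyzer -->
  <fieldType name="text_zh" class="solr.TextField">
    <analyzer class="org.apache.lucene.analysis.cn.ChineseAnalyzer"/>
  </fieldType>

  <field name="placename_zh" type="text_zh" indexed="true" stored="true"/>

Presumably the Tibetan analyzer could be wired in the same way once it exists.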

So now I'm all motivated to go rewrite this thing so that it processes each language properly. Maybe I'll write something up for the wiki when I'm done.

Thanks again, Erik.

Bess


Elizabeth (Bess) Sadler
Head, Technical and Metadata Services
Digital Scholarship Services
Box 400129
Alderman Library
University of Virginia
Charlottesville, VA 22904

[EMAIL PROTECTED]
(434) 243-2305
