Thank you, Mary.

You are right about the punctuation creating the problem. Words and stems 
without punctuation work just fine with my new custom dictionary.

We have a long list of industry-specific abbreviations and synonyms. I am 
experimenting with using stemming (instead of a thesaurus) to make searches 
return the same results regardless of whether the user searches for the full 
word or the abbreviation. Most of these abbreviations contain punctuation 
(either a period or an apostrophe). All of these values are in the same field, 
so I'll take your advice and investigate creating a custom tokenizer override 
for that field.

Thank you,
David

-----Original Message-----
From: [email protected] 
[mailto:[email protected]] On Behalf Of Mary Holstege
Sent: Wednesday, July 22, 2015 11:57 AM
To: MarkLogic Developer Discussion
Subject: Re: [MarkLogic Dev General] Custom dictionary for stemming


It may be a tokenization thing -- the apostrophe is causing a word break so 
your custom stem is never matched.

What does this give you: cts:tokenize(cts:stem("Int'l"))?

Do things work as you expect for a custom stem that doesn't have a punctuation 
character in it?

A workaround is to create a custom tokenizer override on a field that makes 
the apostrophe a word character. The override applies only to that specific 
field, however, and not to word queries in general.
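
To make the suspected failure mode concrete, here is a toy model in Python. It is only a sketch of the idea, not MarkLogic's tokenizer or stemmer: the CUSTOM_STEMS table, tokenize(), and stems() are hypothetical names invented for illustration. It assumes that text is split into word tokens first and that each token is then looked up in the custom dictionary.

```python
import re

# Hypothetical custom stem dictionary (word -> stem), keyed in lower case.
CUSTOM_STEMS = {"int'l": "international"}


def tokenize(text, extra_word_chars=""):
    """Split text into word tokens.

    Letters and digits are always word characters; any character listed in
    extra_word_chars is treated as a word character as well.
    """
    word_class = "[A-Za-z0-9" + re.escape(extra_word_chars) + "]+"
    return re.findall(word_class, text)


def stems(text, extra_word_chars=""):
    """Tokenize first, then look each token up in the custom dictionary."""
    return [
        CUSTOM_STEMS.get(token.lower(), token.lower())
        for token in tokenize(text, extra_word_chars)
    ]


# Default rules: the apostrophe breaks the word, so the entry never matches.
print(stems("Int'l"))        # ['int', 'l']

# Apostrophe treated as a word character: the whole token matches the entry.
print(stems("Int'l", "'"))   # ['international']
```

In this model the dictionary entry is correct in both cases; only the token boundaries change, which is why the override is about tokenization rather than about the dictionary itself.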

Regardless, you should probably report a bug to ML support.

//Mary

On Wed, 22 Jul 2015 08:02:33 -0700, Rhodes, David (LNG-CON) 
<[email protected]> wrote:

> I am trying to use a custom dictionary to extend the set of stemmed 
> words.
>
> I am using MarkLogic 7.0, and have been following the documentation 
> guides in Chapters 17 and 18:
> http://docs.marklogic.com/7.0/guide/search-dev/stemming
> http://docs.marklogic.com/7.0/guide/search-dev/custom-dictionaries
>
> I noted that there are two ways to see if words are resolving to their
> stems:
>
> cts:stem(word) returns the stems of word
>
> and
>
> cts:contains(word, stem) returns true if these two terms resolve to 
> the same stem
>
> I confirmed that both of these work for terms that are in the default 
> dictionary (e.g., run and running, bite and bitten)
>
> I have added a custom dictionary that adds "Int'l" as a word with 
> "International" as its stem.
>
> cdict:dictionary-write("en",$dict)
>
> With that dictionary added as the custom dictionary for English, 
> cts:stem works but cts:contains does not.
> cts:stem("Int'l") returns International
> cts:contains("Int'l", "International") returns false
>
> I reindexed my database, since I understand that my dictionary entry 
> means that all documents containing "Int'l" should now be indexed 
> under "International".
>
> cts:contains("Int'l", "International") still returns false.
> Furthermore, in the real search work flow that I am doing, searches 
> for "Int'l" do not return documents containing "International" (But 
> searches for "bitten" do return documents containing "bite").
>
> My database indexes are set to Stemmed Searches = Basic, and Word 
> Searches = False.
>
> I think that stemming can be a powerful feature for my work flow, if I 
> can just get it to work. Thank you for any advice you can offer.
>
> David


--
Using Opera's revolutionary email client: http://www.opera.com/mail/ 
_______________________________________________
General mailing list
[email protected]
Manage your subscription at: 
http://developer.marklogic.com/mailman/listinfo/general