Re: Have anyone used Automatic Phrase Tokenization (AutoPhrasingTokenFilterFactory) ?

James Strassburg Fri, 20 Mar 2015 12:13:20 -0700

I have an autophrase configured for 'wheel chair'. If I run analysis on
'super wheel chair awesome', so that it indexes as 'super wheelchair
awesome', this is how mine behaves:
http://i.imgur.com/iR4IgGp.png


When I did the implementation that is how I thought the positioning should
work. Do you think it should be different?
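
To make the positioning question concrete, here is a minimal sketch in plain
Python (it is not Lucene or APTF code). It models a positional index and an
exact phrase match, then compares two layouts: the gapped numbering described
in the quoted message below, and original positions kept with the phrase token
stacked on the last word. The function names and the concrete position numbers
are illustrative assumptions, not output from the analysis page.

```python
from collections import defaultdict


def build_index(tokens):
    """Map each term to the set of positions where it occurs.

    `tokens` is a list of (term, position) pairs, a simplified stand-in
    for the term/position output of an analysis chain.
    """
    index = defaultdict(set)
    for term, position in tokens:
        index[term].add(position)
    return index


def phrase_matches(index, terms):
    """Return True if `terms` occur at consecutive positions (slop 0)."""
    if not terms:
        return False
    for start in index.get(terms[0], ()):
        if all(start + offset in index.get(term, ())
               for offset, term in enumerate(terms)):
            return True
    return False


# Assumed layout: original numbering kept, phrase token stacked on the
# position of the last word in the matched series.
stacked = build_index([
    ("super", 1), ("wheel", 2), ("chair", 3), ("wheelchair", 3), ("awesome", 4),
])

# Assumed layout: gapped numbering, republic(n) of(n+3) china(n+7) with n = 3.
gapped = build_index([("republic", 3), ("of", 6), ("china", 10)])

print(phrase_matches(stacked, ["wheel", "chair"]))          # True
print(phrase_matches(stacked, ["wheelchair", "awesome"]))   # True
print(phrase_matches(stacked, ["super", "wheelchair"]))     # False
print(phrase_matches(gapped, ["republic", "of", "china"]))  # False
```

In this simplified model the stacked layout keeps "wheel chair" working as a
phrase, but "super wheelchair" no longer matches at slop 0 because the phrase
token sits on the last word's position. That trade-off is worth weighing
against stacking it on the first word instead.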

On Fri, Mar 20, 2015 at 11:10 AM, trhodesg <trhodes...@gmail.com> wrote:

> Sorry, I can see my post is munged. This seems to display it legibly:
>
> http://lucene.472066.n3.nabble.com/Have-anyone-used-Automatic-Phrase-Tokenization-AutoPhrasingTokenFilterFactory-td4173808.html
>
> I'm new to all this, so I hesitate to say the indexing isn't correct. But
> my understanding is that the query "republic of china" will only match
> the indexing republic(n) of(n+1) china(n+2). Since the original APTF
> indexes this as republic(n) of(n+3) china(n+7), that query will fail.
> Wouldn't it be more logical to leave the original token numbering
> unchanged and just add the phrase token with the same number as the last
> word in the matched series?
>
> BTW, I looked at your code re this. It is quite informative to a newbie.
> Thanks!
>
> On 3/19/2015 11:38 AM, James Strassburg [via Lucene] wrote:
>
> > Sorry, I've been a bit unfocused from this list for a bit. When I was
> > working with the APTF code I rewrote a big chunk of it and didn't
> > include the original tokens, as I didn't need them at the time. That
> > feature could easily be added back in. I will see if I can find a bit
> > of time for that.
> >
> > As for the other part of your message, are you suggesting that the
> > token indexes are not correct? There is a bit of a formatting issue
> > with the text and I'm not sure what you're getting at. Can you explain
> > further please?
> >
> > On Sun, Feb 8, 2015 at 3:04 PM, trhodesg <[hidden email]> wrote:
> >
> > > Thanks to everyone for the thought, time and effort put into
> > > AutoPhrasingTokenFilter (APTF)! It's a real lifesaver.
> > >
> > > While trying to add APTF to my indexing, I discovered that the
> > > original (TS) version throws an exception while indexing a 100MB PDF.
> > > The error is:
> > >
> > >   Exception writing document to the index; possible analysis error
> > >
> > > The modified (JS) version runs without error, but it removes the
> > > tokens used to create the phrase. They are needed.
> > >
> > > Before looking into this I have a question. Solr would normally
> > > tokenize the phrase "the peoples republic of china is" as
> > >
> > >   the(1) peoples(2) republic(3) of(4) china(5) is(6)
> > >
> > > Defining the APTF phrase file as [...] the Solr admin analysis page
> > > reports that the APTF indexer tokenizes the phrase as [...]
> > >
> > > Would it be possible for someone to explain the reasoning behind the
> > > discontinuous token numbering? As it is now, phrase queries such as
> > > "republic of china" will fail. And I can't get proximity queries like
> > > "republic of"~10 to work either (though it seems they should).
> > > Wouldn't it be more flexible to return the following tokenization
> > > [...]
> > >
> > > This allows spurious matches such as "peoples peoplesrepublic" but it
> > > seems like this type of event would be very rare. It has the advantage
> > > of allowing phrase queries to continue working the way most users
> > > think.
> > >
> > > Thank you for supporting more than one entity definition per phrase
> > > (i.e. peoplesrepublic and peoplesrepublicofchina). This type of
> > > contraction is common in longer documents, especially when the first
> > > used phrase ends with a preposition. It helps support robust matching.
> > >
> > > --
> > > View this message in context:
> > > http://lucene.472066.n3.nabble.com/Have-anyone-used-Automatic-Phrase-Tokenization-AutoPhrasingTokenFilterFactory-tp4173808p4184888.html
> > > Sent from the Solr - User mailing list archive at Nabble.com.
> >
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Have-anyone-used-Automatic-Phrase-Tokenization-AutoPhrasingTokenFilterFactory-tp4173808p4194205.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
