Re: Extracting important multi term phrases from the text

Pratik Patel Tue, 20 Nov 2018 08:32:23 -0800

@David Sorry for the late reply. The SKG query that I am using is actually
fairly basic in itself.  For example,


{
  "queries": [
    "dataStoreId:\"123\"",
    "text:\"foo\""
  ],
  "compare": [
    {
      "type": "text_shingles",
      "limit": 30,
      "discover_values": true
    }
  ]
}
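
A body like the one above can also be assembled in code rather than typed by
hand. The following is only a minimal Python sketch, assuming the same keys as
the query above. The helper name build_skg_request is made up for illustration,
and nothing here sends the request, because the handler path depends on how the
plugin is registered in your Solr instance.

```python
import json


def build_skg_request(filters, compare_field, limit=30):
    """Build an SKG-style request body (hypothetical helper, not part of the plugin)."""
    return {
        # Each entry is a plain Solr query string that narrows the document set.
        "queries": list(filters),
        "compare": [
            {
                # Field whose values should be scored against the queries.
                "type": compare_field,
                "limit": limit,
                "discover_values": True,
            }
        ],
    }


body = build_skg_request(['dataStoreId:"123"', 'text:"foo"'], "text_shingles")
payload = json.dumps(body, indent=2)  # string to use as the POST body
print(payload)
```

json.dumps takes care of escaping the inner double quotes, which is the part
that is easy to get wrong when writing the body by hand.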


What I am expecting is that SKG will return words/phrases that are related
to the term "foo". I am filtering the text through StopWordFilter before
that. I have also found that specifying a good foreground can drastically
improve the results.
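
For anyone who finds this thread later, here is a rough Python sketch of why a
stop filter followed by a shingle filter can leave filler tokens in a field like
text_shingles. It is a simplified illustration and not the Lucene
implementation: the function name, the lowercasing, the whitespace split and the
drop_gaps flag are all assumptions made for the example.

```python
def bigram_shingles(text, stopwords, filler_token="_", drop_gaps=False):
    """Approximate two-word shingles over text whose stopwords were removed.

    Simplified illustration only; it does not reproduce Lucene's ShingleFilter.
    """
    tokens = text.lower().split()
    # A removed stopword leaves a positional gap, modelled here as None.
    slots = [None if tok in stopwords else tok for tok in tokens]

    shingles = []
    for left, right in zip(slots, slots[1:]):
        if left is None and right is None:
            continue  # nothing but gaps, no shingle to emit
        if drop_gaps and (left is None or right is None):
            continue  # skip shingles that would contain only one real word
        shingles.append(" ".join(filler_token if t is None else t
                                 for t in (left, right)))
    return shingles


with_filler = bigram_shingles("I like the machine learning", {"the"})
without_gaps = bigram_shingles("I like the machine learning", {"the"},
                               drop_gaps=True)
print(with_filler)
print(without_gaps)
```

In this sketch the first call keeps shingles such as "like _", while the second
call discards every shingle that touches a removed word, which is similar in
spirit to cleaning the field so that no single-word strings remain.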

Good luck!

- Pratik

On Fri, Nov 16, 2018 at 11:15 AM Alexandre Rafalovitch <arafa...@gmail.com>
wrote:

> Good catch Pratik.
>
> It is in Javadoc, but not in the reference guide:
>
> https://lucene.apache.org/core/6_3_0/analyzers-common/org/apache/lucene/analysis/shingle/ShingleFilterFactory.html
> . I'll try to fix that later (SOLR-12996).
>
> Regards,
>    Alex.
> On Fri, 16 Nov 2018 at 10:44, Pratik Patel <pra...@semandex.net> wrote:
> >
> > @Markus @Walter,  @Alexandre is right. The culprit was not StopWordFilter,
> > it was ShingleFilter. I could not find the parameter fillerToken in the
> > documentation, is it a new addition? BTW, I tried that and it works. Thanks!
> > I still ended up using pattern replacement filter because I did not want
> > any single word string in that field.
> >
> > @David I am using SKG through the plugin. So it is a POST request with
> > query in body. I haven't yet upgraded to version 7.5.
> >
> > Thank you all for the help!
> >
> > Regards,
> > Pratik
> >
> > On Fri, Nov 16, 2018 at 8:36 AM David Hastings <
> hastings.recurs...@gmail.com>
> > wrote:
> >
> > > Which function of the SKG are you using?  significantTerms?
> > >
> > > On Thu, Nov 15, 2018 at 7:09 PM Alexandre Rafalovitch <
> arafa...@gmail.com>
> > > wrote:
> > >
> > > > I think the underscore actually comes from the Shingles (parameter
> > > > fillerToken). Have you tried setting it to empty string?
> > > >
> > > > Regards,
> > > >    Alex.
> > > > On Thu, 15 Nov 2018 at 17:16, Pratik Patel <pra...@semandex.net>
> wrote:
> > > > >
> > > > > Hi Markus,
> > > > >
> > > > > > Thanks for the reply. I tried using ShingleFilter and it seems to
> > > > > > be working. However, I am hitting an issue when it is used with
> > > > > > StopWordFilter. StopWordFilter leaves an underscore "_" for removed
> > > > > > words and it kind of screws up the data in index.
> > > > > >
> > > > > > I tried setting enablePositionIncrements="false" for stop word filter
> > > > > > but that parameter only works for lucene version 4.3 or earlier. Looks
> > > > > > like it's an open issue in lucene
> > > > > > https://issues.apache.org/jira/browse/LUCENE-4065
> > > > > >
> > > > > > For now, I am trying to find a workaround using
> > > > > > PatternReplaceFilterFactory.
> > > > >
> > > > > Regards,
> > > > > Pratik
> > > > >
> > > > > On Thu, Nov 15, 2018 at 4:15 PM Markus Jelsma <
> > > > markus.jel...@openindex.io>
> > > > > wrote:
> > > > >
> > > > > > Hello Pratik,
> > > > > >
> > > > > > > We would use ShingleFilter for this indeed. If you only want
> > > > > > > bigrams/shingles, don't forget to disable outputUnigrams and set both
> > > > > > > shingle size limits to 2.
> > > > > >
> > > > > > Regards,
> > > > > > Markus
> > > > > >
> > > > > > -----Original message-----
> > > > > > > From:Pratik Patel <pra...@semandex.net>
> > > > > > > Sent: Thursday 15th November 2018 17:00
> > > > > > > To: solr-user@lucene.apache.org
> > > > > > > Subject: Extracting important multi term phrases from the text
> > > > > > >
> > > > > > > Hello Everyone,
> > > > > > >
> > > > > > > > The standard way of tokenizing in Solr divides the text by
> > > > > > > > whitespace.
> > > > > > >
> > > > > > > > Is there a way by which we can index multi-term phrases like
> > > > > > > > "Machine Learning" instead of "Machine", "Learning"?
> > > > > > > > Is it possible to create a specific field type for such phrases
> > > > > > > > which has its own indexing pipeline? I am open to storing n-grams but
> > > > > > > > these n-grams would be across terms and not just one term? In other
> > > > > > > > words, I don't want to store n-grams of the term "machine", I want to
> > > > > > > > store n-grams for a sentence like below.
> > > > > > > >
> > > > > > > > "I like machine learning" --> "I like", "like machine",
> > > > > > > > "machine learning" and so on.....
> > > > > > >
> > > > > > > > It seems like Shingle Filter (
> > > > > > > > https://lucene.apache.org/solr/guide/6_6/filter-descriptions.html#FilterDescriptions-ShingleFilter
> > > > > > > > )
> > > > > > > > may be used for this. Is there a better alternative?
> > > > > > >
> > > > > > > > I want to use this field as an input to Semantic Knowledge Graph.
> > > > > > > > The plugin works great for words. But now I want to use it for
> > > > > > > > phrases. Any idea around this would be really helpful.
> > > > > > >
> > > > > > > Thanks a lot!
> > > > > > >
> > > > > > > - Pratik
> > > > > > >
> > > > > >
> > > >
> > >
>
