@David Sorry for the late reply. The SKG query that I am using is actually fairly basic. For example,
{
  "queries": [
    "dataStoreId:\"123\"",
    "text:\"foo\""
  ],
  "compare": [
    {
      "type": "text_shingles",
      "limit": 30,
      "discover_values": true
    }
  ]
}

What I am expecting is that SKG will return words/phrases that are related to the term "foo". I am filtering the text through StopWordFilter before that. I have also found that specifying a good foreground can drastically improve the results.

Good luck!
- Pratik

On Fri, Nov 16, 2018 at 11:15 AM Alexandre Rafalovitch <arafa...@gmail.com> wrote:

> Good catch Pratik.
>
> It is in Javadoc, but not in the reference guide:
> https://lucene.apache.org/core/6_3_0/analyzers-common/org/apache/lucene/analysis/shingle/ShingleFilterFactory.html
> I'll try to fix that later (SOLR-12996).
>
> Regards,
>    Alex.
>
> On Fri, 16 Nov 2018 at 10:44, Pratik Patel <pra...@semandex.net> wrote:
> >
> > @Markus @Walter, @Alexandre is right. The culprit was not StopWordFilter, it was ShingleFilter. I could not find the parameter fillerToken in the documentation, is it a new addition? BTW, I tried that and it works. Thanks!
> > I still ended up using pattern replacement filter because I did not want any single word string in that field.
> >
> > @David I am using SKG through the plugin. So it is a POST request with query in body. I haven't yet upgraded to version 7.5.
> >
> > Thank you all for the help!
> >
> > Regards,
> > Pratik
> >
> > On Fri, Nov 16, 2018 at 8:36 AM David Hastings <hastings.recurs...@gmail.com> wrote:
> >
> > > Which function of the SKG are you using? significantTerms?
> > >
> > > On Thu, Nov 15, 2018 at 7:09 PM Alexandre Rafalovitch <arafa...@gmail.com> wrote:
> > >
> > > > I think the underscore actually comes from the Shingles (parameter fillerToken). Have you tried setting it to empty string?
> > > >
> > > > Regards,
> > > >    Alex.
> > > >
> > > > On Thu, 15 Nov 2018 at 17:16, Pratik Patel <pra...@semandex.net> wrote:
> > > > >
> > > > > Hi Markus,
> > > > >
> > > > > Thanks for the reply. I tried using ShingleFilter and it seems to be working. However, I am hitting an issue when it is used with StopWordFilter. StopWordFilter leaves an underscore "_" for removed words and it kind of screws up the data in index.
> > > > >
> > > > > I tried setting enablePositionIncrements="false" for stop word filter but that parameter only works for lucene version 4.3 or earlier. Looks like it's an open issue in lucene
> > > > > https://issues.apache.org/jira/browse/LUCENE-4065
> > > > >
> > > > > For now, I am trying to find a workaround using PatternReplaceFilterFactory.
> > > > >
> > > > > Regards,
> > > > > Pratik
> > > > >
> > > > > On Thu, Nov 15, 2018 at 4:15 PM Markus Jelsma <markus.jel...@openindex.io> wrote:
> > > > >
> > > > > > Hello Pratik,
> > > > > >
> > > > > > We would use ShingleFilter for this indeed. If you only want bigrams/shingles, don't forget to disable outputUnigrams and set both shingle size limits to 2.
> > > > > >
> > > > > > Regards,
> > > > > > Markus
> > > > > >
> > > > > > -----Original message-----
> > > > > > > From: Pratik Patel <pra...@semandex.net>
> > > > > > > Sent: Thursday 15th November 2018 17:00
> > > > > > > To: solr-user@lucene.apache.org
> > > > > > > Subject: Extracting important multi term phrases from the text
> > > > > > >
> > > > > > > Hello Everyone,
> > > > > > >
> > > > > > > Standard way of tokenizing in solr would divide the text by white space.
> > > > > > >
> > > > > > > Is there a way by which we can index multi-term phrases like "Machine Learning" instead of "Machine", "Learning"?
> > > > > > > Is it possible to create a specific field type for such phrases which has its own indexing pipeline? I am open to storing n-grams but these n-grams would be across terms and not just one term. In other words, I don't want to store n-grams of the term "machine", I want to store n-grams for a sentence like below.
> > > > > > >
> > > > > > > "I like machine learning" --> "I like", "like machine", "machine learning" and so on.....
> > > > > > >
> > > > > > > It seems like Shingle Filter (
> > > > > > > https://lucene.apache.org/solr/guide/6_6/filter-descriptions.html#FilterDescriptions-ShingleFilter
> > > > > > > ) may be used for this. Is there a better alternative?
> > > > > > >
> > > > > > > I want to use this field as an input to Semantic Knowledge Graph. The plugin works great for words. But now I want to use it for phrases. Any idea around this would be really helpful.
> > > > > > >
> > > > > > > Thanks a lot!
> > > > > > >
> > > > > > > - Pratik
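
For anyone reading this thread later, here is a rough Python sketch of the behaviour discussed above: bigram shingles, the "_" that shows up where a stop word was removed, and the effect of setting the filler to an empty string. It is only a toy model and not Lucene's ShingleFilter code. The function names, the whitespace tokenization, the choice to skip pairs made only of gaps, and the cleanup step are all assumptions made for illustration.

```python
def bigram_shingles(text, stop_words=frozenset(), filler_token="_", separator=" "):
    """Toy model of stop-word removal followed by bigram shingling."""
    # None marks the position gap left behind by a removed stop word.
    tokens = [None if word.lower() in stop_words else word for word in text.split()]
    shingles = []
    for left, right in zip(tokens, tokens[1:]):
        if left is None and right is None:
            continue  # choice made for this sketch: skip pairs that are only gaps
        # A gap is rendered with the filler token, similar in spirit to fillerToken.
        parts = [filler_token if tok is None else tok for tok in (left, right)]
        shingles.append(separator.join(parts))
    return shingles


def drop_single_word_shingles(shingles, separator=" "):
    """Hypothetical cleanup for filler_token="": keep only two-word shingles."""
    # After stripping, a shingle built around an empty filler has no separator left.
    return [s for s in shingles if separator in s.strip()]


if __name__ == "__main__":
    sentence = "I like the machine learning"
    print(bigram_shingles("I like machine learning"))
    print(bigram_shingles(sentence, stop_words={"the"}))
    with_empty_filler = bigram_shingles(sentence, stop_words={"the"}, filler_token="")
    print(with_empty_filler)
    print(drop_single_word_shingles(with_empty_filler))
```

In this model an empty filler still leaves one-word strings with a stray separator (for example "like "), which is consistent with the need for an extra cleanup filter mentioned earlier in the thread.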