Re: Concatenate multiple tokens into one
Hi Robert, All,

I have a similar problem, here is my fieldType: http://paste.pocoo.org/show/289910/

I want to include stopword removal and lowercase the incoming terms, the idea being to take "Foo Bar Baz Ltd" and turn it into "foobarbaz" for the EdgeNgram filter factory. If anyone can tell me a simple way to concatenate tokens into one token again, similar to the KeywordTokenizer, that would be super helpful.

Many thanks

Nick

On 11 Nov 2010, at 00:23, Robert Gründler wrote:

> On Nov 11, 2010, at 1:12 AM, Jonathan Rochkind wrote:
>
>> Are you sure you really want to throw out stopwords for your use case? I
>> don't think autocompletion will work how you want if you do.
>
> In our case I think it makes sense. The content is targeting the electronic
> music / DJ scene, so we have a lot of words like "DJ" or "featuring" which
> make sense to throw out of the query. Also, searches for "the beastie boys"
> and "beastie boys" should return a match in the autocompletion.
>
>> And if you don't... then why use the WhitespaceTokenizer and then try to jam
>> the tokens back together? Why not just NOT tokenize in the first place? Use
>> the KeywordTokenizer, which really should be called the
>> NonTokenizingTokenizer, because it doesn't tokenize at all; it just creates
>> one token from the entire input string.
>
> I started out with the KeywordTokenizer, which worked well, except for the
> StopWord problem.
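The end-to-end normalization Nick describes (lowercase, drop stopwords, join the surviving tokens into one) can be sketched outside Solr in plain Java. This is a hypothetical stand-in for the analysis chain, just to pin down the target behaviour before the EdgeNGram stage; the class name and stopword list are invented for illustration:

```java
import java.util.Arrays;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical stand-in for the analysis chain: lowercase the input,
// remove stopwords, then concatenate the surviving tokens into one token.
public class TokenConcat {

    // Example stopword list; a real setup would read stopwords.txt.
    static final Set<String> STOPWORDS = Set.of("ltd", "the", "dj", "featuring");

    static String normalize(String input) {
        return Arrays.stream(input.toLowerCase(Locale.ROOT).split("\\s+"))
                .filter(t -> !STOPWORDS.contains(t))
                .collect(Collectors.joining());
    }

    public static void main(String[] args) {
        System.out.println(normalize("Foo Bar Baz Ltd"));  // foobarbaz
        System.out.println(normalize("The Beastie Boys")); // beastieboys
    }
}
```

With this behaviour, "the beastie boys" and "beastie boys" both normalize to the same single token, which is the property Robert wants for autocompletion.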
> For now, I've come up with a quick-and-dirty custom "ConcatFilter", which
> does what I'm after:
>
> public class ConcatFilter extends TokenFilter {
>
>     private TokenStream tstream;
>
>     protected ConcatFilter(TokenStream input) {
>         super(input);
>         this.tstream = input;
>     }
>
>     @Override
>     public Token next() throws IOException {
>         Token token = new Token();
>         StringBuilder builder = new StringBuilder();
>
>         TermAttribute termAttribute = (TermAttribute)
>             tstream.getAttribute(TermAttribute.class);
>         TypeAttribute typeAttribute = (TypeAttribute)
>             tstream.getAttribute(TypeAttribute.class);
>
>         boolean incremented = false;
>
>         while (tstream.incrementToken()) {
>             if (typeAttribute.type().equals("word")) {
>                 builder.append(termAttribute.term());
>             }
>             incremented = true;
>         }
>
>         token.setTermBuffer(builder.toString());
>
>         if (incremented)
>             return token;
>
>         return null;
>     }
> }
>
> I'm not sure if this is a safe way to do this, as I'm not familiar with the
> whole Solr/Lucene implementation after all.
>
> best
>
> -robert
>
>> Then lowercase, remove whitespace (or not), do whatever else you want to do
>> to your single token to normalize it, and then edgengram it.
>>
>> If you include whitespace in the token, then when making your queries for
>> auto-complete, be sure to use a query parser that doesn't do
>> "pre-tokenization"; the 'field' query parser should work well for this.
>> Jonathan
>>
>> From: Robert Gründler [rob...@dubture.com]
>> Sent: Wednesday, November 10, 2010 6:39 PM
>> To: solr-user@lucene.apache.org
>> Subject: Concatenate multiple tokens into one
>>
>> Hi,
>>
>> I've created the following filterchain in a field type, the idea is to use
>> it for autocompletion purposes:
>>
>> [fieldType XML mangled in the archive; the surviving attributes show a
>> stop filter with words="stopwords.txt" enablePositionIncrements="true"
>> and a pattern replace filter with replacement="" replace="all"]
>>
>> With that kind of filterchain, the EdgeNGramFilterFactory will receive
>> multiple tokens on input strings with whitespaces in them. This leads to
>> the following results:
>>
>> Input Query: "George Cloo"
>> Matches:
>> - "George Harrison"
>> - "John Clooridge"
>> - "George Smith"
>> - "George Clooney"
>> - etc.
>>
>> However, only "George Clooney" should match in the autocompletion use case.
>> Therefore, I'd like to add a filter before the EdgeNGramFilterFactory which
>> concatenates all the tokens generated by the WhitespaceTokenizerFactory.
>> Are there filters which can do such a thing?
>>
>> If not, are there examples of how to implement a custom TokenFilter?
>>
>> thanks!
>>
>> -robert
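Jonathan's suggestion above (don't tokenize at all, then lowercase, strip whitespace, and edgengram the single token) would look roughly like this as a Solr fieldType. This is a sketch using the standard factory names, not a tested schema, and note it does not by itself solve the stopword problem Robert and Nick describe, since StopFilterFactory operates on individual tokens:

```xml
<fieldType name="autocomplete" class="solr.TextField">
  <analyzer type="index">
    <!-- one token from the entire input, no tokenization -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- strip whitespace so "George Clooney" becomes "georgeclooney" -->
    <filter class="solr.PatternReplaceFilterFactory"
            pattern="\s+" replacement="" replace="all"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PatternReplaceFilterFactory"
            pattern="\s+" replacement="" replace="all"/>
  </analyzer>
</fieldType>
```

The query analyzer deliberately omits the EdgeNGram filter, so the query "George Cloo" becomes the single token "georgecloo" and matches only the indexed grams of "George Clooney".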
Re: Concatenate multiple tokens into one
Thanks Robert,

I had been trying to get your ConcatFilter to work, but I'm not sure what I need in the classpath and where Token comes from. Will check the thread you mention.

Best

Nick

On 11 Nov 2010, at 18:13, Robert Gründler wrote:

> I've posted a ConcatFilter in my previous mail which does concatenate tokens.
> This works fine, but I realized that what I wanted to achieve is implemented
> more easily in another way (by using 2 separate field types).
>
> Have a look at a previous mail I wrote to the list and the reply from Ahmet
> Arslan (topic: "EdgeNGram relevancy").
>
> best
>
> -robert
Re: EdgeNGram relevancy
On 12 Nov 2010, at 01:46, Ahmet Arslan wrote:

>> This setup now makes trouble regarding StopWords; here's an example:
>>
>> Let's say the index contains 2 Strings: "Mr Martin Scorsese" and
>> "Martin Scorsese". "Mr" is in the stopword list.
>>
>> Query: edgytext:Mr Scorsese OR edgytext2:Mr Scorsese^2.0
>>
>> This way, the only result I get is "Mr Martin Scorsese", because the
>> strict field edgytext2 is boosted by 2.0.
>>
>> Any idea why in this case "Martin Scorsese" is not in the result at all?
>
> Did you run your query without using () and "" operators? If yes, can you
> try this?
>
> &q=edgytext:(Mr Scorsese) OR edgytext2:"Mr Scorsese"^2.0
>
> If no, can you paste the output of &debugQuery=on?

This would still not deal with the problem of removing stop words from the indexing and query analysis stages. I really need something that will allow that and give a single token, as in the example below.

Best

Nick
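For reference, the two-field setup this exchange revolves around pairs a loose, whitespace-tokenized EdgeNGram field with a stricter field queried as a boosted phrase. A rough sketch follows; the field type names come from the thread, but the attribute values and exact filter choices are assumptions, not the poster's actual schema:

```xml
<!-- loose field: per-word edge n-grams, matches any word prefix -->
<fieldType name="edgytext" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- strict field: whole input as one token, queried as a boosted phrase -->
<fieldType name="edgytext2" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Queried as in Ahmet's suggestion, `q=edgytext:(Mr Scorsese) OR edgytext2:"Mr Scorsese"^2.0`, exact phrase matches on the strict field rank above the loose per-word prefix matches.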
Re: synonyms not working with copyfield
Hi,

You could use a copyField against all fields and then AND the query terms given. Quite restrictive, but all terms would then have to be present to match. I'm still a relative newbie to Solr, so perhaps I'm horribly wrong.

Cheers

Nick

On 13 May 2010, at 18:18, surajit wrote:

> Understood, and I can work with that limitation by using separate fields
> during indexing. However, my search interface is just a text box like
> Google, and I need to take the query and return only those documents that
> match ALL terms in the query. If I am going to take the query and match it
> against each field (separately), how do I get back documents matching all
> user terms? One solution I can think of is to take all the field-specific
> analysis out of Solr and do it as a pre-processing step, but I want to make
> sure there isn't an alternative within Solr.
>
> surajit
>
> On Thu, May 13, 2010 at 12:42 PM, Chris Hostetter-3 [via Lucene] wrote:
>> : which is good, but the different fields that I copy into the copyfield
>> : need different analysis and I no longer am able to do that. I can, of
>> : course,
>>
>> Fundamentally, Solr can only apply a single analysis chain to all of the
>> text in a given field, regardless of where it may be copied from.
>> If it didn't, there would be no way to get matches at query time.
>>
>> The query analysis has to "make sense" for the index analysis, so it has
>> to be consistent.
>>
>> -Hoss
>>
>> View message @
>> http://lucene.472066.n3.nabble.com/synonyms-not-working-with-copyfield-tp814108p815302.html
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/synonyms-not-working-with-copyfield-tp814108p815426.html
> Sent from the Solr - User mailing list archive at Nabble.com.
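Nick's "copyField plus AND" suggestion can be expressed in schema.xml without changing the client query at all. A sketch, assuming an older (1.x/3.x-era) schema where the default operator is set in the schema rather than per request; the field and type names are hypothetical:

```xml
<!-- schema.xml sketch: catch-all field populated from every other field -->
<field name="text" type="text" indexed="true" stored="false" multiValued="true"/>
<copyField source="*" dest="text"/>

<!-- make the standard query parser require ALL terms to match -->
<solrQueryParser defaultOperator="AND"/>
```

With this in place, a Google-style single text box query against the catch-all field only returns documents containing every term, which is the behaviour surajit asks for, at the cost Hoss notes: the catch-all field gets one shared analysis chain.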