Apologies if things were a little vague. Given the example snippet to index (line-numbered to show which searches need to match)...
1: i am a sales-manager in here
2: using asp.net and .net daily
3: working in design.
4: using something called sage 200. and i'm fluent
5: german sausages.
6: busy A&E dept earning £10,000 annually

... all with newlines in place. It should be able to match...

1. sales
1. "sales manager"
1. sales-manager
1. "sales-manager"
2. .net
2. asp.net
3. design
4. sage 200
6. A&E
6. £10,000

But it should NOT match "fluent german" across lines 4 + 5, since there's a newline between them when indexed, but not when searched.

Do the filters (WordDelimiterFilter, in this case) not create multiple tokens, so that splitting on the period in "asp.net" would create tokens for all of "asp", "asp.", "asp.net", ".net" and "net"? (A rough analyzer sketch follows below the quoted message.)

Cheers,
Rob

--
IntelCompute
Web Design and Online Marketing
http://www.intelcompute.com

-----Original Message-----
From: Chris Hostetter <hossman_luc...@fucit.org>
Reply-to: solr-user@lucene.apache.org
To: solr-user@lucene.apache.org
Subject: Re: Which Tokeniser (and/or filter)
Date: Tue, 7 Feb 2012 15:02:36 -0800 (PST)

: This all seems a bit too much work for such a real-world scenario?

You haven't really told us what your scenario is. You said you want to split tokens on whitespace, full-stop (aka: period) and comma only, but then in response to some suggestions you added comments about other things that you never mentioned previously...

1) evidently you don't want the "." in foo.net to cause a split in tokens?

2) evidently you not only want token splits on newlines, but also position gaps to prevent phrases matching across newlines.

...these are kind of important details that affect the suggestions people might give you. Can you please provide some concrete examples of the types of data you have, the types of queries you want them to match, and the types of queries you *don't* want them to match?

-Hoss
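
(Analyzer sketch referenced above.) For concreteness, here is a minimal sketch of the kind of schema this discussion implies, assuming Solr's WhitespaceTokenizerFactory plus WordDelimiterFilterFactory; the type/field names (text_cv, cv_text) and the exact parameter values are placeholders for illustration, not anything agreed in the thread:

  <fieldType name="text_cv" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <!-- split on whitespace only, so "." and "-" survive into the token -->
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <!-- emit the sub-word parts ("asp"/"net", "sales"/"manager") and also
           keep the original token ("asp.net", "sales-manager") so both forms
           are searchable; catenateWords="1" would additionally emit
           "aspnet"/"salesmanager" -->
      <filter class="solr.WordDelimiterFilterFactory"
              generateWordParts="1"
              generateNumberParts="1"
              catenateWords="0"
              catenateNumbers="0"
              preserveOriginal="1"
              splitOnCaseChange="0"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

  <!-- indexing each line of the snippet as a separate value of a multiValued
       field lets positionIncrementGap keep phrase queries like "fluent german"
       from matching across the line-4/line-5 boundary -->
  <field name="cv_text" type="text_cv" indexed="true" stored="true" multiValued="true"/>

On the WordDelimiterFilter question above: as far as I understand it, the filter emits the sub-word parts ("asp", "net") and, with preserveOriginal="1", the untouched original ("asp.net"), but it does not emit delimiter-attached fragments such as "asp." or ".net" from that token; the standalone ".net" in line 2 survives only because whitespace tokenization already keeps it as its own token.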