You really need to have this page as a handy reference:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters

Look in particular at what happens with WordDelimiterFilterFactory. You're
breaking your tokens up on non-alpha characters, on case changes, and on
letter<->number transitions. Then, with the catenate options set to 1, you're
asking that things "of a kind" be put back together into words. That is why
your URLs come out as single tokens with the punctuation stripped.
You might try StandardTokenizerFactory instead....

Erick

On Wed, Jan 20, 2010 at 12:55 PM, Bogdan Vatkov <bogdan.vat...@gmail.com> wrote:

> that is the field type:
> <fieldType name="body_text" class="solr.TextField"
>     positionIncrementGap="100">
>   <analyzer type="index">
>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>     <!-- in this example, we will only use synonyms at query time
>     <filter class="solr.SynonymFilterFactory"
>         synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
>     -->
>     <!-- Case insensitive stop word removal.
>       add enablePositionIncrements=true in both the index and query
>       analyzers to leave a 'gap' for more accurate phrase queries.
>     -->
>     <filter class="solr.StopFilterFactory"
>         ignoreCase="true"
>         words="stopwords.txt"
>         enablePositionIncrements="true"
>         />
>     <filter class="solr.WordDelimiterFilterFactory"
>         generateWordParts="1" generateNumberParts="1" catenateWords="1"
>         catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
>     <!-- <filter class="solr.LowerCaseFilterFactory"/> -->
>     <!-- <filter class="solr.SnowballPorterFilterFactory"
>         language="English" protected="protwords.txt"/> -->
>     <filter
>         class="org.bogdan.solr.analysis.SnowballPorterWithUnstemLowerCaseFilterFactory"
>         language="English" protected="protwords.txt" unstemmed="unstemmed.txt"/>
>   </analyzer>
>
> and that is the field def:
>
> <field name="msg_body" type="body_text" termVectors="true" indexed="true"
>     stored="true"/>
>
> On Wed, Jan 20, 2010 at 7:53 PM, Erick Erickson <erickerick...@gmail.com> wrote:
>
> > That's really hard to say without seeing your configuration <G>...
> >
> > If your field has WordDelimiterFilterFactory with the proper catenate
> > options set to one, that'd do it.
> >
> > Can you post the relevant parts of your schema?
> >
> > Erick
> >
> > On Wed, Jan 20, 2010 at 12:46 PM, Bogdan Vatkov <bogdan.vat...@gmail.com> wrote:
> >
> > > I am not absolutely sure about what I am saying but I think after
> > > tokenization I get the URLs as single tokens but with all the
> > > "interesting symbols" :) like "/",":" removed from the token.
> > > Is it normal? Is there a chance I misconfigured something?
> > >
> > > Best regards,
> > > Bogdan
> > >
> > > On Wed, Jan 20, 2010 at 7:03 PM, Erick Erickson <erickerick...@gmail.com> wrote:
> > >
> > > > I guess it depends on what you mean by "extract". There's
> > > > nothing that I know of that, say, stores them to a file or
> > > > separate field, or even does anything special with them.
> > > >
> > > > I think StandardTokenizerFactory tries to keep URLs
> > > > together as a token in the field, but it's just another
> > > > token... You should check though....
> > > >
> > > > FWIW
> > > > Erick
> > > >
> > > > On Wed, Jan 20, 2010 at 9:52 AM, Bogdan Vatkov <bogdan.vat...@gmail.com> wrote:
> > > >
> > > > > Sorry, I meant completely server-side - even more I want that at
> > > > > indexing time (I do not care about query-time as I am reading later
> > > > > the whole index anyway).
> > > > >
> > > > > On Wed, Jan 20, 2010 at 2:40 AM, Erick Erickson <erickerick...@gmail.com> wrote:
> > > > >
> > > > > > Do you mean you want the URLs to be extracted on the client?
> > > > > > If so, no. Filters/analyzers reside on the server, not the client.
> > > > > > You'll have to do it with custom code....
> > > > > >
> > > > > > Erick
> > > > > >
> > > > > > On Tue, Jan 19, 2010 at 5:48 PM, Bogdan Vatkov <bogdan.vat...@gmail.com> wrote:
> > > > > >
> > > > > > > Hi,
> > > > > > >
> > > > > > > I want to extract URLs (http://..., as well as file://... or
> > > > > > > even //.....) while pushing documents into Solr.
> > > > > > > Is it possible with the Filters/Analyzers available nowadays?
> > > > > > > I looked into the doc but could not find anything related to it.
> > > > > > >
> > > > > > > Best regards,
> > > > > > > Bogdan
> > > > >
> > > > > --
> > > > > Best regards,
> > > > > Bogdan
>
> --
> Best regards,
> Bogdan
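
For reference, here is a minimal sketch of an analyzer chain that leaves URLs untouched, following the point at the top of this message that WordDelimiterFilterFactory is what strips "/" and ":" and glues the pieces back together. It is untested against the schema quoted above. The names url_text and msg_body_urls are hypothetical and do not come from the thread. The sketch assumes that URLs in the message body are separated from the surrounding text by whitespace.

```xml
<!-- Hypothetical field type; the name "url_text" is an assumption.
     WhitespaceTokenizerFactory splits on whitespace only. With no
     WordDelimiterFilterFactory in the chain, nothing removes "/", ":"
     or "." from a URL, so each URL is indexed as a single token. -->
<fieldType name="url_text" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldType>

<!-- Hypothetical second field that receives a copy of msg_body, so the
     existing body_text analysis of msg_body stays as it is. -->
<field name="msg_body_urls" type="url_text" indexed="true" stored="false"
       termVectors="true"/>
<copyField source="msg_body" dest="msg_body_urls"/>
```

This does not extract URLs into a field of their own. It only keeps them whole among the other tokens, so whatever reads the index afterwards would still have to pick out the terms that start with http://, file:// or //, and trailing punctuation such as a comma or closing parenthesis stays attached to the token. Pulling out only the URLs at index time would still need custom code, as noted earlier in the thread.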