Re: Extracting URLs while indexing

Bogdan Vatkov Wed, 20 Jan 2010 12:04:14 -0800

Now I see I didn't review all the config that I took from the default
config.
I removed the WordDelimiterFilter, and the StandardTokenizer seems to keep URLs
but splits relative paths (e.g. /file/location/file.txt), and I want to keep
such paths as a single token.
Any ideas?
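
For illustration, here is a minimal sketch of one direction this could take,
assuming it is acceptable to split on whitespace only.
WhitespaceTokenizerFactory never splits inside a run of non-whitespace
characters, so both http://... URLs and relative paths such as
/file/location/file.txt come out as single tokens, as long as no
WordDelimiterFilterFactory follows it. The field type name "path_text" is
hypothetical, and the stop filter settings are simply copied from the
body_text type quoted below. The custom stemming filter is left out because
how it treats such tokens is not known here.

```xml
<!-- Sketch only: "path_text" is a hypothetical field type name. -->
<fieldType name="path_text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <!-- Splits on whitespace only, so "/", ":" and "." stay inside the token. -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Same stop word setup as in the quoted body_text type. -->
    <filter class="solr.StopFilterFactory"
            ignoreCase="true"
            words="stopwords.txt"
            enablePositionIncrements="true"/>
    <!-- Deliberately no WordDelimiterFilterFactory: that filter is what
         breaks tokens apart on non-alphanumeric characters. -->
  </analyzer>
</fieldType>
```

The trade-off is that punctuation touching a path (a trailing comma,
surrounding parentheses) stays attached to the token, so some cleanup may
still be needed.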


On Wed, Jan 20, 2010 at 8:13 PM, Erick Erickson <erickerick...@gmail.com> wrote:

> You really need to have this page as a handy reference.....
>
> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters
>
> Look in
> particular at what happens with
> WordDelimiterFilterFactory,
> you're breaking your tokens up on non-alpha characters and
> case change and letter<->number transitions. Then
> you're asking that things "of a kind" be put back into
> words.
>
> You might try StandardTokenizerFactory instead....
>
> Erick
>
> On Wed, Jan 20, 2010 at 12:55 PM, Bogdan Vatkov <bogdan.vat...@gmail.com> wrote:
>
> > that is the field type:
> >    <fieldType name="body_text" class="solr.TextField"
> > positionIncrementGap="100">
> >      <analyzer type="index">
> >        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >        <!-- in this example, we will only use synonyms at query time
> >        <filter class="solr.SynonymFilterFactory"
> > synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
> >        -->
> >        <!-- Case insensitive stop word removal.
> >          add enablePositionIncrements=true in both the index and query
> >          analyzers to leave a 'gap' for more accurate phrase queries.
> >        -->
> >        <filter class="solr.StopFilterFactory"
> >                ignoreCase="true"
> >                words="stopwords.txt"
> >                enablePositionIncrements="true"
> >                />
> >        <filter class="solr.WordDelimiterFilterFactory"
> > generateWordParts="1" generateNumberParts="1" catenateWords="1"
> > catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
> > <!--        <filter class="solr.LowerCaseFilterFactory"/> -->
> > <!--        <filter class="solr.SnowballPorterFilterFactory"
> > language="English" protected="protwords.txt"/> -->
> >        <filter class="org.bogdan.solr.analysis.SnowballPorterWithUnstemLowerCaseFilterFactory"
> >                language="English" protected="protwords.txt" unstemmed="unstemmed.txt"/>
> >      </analyzer>
> >
> >
> > and that is the field def:
> >
> > <field name="msg_body" type="body_text" termVectors="true" indexed="true"
> > stored="true"/>
> >
> >
> > On Wed, Jan 20, 2010 at 7:53 PM, Erick Erickson <erickerick...@gmail.com> wrote:
> >
> > > That's really hard to say without seeing your configuration <G>...
> > >
> > > If your field has WordDelimiterFilterFactory with the proper catenate
> > > options set to 1, that'd do it.
> > >
> > > Can you post the relevant parts of your schema?
> > >
> > > Erick
> > >
> > > On Wed, Jan 20, 2010 at 12:46 PM, Bogdan Vatkov <bogdan.vat...@gmail.com> wrote:
> > >
> > > > I am not absolutely sure about what I am saying but I think after
> > > > tokenization I get the URLs as single tokens but with all the
> > > > "interesting symbols" :) like "/", ":" removed from the token.
> > > > Is it normal? Is there a chance I misconfigured something?
> > > >
> > > > Best regards,
> > > > Bogdan
> > > >
> > > > On Wed, Jan 20, 2010 at 7:03 PM, Erick Erickson <erickerick...@gmail.com> wrote:
> > > >
> > > > > I guess it depends on what you mean by "extract". There's
> > > > > nothing that I know of that, say, stores them to a file or
> > > > > separate field, or even does anything special with them.
> > > > >
> > > > > I think StandardTokenizerFactory tries to keep URLs
> > > > > together as a token in the field, but it's just another
> > > > > token... You should check though....
> > > > >
> > > > > FWIW
> > > > > Erick
> > > > >
> > > > > On Wed, Jan 20, 2010 at 9:52 AM, Bogdan Vatkov <bogdan.vat...@gmail.com> wrote:
> > > > >
> > > > > > Sorry, I meant completely server-side. Even more, I want that at
> > > > > > indexing time (I do not care about query time, as I read the whole
> > > > > > index later anyway).
> > > > > >
> > > > > > On Wed, Jan 20, 2010 at 2:40 AM, Erick Erickson <erickerick...@gmail.com> wrote:
> > > > > >
> > > > > > > Do you mean you want the URLs to be extracted on the client?
> > > > > > > If so, no. Filters/analyzers reside on the server, not the client.
> > > > > > > You'll have to do it with custom code....
> > > > > > >
> > > > > > > Erick
> > > > > > >
> > > > > > > On Tue, Jan 19, 2010 at 5:48 PM, Bogdan Vatkov <bogdan.vat...@gmail.com> wrote:
> > > > > > >
> > > > > > > > Hi,
> > > > > > > >
> > > > > > > > I want to extract URLs (http://..., as well as file://... or
> > > > > > > > even //.....) while pushing documents into Solr.
> > > > > > > > Is it possible with the Filters/Analyzers available nowadays?
> > > > > > > > I looked into the doc but could not find anything related to it.
> > > > > > > >
> > > > > > > > Best regards,
> > > > > > > > Bogdan
> > > > > > > >
> > > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Best regards,
> > > > > > Bogdan
> > > > > >
> > > > >
> > > >
> > >
> >
> >
> >
> > --
> > Best regards,
> > Bogdan
> >
>



-- 
Best regards,
Bogdan
