Re: Extracting URLs while indexing

2010-01-20 Thread Bogdan Vatkov
Now I see that I didn't review all of the config I took from the default config. I removed the WordDelimiterFilter, and the StandardTokenizer now seems to keep URLs but splits relative paths (e.g. /file/location/file.txt), which I want to keep as single tokens. Any ideas? On Wed, Jan 20, 2010 at 8:13 PM, E

Re: Extracting URLs while indexing

2010-01-20 Thread Erick Erickson
You really need to have this page as a handy reference. http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters Look in particular at what happens with WordDelimiterFilterFactory, you're breaking your tokens up on non-alpha char

Re: Extracting URLs while indexing

2010-01-20 Thread Bogdan Vatkov
that is the field type: and that is the field def: On Wed, Jan 20, 2010 at 7:53 PM, Erick Erickson wrote: > That's really hard to say without seeing your configuration ... > > If your field has WordDelimiterFilterFactory wi

Re: Extracting URLs while indexing

2010-01-20 Thread Erick Erickson
That's really hard to say without seeing your configuration ... If your field has WordDelimiterFilterFactory with the proper catenate options set to one, that'd do it. Can you post the relevant parts of your schema? Erick On Wed, Jan 20, 2010 at 12:46 PM, Bogdan Vatkov wrote: > I am not absolutely s

Re: Extracting URLs while indexing

2010-01-20 Thread Bogdan Vatkov
I am not absolutely sure about this, but I think that after tokenization I get the URLs as single tokens, with all the "interesting symbols" :) like "/" and ":" removed from the token. Is that normal? Is there a chance I misconfigured something? Best regards, Bogdan On Wed, Jan 20, 2010 at 7:0

Re: Extracting URLs while indexing

2010-01-20 Thread Erick Erickson
I guess it depends on what you mean by "extract". There's nothing that I know of that, say, stores them to a file or separate field, or even does anything special with them. I think StandardTokenizerFactory tries to keep URLs together as a token in the field, but it's just another token... You sho

Re: Extracting URLs while indexing

2010-01-20 Thread Bogdan Vatkov
Sorry, I meant completely server-side. Moreover, I want this at indexing time (I do not care about query time, as I read the whole index later anyway). On Wed, Jan 20, 2010 at 2:40 AM, Erick Erickson wrote: > Do you mean you want the URLs to be extracted on the client? > If so, no. Filters/

Re: Extracting URLs while indexing

2010-01-19 Thread Erick Erickson
Do you mean you want the URLs to be extracted on the client? If so, no. Filters/analyzers reside on the server, not the client. You'll have to do it with custom code. Erick On Tue, Jan 19, 2010 at 5:48 PM, Bogdan Vatkov wrote: > Hi, > > I want to extract URLs (http://..., as well as file://..

Extracting URLs while indexing

2010-01-19 Thread Bogdan Vatkov
Hi, I want to extract URLs (http://..., as well as file://... or even //.) while pushing documents into Solr. Is it possible with the Filters/Analyzers available nowadays? I looked into the doc but could not find anything related to it. Best regards, Bogdan
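
For illustration only, here is a minimal sketch of what URL extraction in custom code might look like if it were done on the document text before it is pushed into Solr. It is not a Solr filter or analyzer and is not taken from the thread. The class name UrlExtractor, the method extract, the regular expression, and the file:///tmp/a.txt sample are all assumptions made for this example; the pattern is deliberately simple and will, for instance, keep trailing punctuation attached to a match.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Solr: pulls URL-like strings out of plain text.
class UrlExtractor {
    // Assumed pattern: an optional "http:", "https:" or "file:" scheme, then "//",
    // then everything up to the next whitespace, quote, or angle bracket.
    private static final Pattern URL =
            Pattern.compile("(?:(?:https?|file):)?//[^\\s\"'<>]+");

    // Returns every match in the order it appears in the text.
    static List<String> extract(String text) {
        List<String> urls = new ArrayList<>();
        Matcher m = URL.matcher(text);
        while (m.find()) {
            urls.add(m.group());
        }
        return urls;
    }

    public static void main(String[] args) {
        String doc = "See http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters"
                + " and file:///tmp/a.txt for details";
        // Print one extracted URL per line.
        for (String url : extract(doc)) {
            System.out.println(url);
        }
    }
}
```

The extracted strings could then be sent along with the document in a separate field of their own, so that they bypass tokenization entirely; whether that fits depends on the schema in use.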