Re: Extracting top level URL when indexing document

2018-06-19 Thread Gus Heck
I don't understand the inclusion of 'n' in the character classes in this pattern... it's pretty clear that the broken examples in the OP were where the letter n occurred in the domain name. I expect a similar problem for user parts that contain n... ^https?://(?:[^@/n]+@)?(?:www.)?([^:/n]+) On Tu
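To illustrate the point about 'n' in the character classes, here is a minimal sketch in Python. It runs the pattern exactly as quoted above and compares it with a variant that uses `\n` (newline) in place of the bare letter. That variant is only an assumption about what the pattern may have been meant to say; nothing in the thread confirms it. The sample URLs and the `host` helper are hypothetical.

```python
import re

# Pattern exactly as quoted in the message: 'n' is a literal letter
# inside both negated character classes.
QUOTED = re.compile(r"^https?://(?:[^@/n]+@)?(?:www.)?([^:/n]+)")

# Assumed intent (NOT confirmed in the thread): '\n' (newline) instead of 'n'.
ASSUMED = re.compile(r"^https?://(?:[^@/\n]+@)?(?:www.)?([^:/\n]+)")


def host(pattern, url):
    """Return capture group 1 (the host part) or None if there is no match."""
    m = pattern.match(url)
    return m.group(1) if m else None


# Hypothetical sample URLs: one without 'n', one with 'n' in the domain,
# and one with 'n' in the user part.
samples = [
    "http://www.example.com/page",
    "https://www.python.org/docs",
    "http://anna@example.org/",
]

for url in samples:
    print(url, "quoted:", host(QUOTED, url), "assumed:", host(ASSUMED, url))
```

With the quoted pattern, the capture stops at the first letter n, so a domain containing n is cut short, and a user part containing n stops the optional user group from matching at all. The constructs used here behave the same way in Java's `java.util.regex`, which is what Solr uses.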

RE: [EXT] Re: Extracting top level URL when indexing document

2018-06-13 Thread Hanjan, Harinder
urlProcessor I will look at how to submit a patch to the Java doc. Thanks! Harinder -Original Message- From: Alexandre Rafalovitch [mailto:arafa...@gmail.com] Sent: Wednesday, June 13, 2018 12:13 AM To: solr-user Subject: [EXT] Re: Extracting top level URL when

Re: Extracting top level URL when indexing document

2018-06-12 Thread Alexandre Rafalovitch
Try URLClassifyProcessorFactory in the processing chain instead, configured in solrconfig.xml. There is very little documentation for it, so check the source for the exact params. Or search for the blog post that introduced it several years ago. Documentation patches would be welcome. Regards, Alex

Re: Extracting top level URL when indexing document

2018-06-12 Thread Kevin Risden
Looks like the stop words (in, and, on) are what is breaking it. The regex looks correct. Kevin Risden On Tue, Jun 12, 2018, 18:02 Hanjan, Harinder wrote: > Hello! > > I am indexing web documents and have a need to extract their top-level URL > to be stored in a different field. I have had s

Extracting top level URL when indexing document

2018-06-12 Thread Hanjan, Harinder
Hello! I am indexing web documents and need to extract their top-level URL to be stored in a different field. I have had some success with the PatternTokenizerFactory (relevant schema bits at the bottom) but the behavior appears to be inconsistent. Most of the time, the top level URL i