I don't understand the inclusion of 'n' in the character classes in this
pattern... it's pretty clear that the broken examples in the OP were where
the letter n occurred in the domain name. I expect a similar problem for
user parts that contain n...
^https?://(?:[^@/n]+@)?(?:www.)?([^:/n]+)
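To make the effect of the literal 'n' concrete, here is a minimal Java sketch using java.util.regex. The first pattern is the one quoted above, character for character. The second is only an assumption about what was intended (a newline escape `\n` in the classes and an escaped dot after www). The class name, helper method, and sample URLs are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HostRegexDemo {
    // Pattern as quoted in the thread: 'n' is a literal member of both negated classes,
    // so matching stops at the first letter n.
    static final Pattern AS_POSTED =
        Pattern.compile("^https?://(?:[^@/n]+@)?(?:www.)?([^:/n]+)");

    // ASSUMPTION: the intended pattern excluded newlines (\n) rather than the letter n,
    // and meant a literal dot after "www".
    static final Pattern ASSUMED_INTENT =
        Pattern.compile("^https?://(?:[^@/\\n]+@)?(?:www\\.)?([^:/\\n]+)");

    // Returns capture group 1 (the host part) or null when the pattern does not match.
    static String host(Pattern p, String url) {
        Matcher m = p.matcher(url);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String domainWithN = "http://www.sunnyday.com/news/index.html";
        String userWithN = "http://john@example.org/";

        System.out.println(host(AS_POSTED, domainWithN));      // su
        System.out.println(host(ASSUMED_INTENT, domainWithN)); // sunnyday.com
        System.out.println(host(AS_POSTED, userWithN));        // joh
        System.out.println(host(ASSUMED_INTENT, userWithN));   // example.org
    }
}
```

With the pattern as posted, the capture is cut off at the first n in the domain, and a user part containing n is mistaken for the host.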
On Tu
urlProcessor
I will look at how to submit a patch to the Java doc.
Thanks!
Harinder
-----Original Message-----
From: Alexandre Rafalovitch [mailto:arafa...@gmail.com]
Sent: Wednesday, June 13, 2018 12:13 AM
To: solr-user
Subject: [EXT] Re: Extracting top level URL when
Try URLClassifyProcessorFactory in the processing chain instead, configured
in solrconfig.xml.
There is very little documentation for it, so check the source for exact
params. Or search for the blog post introducing it several years ago.
Documentation patches would be welcome.
Regards,
Alex
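For comparison with the regex approach, a URL-aware parser avoids hand-written character classes altogether. The following is a minimal sketch using plain java.net.URI. It is not the URLClassifyProcessorFactory source and makes no claim about that processor's parameters or output. The class name, method name, and sample URLs are hypothetical.

```java
import java.net.URI;
import java.net.URISyntaxException;

class TopLevelUrl {
    // Returns scheme://host[:port] for an absolute URL, or null when the input
    // cannot be parsed or has no scheme/host.
    static String topLevel(String url) {
        try {
            URI u = new URI(url.trim());
            if (u.getScheme() == null || u.getHost() == null) {
                return null;
            }
            String base = u.getScheme() + "://" + u.getHost();
            // getPort() returns -1 when no port is given explicitly.
            return u.getPort() == -1 ? base : base + ":" + u.getPort();
        } catch (URISyntaxException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        // https://www.example.com
        System.out.println(topLevel("https://www.example.com/news/index.html?x=1"));
        // http://intranet.example.org:8080
        System.out.println(topLevel("http://user@intranet.example.org:8080/a"));
    }
}
```

Because the parser splits user info, host, and port itself, letters such as n in the host or user part are not a special case.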
It looks like the stop words (in, and, on) are what is breaking this. The
regex itself looks correct.
Kevin Risden
On Tue, Jun 12, 2018, 18:02 Hanjan, Harinder
wrote:
> Hello!
>
> I am indexing web documents and have a need to extract their top-level URL
> to be stored in a different field. I have had s
Hello!
I am indexing web documents and have a need to extract their top-level URL to
be stored in a different field. I have had some success with the
PatternTokenizerFactory (relevant schema bits at the bottom) but the behavior
appears to be inconsistent. Most of the time, the top level URL i