subject:"Extracting top level URL when indexing document"

Re: Extracting top level URL when indexing document

2018-06-19 Thread Gus Heck

I don't understand the inclusion of 'n' in the character classes in this pattern. It's pretty clear that the broken examples in the OP were the ones where the letter n occurred in the domain name. I expect a similar problem for user parts that contain n:

^https?://(?:[^@/n]+@)?(?:www.)?([^:/n]+)

On Tu
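The effect described in this message can be reproduced outside Solr. The following is a minimal Python sketch, not the poster's schema: it runs the pattern exactly as quoted against a few made-up sample URLs, next to a variant that assumes the 'n' was meant to be a newline escape and that the dot after www was meant literally. Neither assumption is confirmed anywhere in the thread, and the helper name `host` and the sample URLs are hypothetical.

```python
import re

# Pattern exactly as quoted in the thread: a bare 'n' sits inside both
# negated character classes, so matching stops at the first letter n.
QUOTED = re.compile(r"^https?://(?:[^@/n]+@)?(?:www.)?([^:/n]+)")

# ASSUMPTION (not confirmed in the thread): the 'n' was intended as '\n'
# and the dot after 'www' was intended as a literal dot.
ASSUMED_FIX = re.compile(r"^https?://(?:[^@/\n]+@)?(?:www\.)?([^:/\n]+)")


def host(pattern, url):
    """Return capture group 1 (the host part) or None if there is no match."""
    m = pattern.match(url)
    return m.group(1) if m else None


# Hypothetical sample URLs, chosen to contain (or avoid) the letter n.
SAMPLES = [
    "http://www.johnson.example.com/page",  # n in the domain name
    "https://anna@example.org:8080/x",      # n in the user part
    "http://www.solr.apache.org/guide",     # no n before the path
]

for url in SAMPLES:
    print(url, "|", host(QUOTED, url), "|", host(ASSUMED_FIX, url))
```

With the quoted pattern, the capture for the first sample ends at `joh` and the second at `a`, while the third sample, which has no n before the path, comes out whole. That fits the report that extraction works most of the time.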

RE: [EXT] Re: Extracting top level URL when indexing document

2018-06-13 Thread Hanjan, Harinder

urlProcessor

I will look at how to submit a patch to the Java doc. Thanks!

Harinder

-Original Message-
From: Alexandre Rafalovitch [mailto:arafa...@gmail.com]
Sent: Wednesday, June 13, 2018 12:13 AM
To: solr-user
Subject: [EXT] Re: Extracting top level URL when

Re: Extracting top level URL when indexing document

2018-06-12 Thread Alexandre Rafalovitch

Try URLClassifyProcessorFactory in the processing chain instead, configured in solrconfig.xml. There is very little documentation for it, so check the source for the exact params, or search for the blog post that introduced it several years ago. Documentation patches would be welcome.

Regards, Alex

Re: Extracting top level URL when indexing document

2018-06-12 Thread Kevin Risden

It looks like the stop words (in, and, on) are what is breaking. The regex itself looks correct.

Kevin Risden

On Tue, Jun 12, 2018, 18:02 Hanjan, Harinder wrote:
> Hello!
>
> I am indexing web documents and have a need to extract their top-level URL to be stored in a different field. I have had s

Extracting top level URL when indexing document

2018-06-12 Thread Hanjan, Harinder

Hello! I am indexing web documents and have a need to extract their top-level URL to be stored in a different field. I have had some success with the PatternTokenizerFactory (relevant schema bits at the bottom) but the behavior appears to be inconsistent. Most of the time, the top level URL i


5 matches
