Re: Solr web crawler with recursive option

2019-04-12 Thread Andrew MacKay
You should look at Apache Nutch, which has Solr client support. It has all
the indexing options you need and a schema for building a Solr collection
with all the fields required for indexing.

We have used it and it works well. It also supports sitemap.xml, which
simplifies indexing.

On Fri, Apr 12, 2019 at 6:43 AM Jan Høydahl  wrote:

> I think there may actually be a bug. I was not able to crawl some other
> web site either.
>
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
>
> > 11. apr. 2019 kl. 18:55 skrev Erick Erickson :
> >
> > You are sending malformed XML to Solr. This can be something as silly as
> having extra spaces at the beginning. I’d capture the page being sent to
> Solr and put it in a formatter to check it….
> >
> > Best,
> > Erick
> >
> >> On Apr 11, 2019, at 3:49 AM, Shivprasad Shetty <
> shivpras...@orioninc.com> wrote:
> >>
> >> Hello Team,
> >>
> >>
> >>   I am working with Solr for the first time and have finished the
> >> setup. I have created a core from the command line and now want to
> >> crawl a third-party site.
> >> If I try it with individual links, I am able to crawl them and index
> >> them into the core. This was done using:
> >> java -Ddata=web -Durl=https://solr:8985/solr/solrhelp/update -jar post.jar http://www.example.com
> >>
> >> What I now intend to do is give it a URL and use the recursive option
> >> (-Drecursive) to let it crawl the entire site.
> >> Note that I am pointing it at a website that has around 125 pages, and I
> >> am using the commands below:
> >> java -Ddata=web -Durl=https://solr:8985/solr/solrhelp/update -Drecursive=yes -jar post.jar http://www.example.com
> >> and
> >> java -Ddata=web -Durl=https://solr:8985/solr/solrhelp/update -Drecursive=2 -jar post.jar http://www.example.com
> >>
> >> and I am getting the below error message.
> >> Error:
> >>
> >>
> >> POSTed web resource http://www.example.com (depth: 0)
> >> [Fatal Error] :1:1: Content is not allowed in prolog.
> >> Exception in thread "main" java.lang.RuntimeException: org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 1; Content is not allowed in prolog.
> >>   at org.apache.solr.util.SimplePostTool$PageFetcher.getLinksFromWebPage(SimplePostTool.java:1252)
> >>   at org.apache.solr.util.SimplePostTool.webCrawl(SimplePostTool.java:616)
> >>   at org.apache.solr.util.SimplePostTool.postWebPages(SimplePostTool.java:563)
> >>   at org.apache.solr.util.SimplePostTool.doWebMode(SimplePostTool.java:365)
> >>   at org.apache.solr.util.SimplePostTool.execute(SimplePostTool.java:187)
> >>   at org.apache.solr.util.SimplePostTool.main(SimplePostTool.java:172)
> >> Caused by: org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 1; Content is not allowed in prolog.
> >>   at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(Unknown Source)
> >>   at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(Unknown Source)
> >>   at javax.xml.parsers.DocumentBuilder.parse(Unknown Source)
> >>   at org.apache.solr.util.SimplePostTool.makeDom(SimplePostTool.java:1061)
> >>   at org.apache.solr.util.SimplePostTool$PageFetcher.getLinksFromWebPage(SimplePostTool.java:1232)
> >>   ... 5 more
> >>
> >>
> >>
> >> I would be very grateful if anyone could help me solve this issue,
> >> which I have been trying to fix for a couple of days.
> >>
> >>
> >> Regards,
> >> ShivprasadS
> >>
> >>
> >> Confidentiality Notice: This e-mail message, including any attachments,
> is for the sole use of the intended recipient(s) and may contain
> confidential and privileged information. Any unauthorized review, use,
> disclosure or distribution is prohibited. If you are not the intended
> recipient, please contact the sender by reply e-mail, delete and then
> destroy all copies of the original message.
> >
>
>

-- 
CONFIDENTIALITY NOTICE: The information contained in this email is 
privileged and confidential and intended only for the use of the individual 
or entity to whom it is addressed.   If you receive this message in error, 
please notify the sender immediately at 613-729-1100 and destroy the 
original message and all copies. Thank you.
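
The "Content is not allowed in prolog" error quoted in the thread above is
what the JDK's XML parser typically reports when the input it is handed has
non-XML content before the root element. The trace shows it being raised
under SimplePostTool.makeDom, so whatever bytes reached that parse were
rejected at line 1, column 1. As a minimal sketch of Erick's suggestion to
capture the payload and check it, the Java snippet below runs a string
through the same javax.xml.parsers.DocumentBuilder API that appears in the
trace. The class name PrologCheck, the method firstXmlError, and the sample
strings are made up for illustration; they are not part of Solr or post.jar.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

// Hypothetical helper, not part of Solr: checks whether a captured payload
// is well-formed XML using the standard JAXP DOM parser.
final class PrologCheck {

    /** Returns null if the text parses as XML, otherwise a description of the first error. */
    static String firstXmlError(String body) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            builder.parse(new ByteArrayInputStream(body.getBytes(StandardCharsets.UTF_8)));
            return null;
        } catch (SAXParseException e) {
            // Line and column point at the first offending character.
            return e.getLineNumber() + ":" + e.getColumnNumber() + ": " + e.getMessage();
        } catch (SAXException | IOException | ParserConfigurationException e) {
            return e.toString();
        }
    }

    public static void main(String[] args) {
        // Illustrative samples: well-formed XML, a JSON body, and an XML
        // declaration preceded by a stray space.
        String[] samples = {
            "<response/>",
            "{\"responseHeader\":{}}",
            " <?xml version=\"1.0\"?><response/>"
        };
        for (String sample : samples) {
            String error = firstXmlError(sample);
            System.out.println((error == null ? "OK       " : "INVALID  ") + sample
                    + (error == null ? "" : "  -> " + error));
        }
    }
}
```

Replacing the sample strings with the body actually captured from the crawl
should show whether the payload itself is malformed or whether something
other than XML (for example an error page or a non-XML response format) is
being parsed.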


Re: Solr with HDFS configuration example running in production/dev

2020-08-19 Thread Andrew MacKay
I believe HDFS support is being deprecated in Solr, so I am not sure you
want to keep working on this configuration if support is going to disappear.

On Wed, Aug 19, 2020 at 7:52 AM Prashant Jyoti  wrote:

> Hi all,
> Hope you are healthy and safe.
>
> Need some help with HDFS configuration.
>
> Could any of you share an example of the configuration with which you
> are running Solr with HDFS in any of your production/dev environments?
> I am interested in the parts of SolrConfig.xml / Solr.in.cmd/sh which you
> may have modified, obviously with the security parts obfuscated.
>
> I am stuck at an error and unable to move ahead. I am attaching the
> exception log in case anyone is interested in looking at the error.
>
> Thanks!
>
> --
> Regards,
> Prashant.
>



Re: Solr with HDFS configuration example running in production/dev

2020-08-20 Thread Andrew MacKay
> >   org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:221)
> >   at org.eclipse.jetty.server.handler.InetAccessHandler.handle(InetAccessHandler.java:177)
> >   at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:146)
> >   at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
> >   at org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:322)
> >   at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
> >   at org.eclipse.jetty.server.Server.handle(Server.java:500)
> >   at org.eclipse.jetty.server.HttpChannel.lambda$handle$1(HttpChannel.java:383)
> >   at org.eclipse.jetty.server.HttpChannel.dispatch(HttpChannel.java:547)
> >   at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:375)
> >   at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:270)
> >   at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:311)
> >   at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:103)
> >   at org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:117)
> >   at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:336)
> >   at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:313)
> >   at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:171)
> >   at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:129)
> >   at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:388)
> >   at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:806)
> >   at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:938)
> >   at java.lang.Thread.run(Thread.java:748)
> > Caused by: org.apache.solr.common.SolrException: Unable to create core [newcollsolr2_shard1_replica_n1]
> >   at org.apache.solr.core.CoreContainer.createFromDescriptor(CoreContainer.java:1327)
> >   at org.apache.solr.core.CoreContainer.create(CoreContainer.java:1217)
> >   ... 47 more
> > Caused by: org.apache.solr.common.SolrException: Illegal char <:> at index 4: hdfs://hn1-pjhado.tvbhpqtgh3judk1e5ihrx2k21d.tx.internal.cloudapp.net:8020/user/solr-data/newcollsolr2/core_node3/data\
> >   at org.apache.solr.core.SolrCore.<init>(SolrCore.java:1072)
> >   at org.apache.solr.core.SolrCore.<init>(SolrCore.java:901)
> >   at org.apache.solr.core.CoreContainer.createFromDescriptor(CoreContainer.java:1306)
> >   ... 48 more
> > Caused by: java.nio.file.InvalidPathException: Illegal char <:> at index 4: hdfs://hn1-pjhado.tvbhpqtgh3judk1e5ihrx2k21d.tx.internal.cloudapp.net:8020/user/solr-data/newcollsolr2/core_node3/data\
> >   at sun.nio.fs.WindowsPathParser.normalize(WindowsPathParser.java:182)
> >   at sun.nio.fs.WindowsPathParser.parse(WindowsPathParser.java:153)
> >   at sun.nio.fs.WindowsPathParser.parse(WindowsPathParser.java:77)
> >   at sun.nio.fs.WindowsPath.parse(WindowsPath.java:94)
> >   at sun.nio.fs.WindowsFileSystem.getPath(WindowsFileSystem.java:255)
> >   at sun.nio.fs.AbstractPath.resolve(AbstractPath.java:53)
> >   at org.apache.solr.core.SolrCore.initUpdateLogDir(SolrCore.java:1380)