Well, it is hard to get a specific anchor because there is usually more than
one. The content of the anchor field should be correct. What would you expect
if there are multiple anchors?
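
For example, with the index-anchor plugin enabled the target document gets a
multivalued anchor field, typically one value per inlink anchor text. A Solr
XML response for such a document could then look roughly like this (field
names from the stock Nutch schema, values here only illustrative):

<doc>
  <str name="url">http://www.example.com/docs/example.pdf</str>
  <arr name="anchor">
    <str>Get the PDF!</str>
    <str>Example Document</str>
  </arr>
</doc>
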
-----Original message-----
> From:Teague James <teag...@insystechinc.com>
> Sent: Friday 17th January 2014 18:13
> To: solr-user@lucene.apache.org
> Subject: RE: Indexing URLs from websites
>
> Progress!
>
> I changed the value of that property in nutch-default.xml and I am getting
> the anchor field now. However, the content going in there is a bit random and
> doesn't seem to correlate to the pages I'm crawling. The primary objective is
> that when there is a link to a file on the page, such as
> ...href="/blah/somefile.pdf">Get the PDF!<... (using ... to prevent actual
> code in the email), I want to capture that URL and the anchor text "Get the
> PDF!" into field(s).
>
> Am I going in the right direction on this?
>
> Thank you so much for sticking with me on this - I really appreciate your
> help!
>
> -----Original Message-----
> From: Markus Jelsma [mailto:markus.jel...@openindex.io]
> Sent: Friday, January 17, 2014 6:46 AM
> To: solr-user@lucene.apache.org
> Subject: RE: Indexing URLs from websites
>
>
>
>
>
> -----Original message-----
> > From:Teague James <teag...@insystechinc.com>
> > Sent: Thursday 16th January 2014 20:23
> > To: solr-user@lucene.apache.org
> > Subject: RE: Indexing URLs from websites
> >
> > Okay. I had used that previously and I just tried it again. The following
> > generated no errors:
> >
> > bin/nutch solrindex http://localhost/solr/ crawl/crawldb -linkdb crawl/linkdb -dir crawl/segments/
> >
> > Solr is still not getting an anchor field and the outlinks are not
> > appearing in the index anywhere else.
> >
> > To be sure I deleted the crawl directory and did a fresh crawl using:
> >
> > bin/nutch crawl urls -dir crawl -depth 3 -topN 50
> >
> > Then
> >
> > bin/nutch solrindex http://localhost/solr/ crawl/crawldb -linkdb crawl/linkdb -dir crawl/segments/
> >
> > No errors, but no anchor fields or outlinks. One thing in the response from
> > the crawl that I found interesting was a line that said:
> >
> > LinkDb: internal links will be ignored.
>
> Good catch! That is likely the problem.
>
> >
> > What does that mean?
>
> <property>
> <name>db.ignore.internal.links</name>
> <value>true</value>
> <description>If true, when adding new links to a page, links from
> the same host are ignored. This is an effective way to limit the
> size of the link database, keeping only the highest quality
> links.
> </description>
> </property>
>
> So change the property, rebuild the linkdb and try reindexing once again :)
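>
> For example, you would typically override it in nutch-site.xml rather than
> editing nutch-default.xml directly, with something like:
>
> <property>
> <name>db.ignore.internal.links</name>
> <value>false</value>
> </property>
>
> and then rebuild the linkdb from the existing segments and reindex (paths
> assumed from the commands used elsewhere in this thread):
>
> bin/nutch invertlinks crawl/linkdb -dir crawl/segments/
> bin/nutch solrindex http://localhost/solr/ crawl/crawldb -linkdb crawl/linkdb -dir crawl/segments/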
>
> >
> > -----Original Message-----
> > From: Markus Jelsma [mailto:markus.jel...@openindex.io]
> > Sent: Thursday, January 16, 2014 11:08 AM
> > To: solr-user@lucene.apache.org
> > Subject: RE: Indexing URLs from websites
> >
> > Usage: SolrIndexer <solr url> <crawldb> [-linkdb <linkdb>] [-params
> > k1=v1&k2=v2...] (<segment> ... | -dir <segments>) [-noCommit]
> > [-deleteGone] [-deleteRobotsNoIndex] [-deleteSkippedByIndexingFilter]
> > [-filter] [-normalize]
> >
> > You must point to the linkdb via the -linkdb parameter.
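> >
> > For example, with the paths used elsewhere in this thread, that would be
> > roughly:
> >
> > bin/nutch solrindex http://localhost/solr/ crawl/crawldb -linkdb crawl/linkdb -dir crawl/segments/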
> >
> > -----Original message-----
> > > From:Teague James <teag...@insystechinc.com>
> > > Sent: Thursday 16th January 2014 16:57
> > > To: solr-user@lucene.apache.org
> > > Subject: RE: Indexing URLs from websites
> > >
> > > Okay. I changed my solrindex to this:
> > >
> > > bin/nutch solrindex http://localhost/solr/ crawl/crawldb crawl/linkdb crawl/segments/20140115143147
> > >
> > > I got the same errors:
> > > Indexer: org.apache.hadoop.mapred.InvalidInputException: Input path
> > > does not exist: file:/.../crawl/linkdb/crawl_fetch
> > > Input path does not exist: file:/.../crawl/linkdb/crawl_parse
> > > Input path does not exist: file:/.../crawl/linkdb/parse_data
> > > Input path does not exist: file:/.../crawl/linkdb/parse_text
> > > (along with a Java stack trace)
> > >
> > > Those linkdb folders are not being created.
> > >
> > > -----Original Message-----
> > > From: Markus Jelsma [mailto:markus.jel...@openindex.io]
> > > Sent: Thursday, January 16, 2014 10:44 AM
> > > To: solr-user@lucene.apache.org
> > > Subject: RE: Indexing URLs from websites
> > >
> > > Hi - you cannot use wildcards for segments. You need to give one segment
> > > or a -dir segments_dir. Check the usage of your indexer command.
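> > >
> > > For example, either of these forms should work (segment name taken from
> > > one of your other commands):
> > >
> > > bin/nutch solrindex http://localhost/solr/ crawl/crawldb -linkdb crawl/linkdb crawl/segments/20140115143147
> > > bin/nutch solrindex http://localhost/solr/ crawl/crawldb -linkdb crawl/linkdb -dir crawl/segments/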
> > >
> > > -----Original message-----
> > > > From:Teague James <teag...@insystechinc.com>
> > > > Sent: Thursday 16th January 2014 16:43
> > > > To: solr-user@lucene.apache.org
> > > > Subject: RE: Indexing URLs from websites
> > > >
> > > > Hello Markus,
> > > >
> > > > I do get a linkdb folder in the crawl folder - but it is created
> > > > automatically by Nutch at the time I execute the command. I just
> > > > tried to use solrindex against yesterday's crawl and did not get any
> > > > errors, but I did not get the anchor field or any of the outlinks. I
> > > > used this command:
> > > > bin/nutch solrindex http://localhost/solr/ crawl/crawldb -linkdb crawl/linkdb crawl/segments/*
> > > >
> > > > I then tried:
> > > > bin/nutch solrindex http://localhost/solr/ crawl/crawldb crawl/linkdb crawl/segments/*
> > > > This produced the following errors:
> > > > Indexer: org.apache.hadoop.mapred.InvalidInputException: Input path
> > > > does not exist: file:/.../crawl/linkdb/crawl_fetch
> > > > Input path does not exist: file:/.../crawl/linkdb/crawl_parse
> > > > Input path does not exist: file:/.../crawl/linkdb/parse_data
> > > > Input path does not exist: file:/.../crawl/linkdb/parse_text
> > > > (along with a Java stack trace)
> > > >
> > > > So I tried invertlinks as you had previously suggested. No errors,
> > > > but the missing directories above were still not created. Running the
> > > > same solrindex command as above then produced the same errors.
> > > >
> > > > When/How are the missing directories supposed to be created?
> > > >
> > > > I really appreciate the help! Thank you very much!
> > > >
> > > > -----Original Message-----
> > > > From: Markus Jelsma [mailto:markus.jel...@openindex.io]
> > > > Sent: Thursday, January 16, 2014 5:45 AM
> > > > To: solr-user@lucene.apache.org
> > > > Subject: RE: Indexing URLs from websites
> > > >
> > > >
> > > > -----Original message-----
> > > > > From:Teague James <teag...@insystechinc.com>
> > > > > Sent: Wednesday 15th January 2014 22:01
> > > > > To: solr-user@lucene.apache.org
> > > > > Subject: Re: Indexing URLs from websites
> > > > >
> > > > > I am still unsuccessful in getting this to work. My expectation
> > > > > is that the index-anchor plugin should produce values for the
> > > > > anchor field. However, this field is not showing up in my Solr
> > > > > index no matter what I try.
> > > > >
> > > > > Here's what I have in my nutch-site.xml for plugins:
> > > > > <value>protocol-http|urlfilter-regex|parse-html|index-(basic|anchor)|query-(basic|site|url)|indexer-solr|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
> > > > >
> > > > > I am using the schema-solr4.xml from the Nutch package and I
> > > > > added the _version_ field.
> > > > >
> > > > > Here's the command I'm running:
> > > > > bin/nutch crawl urls -solr http://localhost/solr -depth 3 -topN 50
> > > > >
> > > > > The fields that Solr returns are:
> > > > > content, title, segment, boost, digest, tstamp, id, url, and _version_
> > > > >
> > > > > Note that the url field is the url of the page being indexed and
> > > > > not the
> > > > > url(s) of the documents that may be outlinks on that page. It is
> > > > > the outlinks that I am trying to get into the index.
> > > > >
> > > > > What am I missing? I also tried using the invertlinks command
> > > > > that Markus suggested, but that did not work either, though I do
> > > > > appreciate the suggestion.
> > > >
> > > > That did get you a LinkDB, right? You need to call solrindex and pass
> > > > the linkdb's location as part of the arguments; only then does Nutch
> > > > know about it and use the data contained in the LinkDB, together with
> > > > the index-anchor plugin, to write the anchor field in your Solr index.
> > > >
> > > > >
> > > > > Any help is appreciated! Thanks!
> > > > >
> > > > > <Markus Jelsma> Wrote:
> > > > > You need to use the invertlinks command to build a database of docs
> > > > > with their inlinks and anchors. Then use the index-anchor plugin
> > > > > when indexing. You will then have a multivalued field containing the
> > > > > anchors pointing to your document.
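> > > > >
> > > > > For example, something roughly like this (linkdb and segments paths
> > > > > are just an illustration):
> > > > >
> > > > > bin/nutch invertlinks crawl/linkdb -dir crawl/segments/
> > > > >
> > > > > and then pass that linkdb to the indexing command via -linkdb.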
> > > > >
> > > > > <Teague James> Wrote:
> > > > > I am trying to index a website that contains links to documents
> > > > > such as PDF, Word, etc. The intent is to be able to store the
> > > > > URLs for the links to the documents.
> > > > >
> > > > > For example, when indexing www.example.com which has links on
> > > > > the page like "Example Document" which points to
> > > > > www.example.com/docs/example.pdf, I want Solr to store the text
> > > > > of the link, "Example Document", and the URL for the link,
> > > > > "www.example.com/docs/example.pdf" in separate fields. I've
> > > > > tried using Nutch 1.7 with Solr 4.6.0 and have successfully
> > > > > indexed the page content, but I am not getting the URLs from the
> > > > > links. There are no document type restrictions in Nutch for PDF
> > > > > or Word. Any suggestions on how I can accomplish this? Should I use a
> > > > > different method than Nutch for crawling the site?
> > > > >
> > > > > I appreciate any help on this!