Markus,

With some help from another user on the Nutch list I did a dump and found that the URLs I am trying to capture are in Nutch. However, when I index with Solr those URLs are not making it into the index. What I get in the dump is this:
http://www.example.com/pdfs/article1.pdf
Status: 2 (db_fetched)
Fetch time: [date/time stamp]
Modified time: [date/time stamp]
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 0.0010525313
Signature: null
Metadata:
  Content-Type: application/pdf
  _pst_: success(1), lastModified=0

-----Original Message-----
From: Markus Jelsma [mailto:markus.jel...@openindex.io]
Sent: Tuesday, January 21, 2014 3:09 PM
To: solr-user@lucene.apache.org
Subject: RE: Indexing URLs from websites

Hi, are you getting PDFs at all? Sounds like a problem with URL filters; those also work on the linkdb. You should also try dumping the linkdb and inspecting it for URLs. Btw, I noticed this is on the Solr list; it's best to open a new discussion on the Nutch user mailing list.

Cheers

Teague James <teag...@insystechinc.com> schreef:

What I'm getting is just the anchor text. In cases where there are multiple anchors I am getting a comma-separated list of anchor text - which is fine. However, I am not getting all of the anchors that are on the page, nor am I getting any of the URLs. The anchors I am getting back never include anchors that lead to documents - which is the primary objective. So on a page that looks something like:

Article 1 text blah blah blah [Read more]
Article 2 text blah blah blah [Read more]
Download the [PDF]

where each [Read more] links to a page where the rest of the article is stored and [PDF] links to a PDF document (these are relative links), what I get back in the anchor field is "[Read more]","[Read more]". I am not getting the "[PDF]" anchor, and I am not getting any of the URLs that those anchors point to - like "/Article 1", "/Article 2", and "/documents/Article 1.pdf". How can I get these URLs?

-----Original Message-----
From: Markus Jelsma [mailto:markus.jel...@openindex.io]
Sent: Monday, January 20, 2014 9:08 AM
To: solr-user@lucene.apache.org
Subject: RE: Indexing URLs from websites

Well, it is hard to get a specific anchor because there is usually more than one. The content of the anchors field should be correct. What would you expect if there are multiple anchors?

-----Original message-----
> From: Teague James <teag...@insystechinc.com>
> Sent: Friday 17th January 2014 18:13
> To: solr-user@lucene.apache.org
> Subject: RE: Indexing URLs from websites
>
> Progress!
>
> I changed the value of that property in nutch-default.xml and I am getting the anchor field now. However, the stuff going in there is a bit random and doesn't seem to correlate to the pages I'm crawling. The primary objective is that when there is something on the page that is a link to a file ...href="/blah/somefile.pdf">Get the PDF!<... (using ... to prevent actual code in the email) I want to capture that URL and the anchor text "Get the PDF!" into field(s).
>
> Am I going in the right direction on this?
>
> Thank you so much for sticking with me on this - I really appreciate your help!
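For reference, this is roughly how both databases can be inspected from the command line in Nutch 1.7; a minimal sketch, assuming the crawl/ layout used in this thread (the dump output directory names are placeholders and must not exist yet):

  # dump the crawldb as text: per-URL fetch status, score and metadata
  bin/nutch readdb crawl/crawldb -dump crawldb_dump

  # dump the linkdb as text: for each target URL, its inlinks and their anchor text
  bin/nutch readlinkdb crawl/linkdb -dump linkdb_dump

  # or look up a single URL in the linkdb
  bin/nutch readlinkdb crawl/linkdb -url http://www.example.com/pdfs/article1.pdf

If a URL and its anchors never show up in the linkdb dump, no solrindex invocation can put them into the anchor field, since that field is filled from the linkdb.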
> -----Original Message-----
> From: Markus Jelsma [mailto:markus.jel...@openindex.io]
> Sent: Friday, January 17, 2014 6:46 AM
> To: solr-user@lucene.apache.org
> Subject: RE: Indexing URLs from websites
>
> -----Original message-----
> > From: Teague James <teag...@insystechinc.com>
> > Sent: Thursday 16th January 2014 20:23
> > To: solr-user@lucene.apache.org
> > Subject: RE: Indexing URLs from websites
> >
> > Okay. I had used that previously and I just tried it again. The following generated no errors:
> >
> > bin/nutch solrindex http://localhost/solr/ crawl/crawldb -linkdb crawl/linkdb -dir crawl/segments/
> >
> > Solr is still not getting an anchor field and the outlinks are not appearing in the index anywhere else.
> >
> > To be sure I deleted the crawl directory and did a fresh crawl using:
> >
> > bin/nutch crawl urls -dir crawl -depth 3 -topN 50
> >
> > Then
> >
> > bin/nutch solrindex http://localhost/solr/ crawl/crawldb -linkdb crawl/linkdb -dir crawl/segments/
> >
> > No errors, but no anchor fields or outlinks. One thing in the response from the crawl that I found interesting was a line that said:
> >
> > LinkDb: internal links will be ignored.
>
> Good catch! That is likely the problem.
>
> > What does that mean?
>
> <property>
>   <name>db.ignore.internal.links</name>
>   <value>true</value>
>   <description>If true, when adding new links to a page, links from
>   the same host are ignored. This is an effective way to limit the
>   size of the link database, keeping only the highest quality
>   links.
>   </description>
> </property>
>
> So change the property, rebuild the linkdb and try reindexing once again :)
>
> > -----Original Message-----
> > From: Markus Jelsma [mailto:markus.jel...@openindex.io]
> > Sent: Thursday, January 16, 2014 11:08 AM
> > To: solr-user@lucene.apache.org
> > Subject: RE: Indexing URLs from websites
> >
> > Usage: SolrIndexer <solr url> <crawldb> [-linkdb <linkdb>] [-params k1=v1&k2=v2...] (<segment> ... | -dir <segments>) [-noCommit] [-deleteGone] [-deleteRobotsNoIndex] [-deleteSkippedByIndexingFilter] [-filter] [-normalize]
> >
> > You must point to the linkdb via the -linkdb parameter.
> >
> > -----Original message-----
> > > From: Teague James <teag...@insystechinc.com>
> > > Sent: Thursday 16th January 2014 16:57
> > > To: solr-user@lucene.apache.org
> > > Subject: RE: Indexing URLs from websites
> > >
> > > Okay. I changed my solrindex to this:
> > >
> > > bin/nutch solrindex http://localhost/solr/ crawl/crawldb crawl/linkdb crawl/segments/20140115143147
> > >
> > > I got the same errors:
> > >
> > > Indexer: org.apache.hadoop.mapred.InvalidInputException:
> > > Input path does not exist: file:/.../crawl/linkdb/crawl_fetch
> > > Input path does not exist: file:/.../crawl/linkdb/crawl_parse
> > > Input path does not exist: file:/.../crawl/linkdb/parse_data
> > > Input path does not exist: file:/.../crawl/linkdb/parse_text
> > > along with a Java stacktrace.
> > >
> > > Those linkdb folders are not being created.
> > >
> > > -----Original Message-----
> > > From: Markus Jelsma [mailto:markus.jel...@openindex.io]
> > > Sent: Thursday, January 16, 2014 10:44 AM
> > > To: solr-user@lucene.apache.org
> > > Subject: RE: Indexing URLs from websites
> > >
> > > Hi - you cannot use wildcards for segments. You need to give one segment or a -dir segments_dir. Check the usage of your indexer command.
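To make the usage line above concrete, these are the two segment forms the indexer accepts; a sketch only, using the crawl/ layout and the segment name already mentioned in this thread:

  # index all segments under crawl/segments, passing the linkdb via -linkdb
  bin/nutch solrindex http://localhost/solr/ crawl/crawldb -linkdb crawl/linkdb -dir crawl/segments

  # or name one segment explicitly
  bin/nutch solrindex http://localhost/solr/ crawl/crawldb -linkdb crawl/linkdb crawl/segments/20140115143147

Passing crawl/linkdb as a bare argument (without -linkdb) makes the indexer treat it as a segment, which is why it then looks for crawl_fetch, crawl_parse, parse_data and parse_text directories inside the linkdb and fails with "Input path does not exist".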
> > > -----Original message-----
> > > > From: Teague James <teag...@insystechinc.com>
> > > > Sent: Thursday 16th January 2014 16:43
> > > > To: solr-user@lucene.apache.org
> > > > Subject: RE: Indexing URLs from websites
> > > >
> > > > Hello Markus,
> > > >
> > > > I do get a linkdb folder in the crawl folder that gets created - but it is created automatically by Nutch at the time that I execute the command. I just tried to use solrindex against yesterday's crawl and did not get any errors, but did not get the anchor field or any of the outlinks. I used this command:
> > > >
> > > > bin/nutch solrindex http://localhost/solr/ crawl/crawldb -linkdb crawl/linkdb crawl/segments/*
> > > >
> > > > I then tried:
> > > >
> > > > bin/nutch solrindex http://localhost/solr/ crawl/crawldb crawl/linkdb crawl/segments/*
> > > >
> > > > This produced the following errors:
> > > >
> > > > Indexer: org.apache.hadoop.mapred.InvalidInputException:
> > > > Input path does not exist: file:/.../crawl/linkdb/crawl_fetch
> > > > Input path does not exist: file:/.../crawl/linkdb/crawl_parse
> > > > Input path does not exist: file:/.../crawl/linkdb/parse_data
> > > > Input path does not exist: file:/.../crawl/linkdb/parse_text
> > > > along with a Java stacktrace.
> > > >
> > > > So I tried invertlinks as you had previously suggested. No errors, but the above missing directories were not created. Running the same solrindex command as above after that produced the same errors.
> > > >
> > > > When/how are the missing directories supposed to be created?
> > > >
> > > > I really appreciate the help! Thank you very much!
> > > >
> > > > -----Original Message-----
> > > > From: Markus Jelsma [mailto:markus.jel...@openindex.io]
> > > > Sent: Thursday, January 16, 2014 5:45 AM
> > > > To: solr-user@lucene.apache.org
> > > > Subject: RE: Indexing URLs from websites
> > > >
> > > > -----Original message-----
> > > > > From: Teague James <teag...@insystechinc.com>
> > > > > Sent: Wednesday 15th January 2014 22:01
> > > > > To: solr-user@lucene.apache.org
> > > > > Subject: Re: Indexing URLs from websites
> > > > >
> > > > > I am still unsuccessful in getting this to work. My expectation is that the index-anchor plugin should produce values for the field anchor. However, this field is not showing up in my Solr index no matter what I try.
> > > > >
> > > > > Here's what I have in my nutch-site.xml for plugins:
> > > > >
> > > > > <value>protocol-http|urlfilter-regex|parse-html|index-(basic|anchor)|query-(basic|site|url)|indexer-solr|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
> > > > >
> > > > > I am using the schema-solr4.xml from the Nutch package and I added the _version_ field.
> > > > >
> > > > > Here's the command I'm running:
> > > > >
> > > > > bin/nutch crawl urls -solr http://localhost/solr -depth 3 -topN 50
> > > > >
> > > > > The fields that Solr returns are: content, title, segment, boost, digest, tstamp, id, url, and _version_.
> > > > >
> > > > > Note that the url field is the url of the page being indexed and not the url(s) of the documents that may be outlinks on that page. It is the outlinks that I am trying to get into the index.
> > > > >
> > > > > What am I missing? I also tried using the invertlinks command that Markus suggested, but that did not work either, though I do appreciate the suggestion.
> > > >
> > > > That did get you a LinkDB, right? You need to call solrindex and use the linkdb's location as part of the arguments; only then does Nutch know about it and use the data contained in the LinkDB together with the index-anchor plugin to write the anchor field in your Solr index.
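A minimal sketch of that sequence, assuming db.ignore.internal.links is now false (whether overridden in nutch-site.xml or edited in nutch-default.xml as described earlier) and the crawl/ layout used in this thread:

  # optional: remove the old linkdb so it is rebuilt from scratch
  rm -r crawl/linkdb

  # (re)build the linkdb from the fetched segments; with the property set to false,
  # same-host links and their anchor text are now kept
  bin/nutch invertlinks crawl/linkdb -dir crawl/segments

  # reindex, pointing the indexer at the linkdb so index-anchor can fill the anchor field
  bin/nutch solrindex http://localhost/solr/ crawl/crawldb -linkdb crawl/linkdb -dir crawl/segments

With the property at its default of true, invertlinks drops same-host links ("LinkDb: internal links will be ignored."), so relative links such as /documents/Article 1.pdf never reach the linkdb and never produce an anchor.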
> > > > > Any help is appreciated! Thanks!
> > > > >
> > > > > <Markus Jelsma> Wrote:
> > > > > You need to use the invertlinks command to build a database with docs with inlinks and anchors. Then use the index-anchor plugin when indexing. Then you will have a multivalued field with anchors pointing to your document.
> > > > >
> > > > > <Teague James> Wrote:
> > > > > I am trying to index a website that contains links to documents such as PDF, Word, etc. The intent is to be able to store the URLs for the links to the documents.
> > > > >
> > > > > For example, when indexing www.example.com, which has links on the page like "Example Document" pointing to www.example.com/docs/example.pdf, I want Solr to store the text of the link, "Example Document", and the URL for the link, "www.example.com/docs/example.pdf", in separate fields. I've tried using Nutch 1.7 with Solr 4.6.0 and have successfully indexed the page content, but I am not getting the URLs from the links. There are no document type restrictions in Nutch for PDF or Word. Any suggestions on how I can accomplish this? Should I use a different method than Nutch for crawling the site?
> > > > >
> > > > > I appreciate any help on this!
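Once the linkdb is in place and index-anchor is enabled, the link URL and the link text do not land on the page that contains the link; they land on the document for the link's target (its url field plus a multivalued anchor field). A quick way to verify this after reindexing is to query Solr for documents that have anchors; a sketch, assuming the same Solr URL used above and that the anchor field from schema-solr4.xml is stored:

  # list indexed documents that have at least one anchor, showing the target url and its anchors
  curl "http://localhost/solr/select?q=anchor:%5B*%20TO%20*%5D&fl=url,anchor&rows=20&wt=json&indent=true"

If the PDF and Word URLs themselves are excluded by the URL filters (regex-urlfilter.txt), there is no target document to attach the anchors to, so those filters also need to let the document URLs through.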