RE: Indexing URLs from websites

2014-01-22 Thread Teague James
...@openindex.io] Sent: Tuesday, January 21, 2014 3:09 PM To: solr-user@lucene.apache.org Subject: RE: Indexing URLs from websites Hi, are you getting pdfs at all? Sounds like a problem with url filters, those also work on the linkdb. You should also try dumping the linkdb and inspect it for urls
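The suggestion above to dump the linkdb and inspect it can be sketched as follows. This is a minimal sketch, assuming a Nutch 1.x install, the crawl/linkdb path used elsewhere in this thread, and a hypothetical output directory named linkdb_dump. By default it only prints the command; set NUTCH=bin/nutch to actually run it.

```shell
# Dry run by default: prints the command instead of executing it.
# Set NUTCH=bin/nutch (from the Nutch install directory) to run it for real.
NUTCH="${NUTCH:-echo bin/nutch}"

# Dump the linkdb ($1) as text into an output directory ($2).
# "linkdb_dump" below is a hypothetical name chosen for illustration.
dump_linkdb() {
  $NUTCH readlinkdb "$1" -dump "$2"
}

dump_linkdb crawl/linkdb linkdb_dump
```

After a real run, the dump directory should contain plain-text output that can be searched (for example with grep) for the .pdf URLs; if none appear, the URL filters are a likely suspect, as noted above.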

RE: Indexing URLs from websites

2014-01-21 Thread Markus Jelsma
"/Article 2", and "/documents/Article 1.pdf" How can I get these URLs? -Original Message- From: Markus Jelsma [mailto:markus.jel...@openindex.io] Sent: Monday, January 20, 2014 9:08 AM To: solr-user@lucene.apache.org Subject: RE: Indexing URLs from websites Well

RE: Indexing URLs from websites

2014-01-21 Thread Teague James
"/Article 2", and "/documents/Article 1.pdf" How can I get these URLs? -Original Message- From: Markus Jelsma [mailto:markus.jel...@openindex.io] Sent: Monday, January 20, 2014 9:08 AM To: solr-user@lucene.apache.org Subject: RE: Indexing URLs from websites Well it is

RE: Indexing URLs from websites

2014-01-20 Thread Markus Jelsma
solr-user@lucene.apache.org > Subject: RE: Indexing URLs from websites > > Progress! > > I changed the value of that property in nutch-default.xml and I am getting > the anchor field now. However, the stuff going in there is a bit random and > doesn't seem to correlate to

RE: Indexing URLs from websites

2014-01-17 Thread Teague James
ith me on this - I really appreciate your help! -Original Message- From: Markus Jelsma [mailto:markus.jel...@openindex.io] Sent: Friday, January 17, 2014 6:46 AM To: solr-user@lucene.apache.org Subject: RE: Indexing URLs from websites -Original message- > From:Teague James

RE: Indexing URLs from websites

2014-01-17 Thread Markus Jelsma
-Original message- > From:Teague James > Sent: Thursday 16th January 2014 20:23 > To: solr-user@lucene.apache.org > Subject: RE: Indexing URLs from websites > > Okay. I had used that previously and I just tried it again. The following > generated no e

RE: Indexing URLs from websites

2014-01-16 Thread Teague James
: RE: Indexing URLs from websites Usage: SolrIndexer <solr url> <crawldb> [-linkdb <linkdb>] [-params k1=v1&k2=v2...] (<segment> ... | -dir <segments>) [-noCommit] [-deleteGone] [-deleteRobotsNoIndex] [-deleteSkippedByIndexingFilter] [-filter] [-normalize] You must point to the linkdb via the -linkdb parameter. -Original mes
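The advice above is that the linkdb has to be named with -linkdb and cannot be passed as a bare positional argument. A minimal sketch of such an invocation, reusing the Solr URL, crawl paths, and segment name quoted in this thread (they are examples from the poster's setup, not defaults). It prints the command unless NUTCH=bin/nutch is set.

```shell
# Dry run by default: prints the command instead of executing it.
NUTCH="${NUTCH:-echo bin/nutch}"

# Index one segment, passing the linkdb via the -linkdb option.
# $1 = Solr URL, $2 = crawldb, $3 = linkdb, $4 = a single segment directory
index_segment() {
  $NUTCH solrindex "$1" "$2" -linkdb "$3" "$4"
}

index_segment http://localhost/solr/ crawl/crawldb crawl/linkdb \
  crawl/segments/20140115143147
```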

RE: Indexing URLs from websites

2014-01-16 Thread Markus Jelsma
> Sent: Thursday 16th January 2014 16:57 > To: solr-user@lucene.apache.org > Subject: RE: Indexing URLs from websites > > Okay. I changed my solrindex to this: > > bin/nutch solrindex http://localhost/solr/ crawl/crawldb crawl/linkdb > crawl/segments/20140115143147 >

RE: Indexing URLs from websites

2014-01-16 Thread Teague James
[mailto:markus.jel...@openindex.io] Sent: Thursday, January 16, 2014 10:44 AM To: solr-user@lucene.apache.org Subject: RE: Indexing URLs from websites Hi - you cannot use wildcards for segments. You need to give one segment or a -dir segments_dir. Check the usage of your indexer command
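Since wildcards are not accepted for segments, the alternative mentioned above is to hand the indexer the parent directory with -dir. A minimal sketch, assuming the segments live under crawl/segments as in the rest of this thread. It prints the command unless NUTCH=bin/nutch is set.

```shell
# Dry run by default: prints the command instead of executing it.
NUTCH="${NUTCH:-echo bin/nutch}"

# Index every segment under a directory using -dir instead of a wildcard.
# $1 = Solr URL, $2 = crawldb, $3 = linkdb, $4 = directory holding the segments
index_all_segments() {
  $NUTCH solrindex "$1" "$2" -linkdb "$3" -dir "$4"
}

index_all_segments http://localhost/solr/ crawl/crawldb crawl/linkdb crawl/segments
```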

RE: Indexing URLs from websites

2014-01-16 Thread Markus Jelsma
ve this one produced the same errors. > > When/How are the missing directories supposed to be created? > > I really appreciate the help! Thank you very much! > > -Original Message- > From: Markus Jelsma [mailto:markus.jel...@openindex.io] > Sent

RE: Indexing URLs from websites

2014-01-16 Thread Teague James
very much! -Original Message- From: Markus Jelsma [mailto:markus.jel...@openindex.io] Sent: Thursday, January 16, 2014 5:45 AM To: solr-user@lucene.apache.org Subject: RE: Indexing URLs from websites -Original message- > From:Teague James > Sent: Wednesday 15th January

RE: Indexing URLs from websites

2014-01-16 Thread Markus Jelsma
-Original message- > From:Teague James > Sent: Wednesday 15th January 2014 22:01 > To: solr-user@lucene.apache.org > Subject: Re: Indexing URLs from websites > > I am still unsuccessful in getting this to work. My expectation is that the > index-anchor plugin shou

Re: Indexing URLs from websites

2014-01-15 Thread Teague James
I am still unsuccessful in getting this to work. My expectation is that the index-anchor plugin should produce values for the field anchor. However this field is not showing up in my Solr index no matter what I try. Here's what I have in my nutch-site.xml for plugins: protocol-http|urlfilter-regex

Re: Indexing URLs from websites

2014-01-07 Thread Otis Gospodnetic
You could use something like Apache Droids - http://incubator.apache.org/droids/ Otis -- Performance Monitoring * Log Analytics * Search Analytics Solr & Elasticsearch Support * http://sematext.com/ On Tue, Jan 7, 2014 at 2:27 PM, Teague James wrote: > I am trying to index a website that contai

Indexing URLs from websites

2014-01-07 Thread Teague James
I am trying to index a website that contains links to documents such as PDF, Word, etc. The intent is to be able to store the URLs for the links to the documents. For example, when indexing www.example.com which has links on the page like "Example Document" which points to www.example.com/docs/ex