...@openindex.io]
Sent: Tuesday, January 21, 2014 3:09 PM
To: solr-user@lucene.apache.org
Subject: RE: Indexing URLs from websites
Hi, are you getting pdfs at all? Sounds like a problem with url filters, those
also work on the linkdb. You should also try dumping the linkdb and inspect it
for urls
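A minimal sketch of dumping the linkdb for inspection, assuming the crawl/linkdb path used elsewhere in this thread. The output directory name linkdb_dump is hypothetical, and the part-* file names are an assumption about how the dump is written in local mode.

```shell
# Linkdb path as used elsewhere in this thread; adjust to your crawl.
LINKDB="crawl/linkdb"
# Hypothetical output directory for the text dump.
DUMP_DIR="linkdb_dump"

CMD="bin/nutch readlinkdb $LINKDB -dump $DUMP_DIR"
echo "$CMD"

# Run it only when executed from a Nutch runtime directory.
if [ -x bin/nutch ]; then
  $CMD
  # Look for PDF URLs in the dump (file naming assumed to be part-*).
  grep -i '\.pdf' "$DUMP_DIR"/part-* || echo "no PDF URLs found in the linkdb dump"
fi
```

If no PDF URLs show up in the dump, that would point back at the URL filters rather than at the indexing step.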
"/Article 2", and
"/documents/Article 1.pdf"
How can I get these URLs?
-Original Message-
From: Markus Jelsma [mailto:markus.jel...@openindex.io]
Sent: Monday, January 20, 2014 9:08 AM
To: solr-user@lucene.apache.org
Subject: RE: Indexing URLs from websites
Well it is
solr-user@lucene.apache.org
> Subject: RE: Indexing URLs from websites
>
> Progress!
>
> I changed the value of that property in nutch-default.xml and I am getting
> the anchor field now. However, the stuff going in there is a bit random and
> doesn't seem to correlate to
with me on this - I really appreciate your help!
-Original Message-
From: Markus Jelsma [mailto:markus.jel...@openindex.io]
Sent: Friday, January 17, 2014 6:46 AM
To: solr-user@lucene.apache.org
Subject: RE: Indexing URLs from websites
-Original message-
> From:Teague James
> Sent: Thursday 16th January 2014 20:23
> To: solr-user@lucene.apache.org
> Subject: RE: Indexing URLs from websites
>
> Okay. I had used that previously and I just tried it again. The following
> generated no e
Subject: RE: Indexing URLs from websites
Usage: SolrIndexer <solr url> <crawldb> [-linkdb <linkdb>] [-params
k1=v1&k2=v2...] (<segment> ... | -dir <segments>) [-noCommit] [-deleteGone]
[-deleteRobotsNoIndex] [-deleteSkippedByIndexingFilter] [-filter] [-normalize]
You must point to the linkdb via the -linkdb parameter.
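As a sketch of what that looks like, the command below reuses the Solr URL, crawldb, linkdb, and segment paths quoted elsewhere in this thread and adds the explicit -linkdb flag. The variable names are only for illustration, and the command is merely printed unless bin/nutch exists in the current directory.

```shell
# Paths taken from the solrindex command quoted in this thread; adjust to your crawl.
SOLR_URL="http://localhost/solr/"
CRAWLDB="crawl/crawldb"
LINKDB="crawl/linkdb"
SEGMENT="crawl/segments/20140115143147"

# The linkdb is passed with an explicit -linkdb flag, not as a bare positional argument.
CMD="bin/nutch solrindex $SOLR_URL $CRAWLDB -linkdb $LINKDB $SEGMENT"
echo "$CMD"

# Run it only when executed from a Nutch runtime directory.
if [ -x bin/nutch ]; then
  $CMD
fi
```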
-Original mes
> Sent: Thursday 16th January 2014 16:57
> To: solr-user@lucene.apache.org
> Subject: RE: Indexing URLs from websites
>
> Okay. I changed my solrindex to this:
>
> bin/nutch solrindex http://localhost/solr/ crawl/crawldb crawl/linkdb
> crawl/segments/20140115143147
>
[mailto:markus.jel...@openindex.io]
Sent: Thursday, January 16, 2014 10:44 AM
To: solr-user@lucene.apache.org
Subject: RE: Indexing URLs from websites
Hi - you cannot use wildcards for segments. You need to give one segment or a
-dir segments_dir. Check the usage of your indexer command
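For illustration, a sketch of the -dir form, assuming the segments live under crawl/segments as in the paths quoted in this thread. The variable names are hypothetical, and the command is only printed unless bin/nutch exists in the current directory.

```shell
# Paths follow the ones quoted in this thread; adjust to your crawl.
SOLR_URL="http://localhost/solr/"
CRAWLDB="crawl/crawldb"
LINKDB="crawl/linkdb"
# Parent directory holding all segments, given via -dir instead of a wildcard.
SEGMENTS_DIR="crawl/segments"

CMD="bin/nutch solrindex $SOLR_URL $CRAWLDB -linkdb $LINKDB -dir $SEGMENTS_DIR"
echo "$CMD"

# Run it only when executed from a Nutch runtime directory.
if [ -x bin/nutch ]; then
  $CMD
fi
```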
ve this one produced the same errors.
>
> When/How are the missing directories supposed to be created?
>
> I really appreciate the help! Thank you very much!
>
> -Original Message-
> From: Markus Jelsma [mailto:markus.jel...@openindex.io]
> Sent
very much!
-Original Message-
From: Markus Jelsma [mailto:markus.jel...@openindex.io]
Sent: Thursday, January 16, 2014 5:45 AM
To: solr-user@lucene.apache.org
Subject: RE: Indexing URLs from websites
-Original message-
> From:Teague James
> Sent: Wednesday 15th January 2014 22:01
> To: solr-user@lucene.apache.org
> Subject: Re: Indexing URLs from websites
>
I am still unsuccessful in getting this to work. My expectation is that the
index-anchor plugin should produce values for the field anchor. However this
field is not showing up in my Solr index no matter what I try.
Here's what I have in my nutch-site.xml for plugins:
protocol-http|urlfilter-regex
You could use something like Apache Droids -
http://incubator.apache.org/droids/
Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/
On Tue, Jan 7, 2014 at 2:27 PM, Teague James wrote:
I am trying to index a website that contains links to documents such as PDF,
Word, etc. The intent is to be able to store the URLs for the links to the
documents.
For example, when indexing www.example.com which has links on the page like
"Example Document" which points to www.example.com/docs/ex