[
https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16716761#comment-16716761
]
Tim Steenbeke commented on CONNECTORS-1562:
-------------------------------------------
[[email protected]] I have a URL with the full sitemap that has to be
crawled (and a full exclude sitemap).
If I use this URL as the seed, do I have to set the hop filters to any value (e.g.
redirect:0 and link:1)?
If one or multiple links are deleted from this sitemap, will the corresponding
documents be deleted from ES?
How should I set up the job so that only the pages listed in the sitemap are kept?
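To make the deletion question concrete: given two snapshots of the seed sitemap, the URLs that disappeared between them are the documents one would expect the cleanup pass to remove from the index. A minimal standard-library sketch (the sitemap XML below is hypothetical sample data, not taken from the actual site):

```python
import xml.etree.ElementTree as ET

# Standard sitemap namespace per sitemaps.org
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def sitemap_urls(xml_text):
    """Return the set of <loc> URLs listed in a sitemap document."""
    root = ET.fromstring(xml_text)
    return {loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc")}

# Hypothetical before/after snapshots of the seed sitemap
old_xml = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/a</loc></url>
  <url><loc>https://example.com/b</loc></url>
</urlset>"""
new_xml = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/a</loc></url>
</urlset>"""

# URLs removed from the sitemap -- the documents the cleanup pass
# would be expected to delete from the Elasticsearch index
removed = sitemap_urls(old_xml) - sitemap_urls(new_xml)
print(sorted(removed))  # ['https://example.com/b']
```

Whether ManifoldCF actually deletes these depends on the hopcount handling described in the issue below; the diff only shows which documents should become unreachable.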
> Documents unreachable due to hopcount are not considered unreachable on
> cleanup pass
> ------------------------------------------------------------------------------------
>
> Key: CONNECTORS-1562
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1562
> Project: ManifoldCF
> Issue Type: Bug
> Components: Elastic Search connector, Web connector
> Affects Versions: ManifoldCF 2.11
> Environment: Manifoldcf 2.11
> Elasticsearch 6.3.2
> Web input connector
> Elastic output connector
> Job crawls website input and outputs content to elastic
> Reporter: Tim Steenbeke
> Assignee: Karl Wright
> Priority: Critical
> Labels: starter
> Original Estimate: 4h
> Remaining Estimate: 4h
>
> My documents aren't removed from the Elasticsearch index after rerunning the
> changed seeds.
> I update my job to change the seed map and rerun it, or use the scheduler to
> keep it running even after updating it.
> After the rerun, the unreachable documents don't get deleted.
> It only adds documents when they can be reached.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)