On Thursday, November 21, 2002, at 01:21 PM, Wanrong Qiu wrote:
I have used htdig for our intranet indexing for a couple of years. I use incremental
indexing in order to avoid re-indexing and to save indexing time. But recently I have
found more and more outdated pages getting indexed: pages that actually have
no references in any other pages but have not been deleted from the file system. Is there any way
to ask htdig not to index those pages and, even better, to get rid of them from the
database, while still avoiding the -i flag to start a completely new dig?
I use htdig 3.1.6 on Solaris 2.8.

Under normal circumstances such documents should not be reindexed, unless they have changed or result in misleading information regarding when they were last modified. If they are being reindexed and you have a list of such files that you want to exclude, then you might want to take a look at the exclude_urls attribute:

http://www.htdig.org/attrs.html#exclude_urls
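
As a rough sketch of what that could look like in htdig.conf: exclude_urls takes a space-separated list of patterns, and any URL containing one of them is rejected. The /cgi-bin/ and .cgi entries are the stock defaults; the other two paths are made-up placeholders standing in for your own list of stale pages.

```
# htdig.conf fragment (sketch only; the last two patterns are hypothetical)
# Any URL that contains one of these substrings is rejected by htdig.
exclude_urls:	/cgi-bin/ .cgi /old-projects/ /staff/retired_page.html
```

This only covers the "don't index them" half of the question, and it assumes the stale pages can be described by a handful of path fragments.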
Otherwise, I don't think there is much that you can do short of rebuilding the databases. I don't believe the databases maintain sufficient information to determine which documents are no longer referenced, and it would therefore be necessary to reindex all pages in order to figure out what is and isn't referenced.
Jim

