On Tue, 30 Nov 2004 16:33:11 +0100 sam <[EMAIL PROTECTED]> wrote:

> I'm looking for a way to crawl only for a given language.
> The subject is a pretty big domain located on different servers.
> There are mostly two languages available and I want to index only one of
> them.
> As I don't have any influence over how they get saved, and don't even know
> most cases yet, I hoped there would be a way to have the crawler find out
> about the language and store only English or only German content in the db.
If the documents are normal text and the two languages share very few
words, you could use an English dictionary as the bad-words list for German,
and vice versa. But within one organisation there are probably many shared
words, for example products, people and locations, so that approach would
only be partly successful.

Another way to do it would be to have an independent spider process that
crawls the whole tree, compares some words taken from each document to the
two dictionaries, and decides what language it is. Then it builds two new
trees of fake documents with the "wrong" links taken out and tells htdig to
index those, using htsearch's URL remapping to make the final results page
point to the correct locations. I have used this technique, but only for a
site of a few hundred documents. Of course, in computing terms you now have
two copies of the same data, you have to make a lot of effort to keep them
in step, and you need lots of disk space.

You could also make the htdig indexing work through a filtering proxy
server, and perhaps configure that to reject documents based on a
dictionary. There must be many net-nanny type filtering proxies that might
do the job.

The final choice might depend on just how many documents you have, how
frequently they change, and even on whether one file might have its
language suddenly changed. And on how much time (or budget) you can spend
on writing code.

Mike
--
Mike Causer
Email - mailto:[EMAIL PROTECTED]    GPG KeyID 1C2DDA07
WWW   - http://www.mikecauser.com
Flood the fen again! - Wicken Fen enlargement - http://www.wicken.org.uk
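
PS: the "compare some words to the two dictionaries" step could look roughly
like the sketch below. This is only an illustration under assumptions: the
function name guess_language and the two tiny word lists are hypothetical
stand-ins, and a real spider would load proper dictionaries and leave out
any word that occurs in both languages.

```python
import re

# Tiny illustrative word lists (assumptions, not real dictionaries).
# Words that occur in both languages, such as "in", are deliberately left out.
ENGLISH_WORDS = {"the", "and", "is", "of", "to", "that", "it", "with", "for", "this"}
GERMAN_WORDS = {"der", "die", "das", "und", "ist", "nicht", "mit", "für", "ein", "zu"}


def guess_language(text, english=ENGLISH_WORDS, german=GERMAN_WORDS):
    """Return 'en', 'de', or None when neither language has more hits."""
    # Split into lowercase runs of letters; digits and punctuation are dropped.
    words = re.findall(r"[^\W\d_]+", text.lower())
    en_hits = sum(1 for w in words if w in english)
    de_hits = sum(1 for w in words if w in german)
    if en_hits == de_hits:
        return None  # undecided: no hits at all, or an exact tie
    return "en" if en_hits > de_hits else "de"
```

A document that comes back as None (only product names, say) would still
need a fallback rule, for example keeping it in both trees.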

