On Tue, 30 Nov 2004 16:33:11 +0100 sam <[EMAIL PROTECTED]> wrote:

> I'm looking for a way to crawl only for a given language.
> The subject is a pretty big domain located on different servers.
> There are mostly two languages available and I want to index only one of 
> them.
> As I don't have any influence over how they get saved, and don't even know 
> most cases yet, I hoped there would be a way to have the crawler find out 
> about the language and store only English or only German content in the db.

If the documents are normal text and the two languages share very few
words, you could use an English dictionary as bad-words for German, and
vice versa.  But within one organisation there are probably many shared
words, for example products, people, locations; so that approach would
only be partly successful.
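
As a rough sketch of that first idea (untested against a real htdig
setup): take the word list of the language you do not want and drop
every word that also appears in the language you do want, so shared
names survive.  The function name and the tiny word sets are made up for
illustration, and I am assuming the bad-words file (the one the
bad_word_list attribute points at, as far as I recall) holds one word
per line.

```python
def build_bad_words(unwanted_vocab, wanted_vocab):
    """Return words found only in the unwanted language, sorted.

    Words shared by both vocabularies (product names, people,
    locations) are kept out of the list so they stay searchable.
    """
    return sorted(set(unwanted_vocab) - set(wanted_vocab))


def write_bad_words(path, words):
    """Write one word per line (assumed bad-words file layout)."""
    with open(path, "w", encoding="utf-8") as fh:
        fh.write("\n".join(words) + "\n")


# Hypothetical miniature vocabularies, for illustration only.
ENGLISH_WORDS = {"the", "and", "hand", "linux"}
GERMAN_WORDS = {"und", "der", "hand", "linux"}

# Bad words for a German-only index: English words German does not share.
BAD_FOR_GERMAN = build_bad_words(ENGLISH_WORDS, GERMAN_WORDS)
```

With real dictionaries the shared part can be large, which is exactly
why this only partly works.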

Another way to do it would be to have an independent spider process that
crawls the whole tree and compares some words taken from the document to
the two dictionaries and decides what language it is.  Then it builds
two new trees of fake documents with the "wrong" links taken out and
tells htdig to index those, using htsearch's url remapping to make the
final results page point to the correct locations.  I have used this
technique, but only for a site of a few hundred documents.  Of course in
computing terms you now have two copies of the same data, and have to
make a lot of effort to keep them in step -- and need lots of disk space.
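
The language decision in such a spider could look roughly like the
sketch below.  It is not the code I used; the function name, the sample
size and the miniature dictionaries are assumptions for illustration.
It counts how many sampled words appear in each dictionary and refuses
to decide on a tie or when nothing matches.

```python
import re


def guess_language(text, dictionaries, sample_size=200):
    """Guess which dictionary best matches the first words of text.

    dictionaries maps a language name to a set of lowercase words.
    Returns the best-matching name, or None if there is no clear winner.
    """
    # Take the first sample_size runs of letters, lowercased.
    words = re.findall(r"[^\W\d_]+", text.lower())[:sample_size]
    scores = {
        name: sum(1 for w in words if w in vocab)
        for name, vocab in dictionaries.items()
    }
    ranked = sorted(scores.values(), reverse=True)
    # No hits at all, or a tie for first place: stay undecided.
    if ranked[0] == 0 or (len(ranked) > 1 and ranked[0] == ranked[1]):
        return None
    return max(scores, key=scores.get)


# Hypothetical miniature dictionaries; real ones would be full word lists.
DICTS = {
    "english": {"the", "and", "is", "of", "this", "page", "with"},
    "german": {"der", "die", "das", "und", "ist", "mit", "diese", "seite"},
}
```

Documents that come back as None (mostly names, or mixed content) would
need a rule of their own, for example keeping them in both trees.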

Alternatively, you could run the htdig indexing through a filtering
proxy server configured to reject pages based on a dictionary.  There
must be many net-nanny type filtering proxies that might do the job.


The final choice might depend on just how many documents you have, and
how frequently they change, and even on whether one file might have its
language suddenly changed.  And how much time (or budget) you can spend
on writing code.



Mike
-- 
Mike Causer                          Email - mailto:[EMAIL PROTECTED]
GPG KeyID 1C2DDA07                       WWW - http://www.mikecauser.com
Flood the fen again! - Wicken Fen enlargement - http://www.wicken.org.uk


_______________________________________________
ht://Dig general mailing list: <[EMAIL PROTECTED]>
ht://Dig FAQ: http://htdig.sourceforge.net/FAQ.html
List information (subscribe/unsubscribe, etc.)
https://lists.sourceforge.net/lists/listinfo/htdig-general
