That's not bad, but I found a way to do it simply using chr() and passing it a value. It turns out that if I check 0-31, almost nothing will get through; even the simplest HTML has something from that list in it. However, by looking only at 14 through 26 (one more than carriage return, one less than escape), it worked really well. I crawled a site with a large number of jpg, gif, mp3, wav, and pdf files. Of the hundreds of binaries there, only one pdf got through. Not a bad record.

I also found that in order for this to work I have to process the URLs. This makes things really slow, so I'm going to have to use both this and the "check for extension" function together. Still, I can worry a lot less about getting my index weighed down by binary files. The code is pretty basic at this point, but here it is:

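// Check for binaries: look for any byte from 14 to 26
// (just above carriage return, just below escape).
for ($ckbin = 14; $ckbin <= 26; ++$ckbin) {
    if (substr_count($read, chr($ckbin)) > 0) {
        echo "Killing off binary file URL: $url\n";
        // Drop the page from the index.
        $kill = mysql_unbuffered_query("DELETE FROM search WHERE url_id='$url_id'");
        // Move on to the next URL in the enclosing loop.
        continue 2;
    }
}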
I know it looks kind of funky out of context, but it works really well.
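
For anyone who wants to try the same check outside of a crawler loop, here is a minimal sketch of it as a standalone function. The name has_control_bytes() and its default arguments are made up for illustration; it uses strcspn() rather than a substr_count() loop, and it only tests for the presence of a byte in the 14-26 range.

```php
<?php
// Hypothetical helper (not part of the original script):
// returns true if $data contains any byte from $low to $high inclusive.
function has_control_bytes($data, $low = 14, $high = 26)
{
    // Build the set of bytes to look for, one character per byte value.
    $mask = '';
    for ($i = $low; $i <= $high; $i++) {
        $mask .= chr($i);
    }
    // strcspn() gives the length of the leading run that contains none of
    // the mask bytes; if it is shorter than the string, one was found.
    return strcspn($data, $mask) < strlen($data);
}

// Example: plain HTML with tabs and CRLF passes, a byte 0x10 does not.
var_dump(has_control_bytes("<html>\r\n\t</html>"));
var_dump(has_control_bytes("GIF89a\x10\x00"));
```

Fetching the page is still the slow part, so checking the extension first and only scanning the content when the extension is inconclusive should keep the cost down.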


Nick

Richard Davey wrote:

Hello Evan,

Monday, February 23, 2004, 8:57:43 PM, you wrote:



It would be wise to check for characters from 0 to 31, if they appear
then it's almost certainly (but not guaranteed) binary.



EN> Assuming that's decimal, you're including 0x09 0x0a and 0x0d which are,
EN> respectively, tab, line feed, and carriage return. That's off the top of my
EN> head, which means two things: (1) i may be forgetting something, and (2) I
EN> need a life ;)

Let me rephrase - check for the existence of characters 0 through 31
and count how many there are. Set a percentage weight yourself and
figure out in your script if you deem the quantity too many or too
few.

The count_chars() function will be absolutely ideal for this.
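
A rough sketch of that weighting idea, assuming count_chars() in mode 1 (which returns only the byte values that actually occur). The function names, the choice to ignore tab, line feed, and carriage return, and the 5% threshold are all assumptions for illustration, not something settled in this thread.

```php
<?php
// Hypothetical helper: fraction of bytes in $data that are control
// characters (0-31), not counting tab (9), line feed (10), or CR (13).
function control_char_ratio($data)
{
    $length = strlen($data);
    if ($length === 0) {
        return 0.0;
    }
    // Mode 1: array of byte value => number of occurrences, present bytes only.
    $counts = count_chars($data, 1);
    $control = 0;
    foreach ($counts as $byte => $count) {
        if ($byte < 32 && $byte !== 9 && $byte !== 10 && $byte !== 13) {
            $control += $count;
        }
    }
    return $control / $length;
}

// Assumed threshold: call it binary when more than 5% of bytes are control bytes.
function looks_binary($data, $threshold = 0.05)
{
    return control_char_ratio($data) > $threshold;
}

var_dump(looks_binary("Hello\tworld\r\n"));
var_dump(looks_binary("\x00\x01\x02\x03abcd"));
```

The threshold would need tuning against real pages; a stray control byte in otherwise ordinary HTML stays under it, while most image and audio data should land well above it.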


