[PHP] Detecting Binaries
I'm using file_get_contents() to open URLs. Does anyone know if there is a way to look at the result and determine if the file is binary? I'd like to be able to block binaries from being processed without having to try to think of all the possible binary extensions and omit them with a function that looks for these extensions. Nick -- PHP General Mailing List (http://www.php.net/) To unsubscribe, visit: http://www.php.net/unsub.php
Re: [PHP] Detecting Binaries
Guys, this isn't THAT stupid of a question, is it? From my perspective, the way PHP seems to see it is that I should already know what kind of file I'm looking at. In most cases that's not an unreasonable assumption. Unfortunately, that's only good for most cases. PHP is rich in ways to work with the HTTP protocol, but has no way of detecting whether it's opening a text file or a binary file. To me this is a glaring omission. There has to be a way to do it, even if it's a roundabout or backdoor kind of way. Nothing is impossible. Nick
Re: [PHP] Detecting Binaries
Yes, and in fact that is what I am doing now. This is a spider bot though, so I'm having to think of every single type of binary file that could be linked to on the web. So far I'm up to 28 with no end in sight. What about a .com file? I can't omit links that end in .com, can I? That would be counterproductive to say the least. Also, the function that does the checking just keeps getting longer and longer, which makes the spider go slower and slower. Granted, the thing is pretty fast if it has enough bandwidth to work with, but still. This could eventually turn into a script killer. Detecting whether the stream from file_get_contents(), or fopen() for that matter, is binary or not and going with that result is the elegant solution to this problem. There has to be a way to do it. Nick

Adam Voigt wrote:
> Couldn't you just check the extension on the file?
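One option not raised in the thread is to look at the Content-Type header the server sends, rather than the URL's extension. PHP's HTTP wrapper stores the raw response header lines in `$http_response_header` after a call such as file_get_contents(). The sketch below is only an illustration: the function name `headersIndicateText` is hypothetical, it treats only `text/*` types as text, and it assumes the server labels its content correctly, which is not guaranteed.

```php
<?php
// Hypothetical helper (not from the thread): scans raw response header lines,
// in the format PHP's HTTP wrapper stores in $http_response_header, and
// reports whether the last Content-Type header names a text/* media type.
// Returns null when no Content-Type header is present.
function headersIndicateText(array $headers): ?bool
{
    $result = null;
    foreach ($headers as $line) {
        // Capture the media type, ignoring parameters such as "; charset=...".
        if (preg_match('/^Content-Type:\s*([^;\s]+)/i', $line, $m) === 1) {
            // Later headers overwrite earlier ones, which matters when
            // redirects were followed and several responses are listed.
            $result = stripos($m[1], 'text/') === 0;
        }
    }
    return $result;
}
```

A spider could call this right after fetching a URL and skip the page when the result is false, falling back to a content check when the result is null.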
Re: [PHP] Detecting Binaries
Thanks, that's very helpful. It beats the heck out of doing it the way I've been doing it.

Richard Davey wrote:
Hello Axel,
Monday, February 23, 2004, 7:38:25 PM, you wrote:
AIM> Thanks, you just gave me the solution, I think. I don't have to strip
AIM> out every character above standard ascii, I just have to look for them.
AIM> If one is there, then just get rid of it. It's true that an OS can't
AIM> tell the difference between a jpg and an exe file, but that's to be
AIM> expected. But the file_get_contents() function DOES open the file. Since
AIM> there is a definite difference between a text file and a binary file, it
AIM> should be able to detect that.
The difference isn't as obvious as you might think. Opening a binary file in a hex editor will show you this. Your brain can determine if the codes in front of you are "English" or not, but from a pure logic point of view that's a little harder. Also bear in mind that on Unix ALL files are binary files. It is up to you to determine the type of the file contents as you see fit. For example, you can check for line-terminated data. It would be wise to check for characters from 0 to 31; if they appear, then it's almost certainly (but not guaranteed to be) binary.
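The control-character check suggested above could be sketched roughly as follows. This is an illustration, not code from the thread: the function name `looksBinary` is made up, and the sketch assumes tab (9), line feed (10), and carriage return (13) should be allowed because they occur in ordinary text.

```php
<?php
// Hypothetical helper (not from the thread): returns true when $data contains
// any control byte in the range 0-31 other than tab (9), line feed (10),
// or carriage return (13).
function looksBinary(string $data): bool
{
    // The character class covers bytes 0-8, 11-12, and 14-31.
    return preg_match('/[\x00-\x08\x0B\x0C\x0E-\x1F]/', $data) === 1;
}
```

A single stray control byte is enough to flag the content here, so some text files (for example ones containing a form feed) would be misclassified.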
Re: [PHP] Detecting Binaries
That's not bad, but I found a way to do it simply using chr() and passing it a value. It turns out that if I go 0-31, almost nothing will get through. Even the simplest HTML has something in there from that list. However, by just looking between 14 and 26, one more than carriage return and one less than escape, it worked really well. I crawled a site with a large number of jpg, gif, mp3, wav, and pdf files. Of the hundreds of binaries there, only one pdf got through. Not a bad record. I also found that in order for this to work I have to process the URLs. This makes things really slow, so I'm going to have to use both this and the "check for extension" function together. Still, I can worry a lot less about getting my index weighted down by binary files. The code is pretty basic at this point, but here it is:

    // Check for binaries: look in $read for any character
    // with a code from 14 through 26.
    $ckbin = 14;
    while ($ckbin <= 26) {
        $ck = chr($ckbin);
        $cbin = substr_count($read, $ck);
        if ($cbin > 0) {
            echo "Killing off binary file URL: $url\n";
            $kill = mysql_unbuffered_query("DELETE FROM search WHERE url_id='$url_id'");
            // Jump to the next iteration of the enclosing loop.
            continue 2;
        }
        ++$ckbin;
    }

I know it looks kind of funky out of context, but it works really great. Nick

Richard Davey wrote:
Hello Evan,
Monday, February 23, 2004, 8:57:43 PM, you wrote:
> It would be wise to check for characters from 0 to 31, if they appear then it's almost certainly (but not guaranteed) binary.
EN> Assuming that's decimal, you're including 0x09 0x0a and 0x0d which are,
EN> respectively, tab, line feed, and carriage return. That's off the top of my
EN> head, which means two things: (1) i may be forgetting something, and (2) I
EN> need a life ;)
Let me rephrase - check for the existence of characters 0 through 31 and count how many there are. Set a percentage weight yourself and figure out in your script if you deem the quantity too many or too few. The count_chars() function will be absolutely ideal for this.
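The percentage-weight idea with count_chars() might look something like the sketch below. The function names and the 5% default threshold are assumptions made for illustration, not values from the thread, and tab, line feed, and carriage return are left out of the count because they appear in ordinary text.

```php
<?php
// Hypothetical helper: fraction of bytes in $data that are control characters
// (codes 0-31), not counting tab (9), line feed (10), or carriage return (13).
function controlCharRatio(string $data): float
{
    $length = strlen($data);
    if ($length === 0) {
        return 0.0;
    }

    // Mode 1 returns only the byte values that occur, keyed by byte value.
    $counts = count_chars($data, 1);
    $allowed = [9, 10, 13];
    $control = 0;
    foreach ($counts as $byte => $count) {
        if ($byte < 32 && !in_array($byte, $allowed, true)) {
            $control += $count;
        }
    }
    return $control / $length;
}

// The 5% default is an arbitrary assumption; tune it against real pages.
function isProbablyBinary(string $data, float $threshold = 0.05): bool
{
    return controlCharRatio($data) > $threshold;
}
```

Compared with rejecting on the first control character, a ratio tolerates the occasional stray byte in text, at the cost of having to pick a threshold.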
[PHP] Interesting phpversion() thing.
I just upgraded to 4.3.5. I double checked and made sure I put everything in the right place. If I run php or php-cli from the command line and the script has phpversion() in it, it returns the correct string, i.e. 4.3.5. If, however, I pull the same script up in a browser, it gives me 4.3.4. I've tried everything, clearing caches, etc. Can't seem to get it to do what I expect. Anyone else see this? Nick