[PHP] Detecting Binaries

2004-02-22 Thread Axel IS Main
I'm using file_get_contents() to open URLs. Does anyone know if there is 
a way to look at the result and determine if the file is binary? I'd 
like to be able to block binaries from being processed without having to 
try to think of all the possible binary extensions and omit them with a 
function that looks for these extensions.

Nick

--
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php


Re: [PHP] Detecting Binaries

2004-02-23 Thread Axel IS Main
Guys, this isn't THAT stupid of a question, is it? From my perspective, 
the way PHP seems to see it is that I should already know what kind of 
file I'm looking at. In most cases that's not an unreasonable 
assumption. Unfortunately, that's only good for most cases. PHP is rich 
in ways to work with the HTTP protocol, but has no way of detecting 
whether it's opening a text file or a binary file. To me this is a 
glaring omission. There has to be a way to do it, even if it's a 
roundabout or backdoor kind of way. Nothing is impossible.

Nick




Re: [PHP] Detecting Binaries

2004-02-23 Thread Axel IS Main
Yes, and in fact that is what I am doing now. This is a spider bot 
though, so I'm having to think of every single type of binary file that 
could be linked to on the web. So far I'm up to 28 with no end in sight. 
What about a .com file? I can't omit links that end in .com, can I? That 
would be counterproductive, to say the least. Also, the function that 
does the checking just keeps getting longer and longer, which makes the 
spider go slower and slower. Granted, the thing is pretty fast if it has 
enough bandwidth to work with, but still. This could eventually turn into a 
script killer. Detecting whether the stream from file_get_contents(), or 
fopen() for that matter, is binary or not and going with that result is 
the elegant solution to this problem. There has to be a way to do it.

Nick

Adam Voigt wrote:

Couldn't you just check the extension on the file?
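
A minimal sketch of the extension check suggested above, assuming a current PHP version. The helper name has_binary_extension() and the extension list are hypothetical, not taken from the thread. It looks only at the path part of the URL, so a host name ending in .com is not mistaken for a .com file, and it uses array keys for the lookup so that a longer list does not slow the check down.

```php
<?php
// Hypothetical helper: true when the URL's path ends in a listed extension.
// $extensionSet is an array whose KEYS are lowercase extensions.
function has_binary_extension(string $url, array $extensionSet): bool
{
    // Only the path is inspected, never the host name.
    $path = parse_url($url, PHP_URL_PATH);
    if (!is_string($path)) {
        return false; // no path component, or the URL could not be parsed
    }
    $ext = strtolower(pathinfo($path, PATHINFO_EXTENSION));
    // isset() on a key is a constant-time lookup, however long the list gets.
    return $ext !== '' && isset($extensionSet[$ext]);
}

// Illustrative list only; a real spider would need many more entries.
$extensionSet = array_flip(['jpg', 'gif', 'mp3', 'wav', 'pdf', 'exe', 'com']);

var_dump(has_binary_extension('http://example.com/', $extensionSet));              // bool(false)
var_dump(has_binary_extension('http://example.com/files/tool.com', $extensionSet)); // bool(true)
var_dump(has_binary_extension('http://example.com/song.MP3?dl=1', $extensionSet));  // bool(true)
```

This still cannot catch binaries served from URLs without a telling extension, which is the limitation the rest of the thread is about.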




Re: [PHP] Detecting Binaries

2004-02-23 Thread Axel IS Main
Thanks, that's very helpful. It beats the heck out of doing it the way 
I've been doing it.

Richard Davey wrote:

Hello Axel,

Monday, February 23, 2004, 7:38:25 PM, you wrote:

AIM> Thanks, you just gave me the solution, I think. I don't have to strip
AIM> out every character above standard ascii, I just have to look for them.
AIM> If one is there, then just get rid of it. It's true that an OS can't
AIM> tell the difference between a jpg and an exe file, but that's to be
AIM> expected. But the file_get_contents() function DOES open the file. Since
AIM> there is a definite difference between a text file and a binary file, it
AIM> should be able to detect that.

The difference isn't as obvious as you might think. Opening a binary
file in a hex editor will show you this. Your brain can determine if
the codes in front of you are "English" or not, but from a pure logic
point of view that's a little harder.

Also bear in mind that on Unix ALL files are binary files. It is up to
you to determine the type of the file contents as you see fit. For
example, you can check for line-terminated data.

It would be wise to check for characters from 0 to 31; if they appear,
then it's almost certainly (but not guaranteed) binary.
 



Re: [PHP] Detecting Binaries

2004-02-23 Thread Axel IS Main
That's not bad, but I found a way to do it simply using chr() and 
passing it a value. It turns out that if I go 0-31, almost nothing will 
get through. Even the simplest HTML has something in there from that 
list. However, by just looking between 14 and 26, one more than carriage 
return and one less than escape, it worked really well. I crawled a 
site with a large number of jpg, gif, mp3, wav, and pdf files. Of the 
hundreds of binaries there, only one pdf got through. Not a bad record. I 
also found that in order for this to work I have to process the URLs. 
This makes things really slow, so I'm going to have to use both this and 
the "check for extension" function together. Still, I can worry a lot 
less about getting my index weighed down by binary files. The code is 
pretty basic at this point, but here it is:

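   // Check for binaries: look for any control character from chr(14)
   // to chr(26) in the fetched content ($read).
   $ckbin = 14;
   while ($ckbin <= 26) {
      $ck = chr($ckbin);
      $cbin = substr_count($read, $ck);
      if ($cbin > 0) {
         echo "Killing off binary file URL: $url\n";
         // Remove the URL from the search index.
         $kill = mysql_unbuffered_query("DELETE FROM search WHERE url_id='$url_id'");
         // Skip to the next iteration of the enclosing loop (not shown).
         continue 2;
      }
      ++$ckbin;
   }

I know it looks kind of funky out of context, but it works really great.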
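
The same check can be sketched as a standalone function, assuming a current PHP version. The name contains_control_bytes() is hypothetical. It tests for the byte range 14 to 26 with a single regular expression instead of thirteen substr_count() calls, and leaves it to the caller to decide what to do with the URL.

```php
<?php
// Hypothetical helper: true if $data contains any byte from chr(14) to chr(26),
// the range the message above settled on.
function contains_control_bytes(string $data): bool
{
    // \x0E is chr(14) and \x1A is chr(26). Without the /u modifier the
    // pattern is matched byte by byte, so arbitrary binary data is safe.
    return preg_match('/[\x0E-\x1A]/', $data) === 1;
}

$html = "<html>\r\n\t<body>Hello</body>\r\n</html>\n";
$fake = "GIF89a" . chr(16) . chr(0) . chr(200);

var_dump(contains_control_bytes($html)); // bool(false)
var_dump(contains_control_bytes($fake)); // bool(true)
```

As the crawl described above shows, this is a heuristic: a binary file that happens to contain none of these thirteen byte values will still get through.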

Nick

Richard Davey wrote:

Hello Evan,

Monday, February 23, 2004, 8:57:43 PM, you wrote:

 

It would be wise to check for characters from 0 to 31, if they appear
then it's almost certainly (but not guaranteed) binary.

EN> Assuming that's decimal, you're including 0x09 0x0a and 0x0d which are,
EN> respectively, tab, line feed, and carriage return. That's off the top of my
EN> head, which means two things: (1) I may be forgetting something, and (2) I
EN> need a life ;)

Let me rephrase - check for the existence of characters 0 through 31
and count how many there are. Set a percentage weight yourself and
figure out in your script if you deem the quantity too many or too
few.

The count_chars() function will be absolutely ideal for this.
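
A minimal sketch of that weighting idea using count_chars(), assuming a current PHP version. The function names, the decision to leave tab, line feed, and carriage return out of the count, and the 5% threshold are all assumptions made for illustration; the thread does not specify them.

```php
<?php
// Hypothetical helper: fraction of bytes in $data that are control characters
// (0..31). Tab (9), line feed (10), and carriage return (13) are NOT counted,
// which is an assumption based on the objection quoted above.
function control_char_ratio(string $data): float
{
    $length = strlen($data);
    if ($length === 0) {
        return 0.0;
    }
    // Mode 1 returns only the byte values that occur, keyed by byte value.
    $counts = count_chars($data, 1);
    $control = 0;
    foreach ($counts as $byte => $count) {
        if ($byte < 32 && $byte !== 9 && $byte !== 10 && $byte !== 13) {
            $control += $count;
        }
    }
    return $control / $length;
}

// Hypothetical helper: the 5% default threshold is arbitrary and would need
// tuning against real crawl data.
function looks_binary(string $data, float $threshold = 0.05): bool
{
    return control_char_ratio($data) > $threshold;
}

var_dump(looks_binary("Hello\r\n\tworld\n"));              // bool(false)
var_dump(looks_binary(str_repeat("\x00\x01\x02A", 10)));   // bool(true)
```

Compared with rejecting a file on the first control character found, a ratio tolerates the occasional stray byte in otherwise plain text.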

 



[PHP] Interesting phpversion() thing.

2004-03-30 Thread Axel IS Main
I just upgraded to 4.3.5. I double checked and made sure I put 
everything in the right place. If I run php or php-cli from the command 
line and the script has phpversion() in it, it returns the correct 
string, i.e. 4.3.5. If, however, I pull the same script up in a browser 
it gives me 4.3.4. I've tried everything, clearing caches, etc. Can't 
seem to get it to do what I expect. Anyone else see this?

Nick
