According to Geoff Hutchison:
> On Tue, 8 Feb 2000 [EMAIL PROTECTED] wrote:
> 
> >  > given that rest is char * and not const char *?  What will (String) =
> >  > (char *) default to?
> > 
> >  No it's not. The opposite would be, barking when (String) = (const char*).
> 
> Uhm. I hope I'm not the only one confused by your statement Loic. Are you
> saying that this code should be OK?

I'm pretty sure he meant the code is OK at that point.

> If so, someone else please take a look at that section of Server.cc. I
> don't see why it would present "/foobar/" instead of "/cat/|/foobar/"
> unless something is fouling up that conditional.

Found it!

>Trying to retrieve robots.txt file
>Parsing robots.txt file using myname = htdig
>Found 'user-agent' line: htdig
>Found 'disallow' line: /cat/
>Found 'user-agent' line: htdig
>Found 'disallow' line: /foobar/
>Pattern: /foobar/

The problem is the second User-agent line, which causes the previous
pattern to be cleared.  I don't know whether the robots.txt file or the
parsing is incorrect, but one or the other has to change.  The code expects
that a line of "User-agent: htdig" will begin a new User-agent section
which will override any previous User-agent sections.  If it's correct
form to have multiple User-agent sections for a given User-agent, then
the code is wrong.  If the standard requires that all Disallow entries
for one User-agent fall under a single User-agent heading, then the
file above is incorrect.
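
To make the failure concrete, here is a much-simplified model of the
behaviour described above.  It is only a sketch, not the code from
Server.cc: the function name pattern_with_override, the (field, value)
line representation and the exact-match name test are all assumptions
made for illustration.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical model: every matching User-agent line starts a new
// section and discards whatever pattern was collected before it.
std::string pattern_with_override(
    const std::vector<std::pair<std::string, std::string>>& lines,
    const std::string& myname) {
    bool pay_attention = false;
    std::string pattern;
    for (const auto& line : lines) {
        if (line.first == "user-agent") {
            // Assumption: exact name match; a real parser is more lenient.
            pay_attention = (line.second == myname);
            if (pay_attention)
                pattern.clear();          // the previous section is overridden
        } else if (line.first == "disallow" && pay_attention) {
            if (!pattern.empty())
                pattern += '|';           // alternatives are joined with '|'
            pattern += line.second;
        }
    }
    return pattern;
}
```

Fed the four lines from the log above, this model returns "/foobar/"
rather than "/cat/|/foobar/", which matches what was reported.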

The old standard was vague on this point, but the examples never showed
more than one User-agent field bearing the same name.  It says the robot
"should be liberal in interpreting this field."  But it also says, with
regard to the "User-agent: *" record, that it is "not allowed to have
multiple such records".

According to the new draft standard, it would appear that both the file
above and the current code are incorrect: a robot should use only the
FIRST matching section, as section 3.2.1 states:


3.2.1 The User-agent line

   Name tokens are used to allow robots to identify themselves via a
   simple product token. Name tokens should be short and to the
   point. The name token a robot chooses for itself should be sent
   as part of the HTTP User-agent header, and must be well documented.

   These name tokens are used in User-agent lines in /robots.txt to
   identify to which specific robots the record applies. The robot
   must obey the first record in /robots.txt that contains a User-
   Agent line whose value contains the name token of the robot as a 
   substring. The name comparisons are case-insensitive. If no such
   record exists, it should obey the first record with a User-agent
   line with a "*" value, if present. If no record satisfied either
   condition, or no records are present at all, access is unlimited.
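
Read literally, that selection rule could be sketched as below.  This
is an illustration under assumptions, not htdig code: the Record
struct, the lower() helper and select_record() are hypothetical names,
and the records are assumed to have been split out of the file already.

```cpp
#include <algorithm>
#include <cctype>
#include <string>
#include <vector>

// Hypothetical in-memory form of one /robots.txt record.
struct Record {
    std::vector<std::string> agents;    // values of its User-agent lines
    std::vector<std::string> disallow;  // values of its Disallow lines
};

// Lower-case copy, so the name comparison is case-insensitive.
static std::string lower(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(), [](unsigned char c) {
        return static_cast<char>(std::tolower(c));
    });
    return s;
}

// Returns the index of the record to obey, or -1 if access is unlimited.
int select_record(const std::vector<Record>& records,
                  const std::string& name) {
    const std::string token = lower(name);
    int wildcard = -1;
    for (std::size_t i = 0; i < records.size(); ++i) {
        for (const std::string& agent : records[i].agents) {
            // First record whose value contains our name token wins.
            if (lower(agent).find(token) != std::string::npos)
                return static_cast<int>(i);
            // Remember only the first "*" record as the fallback.
            if (agent == "*" && wildcard < 0)
                wildcard = static_cast<int>(i);
        }
    }
    return wildcard;
}
```

Under this reading, a second record naming the same robot is never
consulted, because the search stops at the first match.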


To implement this we should do the following when the name matches:

                if (!seen_myname)
                {
                    seen_myname = 1;
                    pay_attention = 1;
                    pattern = 0;
                }
                else
                    pay_attention = 0;

If we don't want a rigorous implementation of the draft (which also
defines use of the Allow record) and prefer a more liberal
interpretation, we can leave off the else clause; the code will then
continue to accept lines from an adjacent User-agent section.  To be
even more liberal, and not even require that the sections be adjacent,
we should always set pay_attention to 1 when the name matches, but clear
the pattern only when the name is first seen.
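
The three variants can be compared side by side in a small standalone
sketch.  Everything here beyond the names seen_myname, pay_attention
and pattern is an assumption for illustration: the Policy enum, the
build_pattern() function, the (field, value) line representation, the
exact-match name test, and the guess that a User-agent line for some
other robot turns pay_attention off.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical switch between the three interpretations discussed above.
enum class Policy { Strict, Adjacent, Liberal };

std::string build_pattern(
    const std::vector<std::pair<std::string, std::string>>& lines,
    const std::string& myname, Policy policy) {
    bool seen_myname = false;
    bool pay_attention = false;
    std::string pattern;
    for (const auto& line : lines) {
        if (line.first == "user-agent") {
            if (line.second == myname) {
                if (!seen_myname) {
                    seen_myname = true;
                    pay_attention = true;
                    pattern.clear();        // cleared only on first sight
                } else if (policy == Policy::Strict) {
                    pay_attention = false;  // draft: first record only
                } else if (policy == Policy::Liberal) {
                    pay_attention = true;   // accept non-adjacent sections
                }
                // Policy::Adjacent: no else clause, flag is left alone.
            } else {
                // Assumption: another robot's section ends ours.
                pay_attention = false;
            }
        } else if (line.first == "disallow" && pay_attention) {
            if (!pattern.empty())
                pattern += '|';
            pattern += line.second;
        }
    }
    return pattern;
}
```

For the robots.txt in the log above, Strict gives "/cat/", while
Adjacent and Liberal both give "/cat/|/foobar/".  The last two differ
only when another robot's section sits between the two htdig sections:
Adjacent then stops at "/cat/", and Liberal still collects both.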

The docs on http://info.webcrawler.com/mak/projects/robots/robots.html
suggest that the draft specification "is not yet completed or implemented,"
so I don't know how rigorously we'd want to enforce it.

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930
