Package: approx
Version: 3.0.0+b1
Severity: normal
Tags: patch

I'm using http://ftp.se.debian.org/debian/ as a mirror, and this site
appears to answer some (but not all) requests with a 302 redirect.
Unfortunately that doesn't work well with approx, which does not seem
to handle redirections properly. My approx configuration file
/etc/approx/approx.conf contains the following lines:

$verbose        true
$debug          true
debian http://ftp.se.debian.org/debian


For example, at the time of writing, the command "apt-get source
unzip" gives the following output. (Note that this example may become
invalid, since which requests the server redirects seems to vary over
time.)

# apt-get source unzip
Reading package lists... Done
Building dependency tree       
Reading state information... Done
Skipping already downloaded file 'unzip_5.52-10.dsc'
Skipping already downloaded file 'unzip_5.52-10.diff.gz'
Need to get 1140kB of source archives.
Err http://localhost testing/main unzip 5.52-10 (tar)
  404 Not Found
Failed to fetch http://localhost:9999/debian/pool/main/u/unzip/unzip_5.52.orig.tar.gz  404 Not Found
E: Failed to fetch some archives.


Looking at the syslog gives an indication of what went wrong:

Mar  5 22:05:25 haakonst1 approx: Version: 3.0.0+b1 
Mar  5 22:05:25 haakonst1 approx: Interface: any 
Mar  5 22:05:25 haakonst1 approx: Port: 9999 
Mar  5 22:05:25 haakonst1 approx: Max rate: unlimited 
Mar  5 22:05:25 haakonst1 approx: User: approx 
Mar  5 22:05:25 haakonst1 approx: Group: approx 
Mar  5 22:05:25 haakonst1 approx: Syslog: daemon 
Mar  5 22:05:25 haakonst1 approx: Pdiffs: true 
Mar  5 22:05:25 haakonst1 approx: Offline: false 
Mar  5 22:05:25 haakonst1 approx: Max wait: 10 
Mar  5 22:05:25 haakonst1 approx: Verbose: true 
Mar  5 22:05:25 haakonst1 approx: Debug: true 
Mar  5 22:05:30 haakonst1 approx: Connection from 127.0.0.1:60339 
Mar  5 22:05:30 haakonst1 approx: Request /debian/pool/main/u/unzip/unzip_5.52.orig.tar.gz
Mar  5 22:05:30 haakonst1 approx:   Host: localhost:9999 
Mar  5 22:05:30 haakonst1 approx:   Connection: keep-alive 
Mar  5 22:05:30 haakonst1 approx:   User-Agent: Debian APT-HTTP/1.3 (0.7.11) 
Mar  5 22:05:30 haakonst1 approx:   => cache miss 
Mar  5 22:05:30 haakonst1 approx: Command: /usr/bin/curl --fail --silent --location  --include "http://ftp.se.debian.org/debian/pool/main/u/unzip/unzip_5.52.orig.tar.gz"
Mar  5 22:05:31 haakonst1 approx:   HTTP/1.1 302 Found 
Mar  5 22:05:31 haakonst1 approx:   Date: Wed, 05 Mar 2008 21:05:30 GMT 
Mar  5 22:05:31 haakonst1 approx:   Server: Apache/2.2.6 (Unix) 
Mar  5 22:05:31 haakonst1 approx:   Location: http://saimei.acc.umu.se/debian/pool/main/u/unzip/unzip_5.52.orig.tar.gz
Mar  5 22:05:31 haakonst1 approx:   Content-Length: 373 
Mar  5 22:05:31 haakonst1 approx:   Content-Type: text/html; charset=iso-8859-1 
Mar  5 22:05:36 haakonst1 approx: Unexpected status code: 302 
Mar  5 22:05:36 haakonst1 approx: http://ftp.se.debian.org/debian/pool/main/u/unzip/unzip_5.52.orig.tar.gz: not found


Approx invokes curl with the --location option, which causes curl to
follow the redirect. However, running the curl command exactly as
approx invoked it gives the following initial output:

# /usr/bin/curl --fail --silent --location --include \
    http://ftp.se.debian.org/debian/pool/main/u/unzip/unzip_5.52.orig.tar.gz \
    | head -n 20

HTTP/1.1 302 Found
Date: Wed, 05 Mar 2008 21:15:15 GMT
Server: Apache/2.2.6 (Unix)
Location: http://saimei.acc.umu.se/debian/pool/main/u/unzip/unzip_5.52.orig.tar.gz
Content-Length: 373
Content-Type: text/html; charset=iso-8859-1

HTTP/1.1 200 OK
Date: Wed, 05 Mar 2008 21:15:15 GMT
Server: Apache/2.2.6 (Unix)
Last-Modified: Tue, 01 Mar 2005 14:47:04 GMT
ETag: "3ae57-116643-7852ea00"
Accept-Ranges: bytes
Content-Length: 1140291
Age: 2489
Content-Type: application/x-gzip

[...]

The problem seems to be that, because the --include option is also
specified, curl outputs the HTTP headers of both responses (this
behaviour is documented in the curl man page), but looking at the
source, the functions in url.ml that parse curl's output apparently
don't take this into account.
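
To illustrate the underlying idea independently of the attached
patches, here is a minimal, self-contained OCaml sketch (standard
library only; the type and function names are made up for this
example). It splits the header lines printed by curl into
per-response blocks and keeps only the last block, i.e. the headers
of the final response:

(* Split the header lines printed by "curl --include --location" into
   blocks (one block per HTTP response, separated by an empty line)
   and keep only the last block, i.e. the headers of the final
   response after any redirects. *)

type field = Status of int | Header of string * string

let parse_line line =
  (* "HTTP/1.1 302 Found" becomes Status 302, "Name: value" becomes
     Header ("Name", "value"); anything else is ignored. *)
  try Scanf.sscanf line "HTTP/%_d.%_d %d %_s@\n" (fun code -> Some (Status code))
  with Scanf.Scan_failure _ | Failure _ | End_of_file ->
    (match String.index_opt line ':' with
     | Some i ->
         let name = String.sub line 0 i in
         let value =
           String.trim (String.sub line (i + 1) (String.length line - i - 1)) in
         Some (Header (name, value))
     | None -> None)

let final_block lines =
  let blocks =
    List.fold_left
      (fun acc line ->
         if String.trim line = "" then [] :: acc  (* blank line: start a new block *)
         else match acc with
           | block :: rest -> (line :: block) :: rest
           | [] -> [ [ line ] ])
      [] lines
  in
  match List.filter (fun b -> b <> []) blocks with
  | last :: _ -> List.rev last  (* head of the list is the last non-empty block *)
  | [] -> []

let () =
  let raw =  (* abbreviated version of the curl output shown above *)
    [ "HTTP/1.1 302 Found"; "Location: http://saimei.acc.umu.se/..."; "";
      "HTTP/1.1 200 OK"; "Content-Length: 1140291"; "" ] in
  List.iter
    (function
     | Status code -> Printf.printf "status %d\n" code
     | Header (h, v) -> Printf.printf "%s = %s\n" h v)
    (List.filter_map parse_line (final_block raw))

The attached patches take a different route, but the essential point
is the same: curl may print more than one header block, and only the
last one describes the data that follows.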

I've attached two alternative patches that attempt to fix this
problem. The first tries to handle a possible additional set of
headers, depending on the status code of each HTTP response that curl
outputs; for this I also moved some of the header parsing code from
approx.ml to url.ml. That became quite unwieldy and required a lot of
changes, so the second patch instead handles redirections in approx
itself and drops the --location option when curl is invoked.
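
Roughly, the control flow of that second approach looks like the
following self-contained sketch, where fake_fetch stands in for a
single curl invocation made without --location (the names and demo
URLs are made up for illustration):

(* Follow Location headers in the client itself, with an upper bound
   on the number of redirects. *)

type fetch_result =
  | Delivered of string          (* final body *)
  | Redirect of string           (* value of the Location header *)
  | Failed of int                (* any other status code *)

let max_redirects = 5

let rec follow fetch redirects url =
  match fetch url with
  | Delivered body -> Some body
  | Redirect location when redirects < max_redirects ->
      Printf.printf "redirected to %s\n" location;
      follow fetch (redirects + 1) location
  | Redirect _ ->
      prerr_endline "Too many redirections"; None
  | Failed code ->
      Printf.eprintf "Unexpected status code: %d\n" code; None

let () =
  (* Fake fetch function, for demonstration purposes only. *)
  let fake_fetch = function
    | "http://mirror/unzip_5.52.orig.tar.gz" ->
        Redirect "http://other-mirror/unzip_5.52.orig.tar.gz"
    | "http://other-mirror/unzip_5.52.orig.tar.gz" ->
        Delivered "tarball contents"
    | _ -> Failed 404
  in
  match follow fake_fetch 0 "http://mirror/unzip_5.52.orig.tar.gz" with
  | Some body -> Printf.printf "got %d bytes\n" (String.length body)
  | None -> print_endline "download failed"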

In any case, I'll admit I don't really know OCaml, so even though
these patches work for me, I won't be surprised if there are problems
with them or if they should be discarded entirely in favour of
something better.


-- System Information:
Debian Release: lenny/sid
  APT prefers unstable
  APT policy: (990, 'unstable'), (500, 'testing'), (500, 'stable')
Architecture: i386 (i686)

Kernel: Linux 2.6.22-2-686 (SMP w/2 CPU cores)
Locale: LANG=en_US, LC_CTYPE=en_US (charmap=ISO-8859-1)
Shell: /bin/sh linked to /bin/bash

Versions of packages approx depends on:
ii  adduser                       3.106      add and remove users and groups
ii  bzip2                         1.0.4-3    high-quality block-sorting file co
ii  curl                          7.18.0-1   Get a file from an HTTP, HTTPS or 
ii  libc6                         2.7-9      GNU C Library: Shared libraries
ii  libpcre3                      7.6-2      Perl 5 Compatible Regular Expressi
ii  lsb-base                      3.2-3      Linux Standard Base 3.2 init scrip

approx recommends no packages.

-- no debconf information

diff -Naur a/approx.ml b/approx.ml
--- a/approx.ml	2007-12-07 23:17:06.000000000 +0000
+++ b/approx.ml	2008-03-05 20:58:55.000000000 +0000
@@ -254,35 +254,20 @@
   close_cache resp.length resp.last_modified;
   if resp.length >= 0L then Delivered else Cached
 
-let with_pair rex str proc =
-  match Pcre.extract ~rex ~full_match: false str with
-  | [| a; b |] -> proc (a, b)
-  | _ -> assert false
-
-let status_re = Pcre.regexp "^HTTP/\\d+\\.\\d+\\s+(\\d{3})\\s+(.*?)\\s*$"
-let header_re = Pcre.regexp "^(.*?):\\s*(.*?)\\s*$"
-
-let process_header resp str =
-  let do_status (code, _) =
-    resp.status <- int_of_string code
-  in
-  let do_header (header, value) =
-    match String.lowercase header with
-    | "content-length" ->
-	(try resp.length <- Int64.of_string value
-	 with Failure _ ->
-	   error_message "Cannot parse Content-Length %s" value)
-    | "last-modified" ->
-	(try resp.last_modified <- Url.time_of_string value
-	 with Invalid_argument _ ->
-	   error_message "Cannot parse Last-Modified date %s" value)
-    | _ -> ()
-  in
-  if debug then debug_message "  %s" str;
-  try with_pair header_re str do_header
-  with Not_found ->  (* e.g., status line or CRLF *)
-    try with_pair status_re str do_status
-    with Not_found -> error_message "Unrecognized response: %s" str
+let process_header resp field =
+  match field with
+    | Url.Status code -> (resp.status <- code)
+    | Url.Header (header, value) ->
+        (match String.lowercase header with
+         | "content-length" ->
+     	(try resp.length <- Int64.of_string value
+     	 with Failure _ ->
+     	   error_message "Cannot parse Content-Length %s" value)
+         | "last-modified" ->
+     	(try resp.last_modified <- Url.time_of_string value
+     	 with Invalid_argument _ ->
+     	   error_message "Cannot parse Last-Modified date %s" value)
+         | _ -> ())
 
 let process_body resp cgi str pos len =
   if resp.status = 200 then
diff -Naur a/url.ml b/url.ml
--- a/url.ml	2007-12-07 23:17:06.000000000 +0000
+++ b/url.ml	2008-03-05 20:58:55.000000000 +0000
@@ -57,26 +57,58 @@
 
 let head_command = curl_command ["--head"]
 
+let with_pair rex str proc =
+  match Pcre.extract ~rex ~full_match: false str with
+  | [| a; b |] -> proc (a, b)
+  | _ -> assert false
+
+let status_re = Pcre.regexp "^HTTP/\\d+\\.\\d+\\s+(\\d{3})\\s+(.*?)\\s*$"
+let header_re = Pcre.regexp "^(.*?):\\s*(.*?)\\s*$"
+
+let next chan =
+  try
+    let header = (input_line chan) in
+    let n = String.length header in
+    if n > 0 && header.[n - 1] = '\r' then
+      Some (String.sub header 0 (n - 1))
+    else
+      begin
+	error_message "Unexpected header: %s" header;
+	None
+    end
+  with End_of_file -> None
+
+type header_field = Status of int | Header of string * string
+
 let iter_headers proc chan =
-  let next () =
-    try Some (input_line chan)
-    with End_of_file -> None
+  let rec skip () = (* skip headers until we see just CRLF *)
+    match next chan with
+    | Some header -> if String.length header > 0 then skip () else ()
+    | None -> ()
   in
   let rec loop () =
-    match next () with
+    match next chan with
     | Some header ->
-	let n = String.length header in
-	if n > 0 && header.[n - 1] = '\r' then
-	  if n > 1 then
-	    begin
-	      proc (String.sub header 0 (n - 1));
-	      loop ()
-	    end
-	  else
-	    (* CRLF terminates headers *)
-	    ()
-	else
-	  error_message "Unexpected header: %s" header
+      if String.length header > 0 then
+	let do_header (header, value) =
+	  proc (Header (header, value))
+	in
+	let do_status (status, _) =
+	let code = int_of_string status in
+	  match code with
+	  | 301 | 302 | 303 | 307 -> skip ()
+	  | _ -> proc (Status code)
+	in
+	if debug then debug_message "  %s" header;
+	(try with_pair header_re header do_header
+	   with Not_found ->  (* e.g., status line or CRLF *)
+	     try with_pair status_re header do_status
+	     with Not_found ->
+	       error_message "Unrecognized response: %s" header);
+	loop ()
+      else
+	(* CRLF terminates headers *)
+	()
     | None -> ()
   in
   loop ()
diff -Naur a/url.mli b/url.mli
--- a/url.mli	2007-12-07 23:17:06.000000000 +0000
+++ b/url.mli	2008-03-05 20:58:55.000000000 +0000
@@ -22,10 +22,12 @@
 
 val protocol : string -> protocol
 
+type header_field = Status of int | Header of string * string
+
 (* Perform HTTP HEAD (or equivalent for FTP and FILE) on the given URL
    and apply a callback to each header that is returned *)
 
-val head : string -> (string -> unit) -> unit
+val head : string -> (header_field -> unit) -> unit
 
 (* Download the specified URL with optional request headers,
    then apply callbacks to the headers and body chunks *)
@@ -33,7 +35,7 @@
 val download :
   string ->
   ?headers:string list ->
-  ?header_callback:(string -> unit) ->
+  ?header_callback:(header_field -> unit) ->
   (string -> int -> int -> unit) -> unit
 
 (* Download a file from a remote repository *)
diff -Naur a/approx.ml b/approx.ml
--- a/approx.ml	2007-12-07 23:17:06.000000000 +0000
+++ b/approx.ml	2008-03-05 20:59:10.000000000 +0000
@@ -37,6 +37,8 @@
   done;
   if not !foreground then use_syslog ()
 
+let max_redirects = 5
+
 let stat_file name =
   try Some (stat name) with Unix_error (ENOENT, "stat", _) -> None
 
@@ -239,6 +241,7 @@
     mutable status : int;
     mutable length : int64;
     mutable last_modified : float;
+    mutable location : string;
     mutable body_seen : bool }
 
 let initial_response_state =
@@ -246,6 +249,7 @@
     status = 0;
     length = -1L;
     last_modified = 0.;
+    location = "?";
     body_seen = false }
 
 let new_response name = { initial_response_state with name = name }
@@ -276,6 +280,7 @@
 	(try resp.last_modified <- Url.time_of_string value
 	 with Invalid_argument _ ->
 	   error_message "Cannot parse Last-Modified date %s" value)
+    | "location" -> (resp.location <- value)
     | _ -> ()
   in
   if debug then debug_message "  %s" str;
@@ -306,15 +311,26 @@
   let headers =
     if ims > 0. then [ "If-Modified-Since: " ^ Url.string_of_time ims ] else []
   in
-  let resp = new_response name in
-  let header_callback = process_header resp in
-  let body_callback = process_body resp cgi in
-  Url.download url ~headers ~header_callback body_callback;
-  match resp.status with
-  | 200 -> finish_delivery resp
-  | 304 -> Not_modified
-  | 404 -> File_not_found
-  | n -> error_message "Unexpected status code: %d" n; File_not_found
+  let rec download_redirect redirects url =
+    let resp = new_response name in
+    let header_callback = process_header resp in
+    let body_callback = process_body resp cgi in
+    Url.download url ~headers ~header_callback body_callback;
+    match resp.status with
+    | 200 -> finish_delivery resp
+    | 301 | 302 | 303 | 307 ->
+        if redirects <= max_redirects then
+          download_redirect (redirects+1) resp.location
+        else
+          begin
+            error_message "Too many redirections";
+            File_not_found
+          end
+    | 304 -> Not_modified
+    | 404 -> File_not_found
+    | n -> error_message "Unexpected status code: %d" n; File_not_found
+  in
+  download_redirect 0 url
 
 (* Download a file from an FTP repository *)
 
diff -Naur a/url.ml b/url.ml
--- a/url.ml	2007-12-07 23:17:06.000000000 +0000
+++ b/url.ml	2008-03-05 20:59:10.000000000 +0000
@@ -52,7 +52,7 @@
   | str -> "--limit-rate " ^ str
 
 let curl_command options url =
-  sprintf "/usr/bin/curl --fail --silent --location %s %s %s"
+  sprintf "/usr/bin/curl --fail --silent %s %s %s"
     rate_option (String.concat " " options) (quoted_string url)
 
 let head_command = curl_command ["--head"]
