Please test the patch I wrote; it's very important.
I have been thinking about the Retriever code while waiting for the new Configuration to become available, which will let us set attributes for each server; right now, persistent connections are configurable only for the whole retrieving system.
The starting point is that every server has a flag for persistent connections. When can we use persistent connections? Only if we choose to use them, of course, and only if the remote server can support them. The first condition can be checked through the initial configuration (both the current one and the future one): no problem there.
The second condition can be discovered only after the first HTTP request. And when is that? Inside the Server constructor, when retrieving the robots.txt file. After this call, we can determine whether a server supports persistent connections and set the _persistent_connections flag of the Server class accordingly.
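For the curious, the keep-alive test after the robots.txt request boils down to something like the sketch below. This is a hedged simplification, not the actual HtHTTP code: the function name and the header handling here are my own stand-ins.

```cpp
#include <cassert>
#include <string>

// Hypothetical simplification (not the real HtHTTP API): given the
// status line and the Connection header of the robots.txt response,
// decide whether the server can keep the connection open.
// HTTP/1.1 connections are persistent unless "Connection: close" is
// sent; HTTP/1.0 connections need an explicit "Connection: keep-alive".
bool isPersistentConnectionPossible(const std::string &statusLine,
                                    const std::string &connectionHeader)
{
    if (statusLine.compare(0, 8, "HTTP/1.1") == 0)
        return connectionHeader != "close";
    return connectionHeader == "keep-alive";
}
```

In the patch, the real check is done once per server, right after doc.Retrieve() in the Server constructor, so no extra request is needed.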
So we now know whether we can ask a server for only one document per visit (the server doesn't use persistent connections) or for several. In my patch I let it run indefinitely, but, as Geoff suggests, we could introduce a new attribute, "server_repeat_connections" (or maybe another name, like "max_consecutive_requests"), to determine how many requests can be made consecutively. The default could be -1 (unlimited), but we could set a maximum. Let me know if you vote for it or not.
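To make the proposal concrete, here is a tiny sketch of how the attribute could drive the loop guard. The names are only my suggestion; nothing here is final:

```cpp
#include <cassert>

// Sketch of the proposed per-server cap (attribute name still to be
// decided: "server_repeat_connections" or "max_consecutive_requests").
// -1 means no limit; a server without persistent connections gets 1.
int maxRepeatRequests(bool persistentAllowed, int configuredMax)
{
    return persistentAllowed ? configuredMax : 1;
}

// Guard of the inner retrieval loop: keep going while unlimited, or
// while we are still under the cap.
bool mayRequestAgain(int count, int maxRepeat)
{
    return maxRepeat == -1 || count < maxRepeat;
}
```

With the attribute in place, the only change to the patch would be reading configuredMax from the configuration instead of hard-coding -1.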
To do that, I modified Retriever.cc (obviously), but I also modified Server.cc so it can actually decide whether a server supports persistent connections. That in turn required modifying Document.h to add a public method that returns a pointer to the private HTTPConnect attribute (HtHTTP *GetHTTPHandler()).
So we have three loops; from the outermost inward they are:
1) while (more && noSignal)
2) while ( (server = (Server *)servers.Get_NextElement()) && noSignal)
And 3)
+ while ( ( (max_repeat_requests ==-1) ||
+ (count < max_repeat_requests) ) &&
+ (ref = server->pop()) && noSignal)
+ {
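If it helps to see the three loops together, here is a self-contained toy version of the new structure. The Server struct and retrieve() function below are simplified stand-ins of my own, not the actual htdig classes:

```cpp
#include <cassert>
#include <deque>
#include <string>
#include <vector>

// Minimal stand-in for the real Server class (hypothetical
// simplification of Retriever::Start(), not the actual htdig code).
struct Server {
    std::deque<std::string> queue;   // URLs still to fetch
    bool persistent;                 // keep-alive supported?
};

// Round-robin over servers; drain a persistent server's queue in one
// visit (max_repeat_requests == -1), take a single URL otherwise.
std::vector<std::string> retrieve(std::vector<Server> &servers)
{
    std::vector<std::string> fetched;
    bool more = true;
    while (more) {                                   // loop 1
        more = false;
        for (Server &server : servers) {             // loop 2
            int max_repeat = server.persistent ? -1 : 1;
            int count = 0;
            while ((max_repeat == -1 || count < max_repeat) &&
                   !server.queue.empty()) {          // loop 3
                ++count;
                more = true;                         // not done yet
                fetched.push_back(server.queue.front());
                server.queue.pop_front();
            }
        }
    }
    return fetched;
}
```

With a persistent server, its queue is drained in one visit; a non-persistent one yields a single URL per pass, so the load still rotates among the remaining servers.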
I hope I have been clear. Please try the patch and let me know what you think about it, so I can commit it as soon as possible and include persistent connections in the new release.
It seems to work in my environment, but I am going to leave it running overnight (I'm leaving work just now), so tomorrow I can be more precise.
But please, please, please ... TRY IT !!! It won't crash your machine, I swear.
And let me know about the attribute, and maybe the right name for it.
Ciao ciao :-)
-Gabriele
Index: htdig/Document.h
===================================================================
RCS file: /opt/htdig/cvs/htdig3/htdig/Document.h,v
retrieving revision 1.10.2.5
diff -3 -u -p -r1.10.2.5 Document.h
--- htdig/Document.h 2000/01/14 01:23:43 1.10.2.5
+++ htdig/Document.h 2000/01/20 17:49:44
@@ -74,6 +74,8 @@ public:
//
void setUsernamePassword(const char *credentials)
{ authorization = credentials;}
+
+ HtHTTP *GetHTTPHandler() { return HTTPConnect; }
private:
enum
Index: htdig/Retriever.cc
===================================================================
RCS file: /opt/htdig/cvs/htdig3/htdig/Retriever.cc,v
retrieving revision 1.72.2.15
diff -3 -u -p -r1.72.2.15 Retriever.cc
--- htdig/Retriever.cc 2000/01/20 03:55:47 1.72.2.15
+++ htdig/Retriever.cc 2000/01/20 17:49:46
@@ -297,43 +297,83 @@ Retriever::Start()
while (more && noSignal)
{
- more = 0;
+ more = 0;
//
- // Go through all the current servers in sequence. We take only one
- // URL from each server during this loop. This ensures that the load
- // on the servers is distributed evenly.
+ // Go through all the current servers in sequence.
+ // If they support persistent connections, we keep on popping
+ // from the server queue until we reach a maximum number of
+ // consecutive requests (so we will probably have to issue a new
+ // attribute, like "server_repeat_connections"). Or the loop may
+ // continue for the infinite, if we set the max to -1 (and maybe
+ // the attribute too).
+ // If the server doesn't support persistent connection, we take
+ // only an URL from it, then we skip to the next server.
//
+
+ // Let's position at the beginning
servers.Start_Get();
+
+ int count;
+
+ // Maximum number of repeated requests with the same
+ // socket connection.
+ int max_repeat_requests;
+
while ( (server = (Server *)servers.Get_NextElement()) && noSignal)
{
if (debug > 1)
- cout << "pick: " << server->host() << ", # servers = " <<
+ cout << "pick: " << server->host() << ", # servers = " <<
servers.Count() << endl;
- ref = server->pop();
- if (!ref)
- continue; // Nothing on this server
- // There may be no more documents, or the server
- // has passed the server_max_docs limit
-
- //
- // We have a URL to index, now. We need to register the
- // fact that we are not done yet by setting the 'more'
- // variable.
- //
- more = 1;
-
- //
- // Deal with the actual URL.
- // We'll check with the server to see if we need to sleep()
- // before parsing it.
- //
- server->delay(); // This will pause if needed and reset the time
- parse_url(*ref);
- delete ref;
- }
+ // and if the Server doesn't support persistent connections
+ // turn it down to 1.
+
+ // We already know if a server supports HTTP pers. connections,
+ // because we asked it for the robots.txt file (constructor of
+ // the class).
+
+ if (server->IsPersistentConnectionAllowed())
+ // Once the new attribute is set
+ // max_repeat_requests=config["server_repeat_connections"];
+ max_repeat_requests = -1; // Set to -1 (infinite loop)
+ else
+ max_repeat_requests = 1;
+
+ count = 0;
+
+ while ( ( (max_repeat_requests ==-1) ||
+ (count < max_repeat_requests) ) &&
+ (ref = server->pop()) && noSignal)
+ {
+ count ++;
+
+ //
+ // We have a URL to index, now. We need to register the
+ // fact that we are not done yet by setting the 'more'
+ // variable. So, we have to restart scanning the queue.
+ //
+
+ more = 1;
+
+ //
+ // Deal with the actual URL.
+ // We'll check with the server to see if we need to sleep()
+ // before parsing it.
+ //
+
+ parse_url(*ref);
+ delete ref;
+
+ // No HTTP connections available, so we change server and pause
+ if (max_repeat_requests == 1)
+ server->delay(); // This will pause if needed
+ // and reset the time
+
+ }
+ }
}
+
// if we exited on signal
if (Retriever_noLog != log && !noSignal)
{
Index: htdig/Server.cc
===================================================================
RCS file: /opt/htdig/cvs/htdig3/htdig/Server.cc,v
retrieving revision 1.17.2.6
diff -3 -u -p -r1.17.2.6 Server.cc
--- htdig/Server.cc 1999/12/11 16:19:47 1.17.2.6
+++ htdig/Server.cc 2000/01/20 17:49:47
@@ -21,6 +21,7 @@
#include "Document.h"
#include "URLRef.h"
#include "Transport.h"
+#include "HtHTTP.h" // for checking persistent connections
#include <ctype.h>
@@ -38,8 +39,10 @@ Server::Server(URL u, String *local_robo
_port = u.port();
_bad_server = 0;
_documents = 0;
- _persistent_connections = 1; // Allowed by default
+ // We take it from the configuration
+ _persistent_connections = config.Boolean("persistent_connections");
+
_max_documents = config.Value("server",_host,"server_max_docs", -1);
_connection_space = config.Value("server",_host,"server_wait_time", 0);
_last_connection.SettoNow(); // For getting robots.txt
@@ -78,7 +81,23 @@ Server::Server(URL u, String *local_robo
}
}
else if (!local_urls_only)
+ {
status = doc.Retrieve(timeZero);
+
+ // Let's check if persistent connections are both
+ // allowed by the configuration and possible after
+ // having requested the robots.txt file.
+
+ HtHTTP *http;
+ if (IsPersistentConnectionAllowed() &&
+ (http = doc.GetHTTPHandler()))
+ {
+ if (! http->isPersistentConnectionPossible())
+ _persistent_connections=0; // not possible. Let's disable
+ // them on this server.
+ }
+
+ }
else
status = Transport::Document_not_found;
-------------------------------------------------
Gabriele Bartolini
Computer Programmer (are U sure?)
U.O. Rete Civica - Comune di Prato
Prato - Italia - Europa
e-mail: [EMAIL PROTECTED]
http://www.po-net.prato.it
-------------------------------------------------
Zinedine "Zizou" Zidane. Just for soccer lovers.
-------------------------------------------------
