Re: HttpClient 4.0 encoding madness

sebb Sat, 30 Jan 2010 04:47:23 -0800

On 30/01/2010, amoldavsky <[email protected]> wrote:
>
>  Hi,
>
>  This solution worked out very well:


Glad you finally got there.

>  byte[] tmp = new byte[bufferSize];
>
>     int bytesRead;
>     try
>     {
>       if((bytesRead = chunkedIns.read(tmp)) != -1)
>       {
>         return new String(tmp, 0, bytesRead);
>       }
>       else
>       {
>         finish();
>         return null;
>       }
>     }
>     catch(IOException e)
>     {
>       HTTPClientException e2 = new HTTPClientException(e.getMessage());
>       e2.setStackTrace(e.getStackTrace());
>       throw e2;
>     }
>
>
>
> If it's not too much trouble, would anybody please explain to me why it is
>  possible that the buffer may not be 100% full when I read it? I think it
>  all depends on how the implementation was done (in this case by Sun), and
>  if Sun decided to implement buffering this way I don't understand the
>  logic behind it.

In that case you had better ask the Oracle.
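
More seriously: InputStream.read(byte[]) is only specified to block until
some data is available and then return however many bytes it stored. If a
caller really wants full buffers, the usual approach is to loop on the
return value. Below is a minimal sketch of that idea. The class
ShortReadDemo, the helper readFully and the TrickleStream wrapper (which
just simulates a stream that hands out at most 3 bytes per call) are names
made up for this example; they are not part of HttpClient.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

class ShortReadDemo {

    /** Hypothetical wrapper that returns at most maxPerRead bytes per call. */
    static final class TrickleStream extends FilterInputStream {
        private final int maxPerRead;

        TrickleStream(InputStream in, int maxPerRead) {
            super(in);
            this.maxPerRead = maxPerRead;
        }

        @Override
        public int read(byte[] b, int off, int len) throws IOException {
            // Cap the request so callers see short reads, as they can on a socket.
            return super.read(b, off, Math.min(len, maxPerRead));
        }
    }

    /**
     * Keeps reading until buf is full or the stream ends.
     * Returns the number of bytes stored in buf (0 once the stream is exhausted).
     */
    static int readFully(InputStream in, byte[] buf) throws IOException {
        int total = 0;
        while (total < buf.length) {
            int n = in.read(buf, total, buf.length - total);
            if (n == -1) {
                break; // end of stream: the last chunk may be short
            }
            total += n;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "0123456789".getBytes(StandardCharsets.US_ASCII);
        byte[] buf = new byte[8];

        // A single read() call may stop well short of buf.length.
        InputStream once = new TrickleStream(new ByteArrayInputStream(data), 3);
        System.out.println("single read(): " + once.read(buf) + " of " + buf.length);

        // Looping on the return value fills the buffer whenever data remains.
        InputStream looped = new TrickleStream(new ByteArrayInputStream(data), 3);
        int n;
        while ((n = readFully(looped, buf)) > 0) {
            System.out.println("readFully(): " + n + " -> "
                    + new String(buf, 0, n, StandardCharsets.US_ASCII));
        }
    }
}
```

Even with such a loop the final chunk is normally shorter than the buffer,
so the byte count still has to be passed on to whatever consumes the data.
The JDK's DataInputStream.readFully does a similar loop but throws
EOFException if the stream ends before the buffer is full.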

>
>  Thank you very much Oleg, Ken and Seb-2-2 for your earlier inputs!
>
>
>
>
>  sebb-2-2 wrote:
>  >
>  > On 29/01/2010, amoldavsky <[email protected]> wrote:
>  >>
>  >>  Hi Oleg,
>  >>
>  >>  Let me rephrase the question in better terms:
>  >>  If the server document is Y bytes and the buffer size is X, and let's
>  >>  even assume that Y = kX where X < Y, is it possible that any buffer
>  >>  0 < x < (k-1) will not be fully filled?
>  >
>  > Remember that HTTP packets may be broken up in transit.
>  >
>  > However, even without that, it's never safe to assume that a buffer is
>  > filled.
>  >
>  > That's what the return value from read(buffer) is for - it tells you
>  > how many bytes were actually read into the buffer.
>  >
>  >>  Thanks!
>  >>  -Assaf
>  >>
>  >>
>  >>
>  >>  Ken Krugler wrote:
>  >>  >
>  >>  >
>  >>  > On Jan 28, 2010, at 10:09pm, amoldavsky wrote:
>  >>  >
>  >>  >>
>  >>  >> Hi Oleg,
>  >>  >> Thank you for the quick reply.
>  >>  >>
>  >>  >> So if there is a possibility that not the whole buffer is filled, how
>  >>  >> can I ensure or force HttpClient to fill the whole buffer? Should I
>  >>  >> maybe avoid Stream Readers altogether?
>  >>  >
>  >>  > If bufferSize is X, and the server document you're fetching has Y
>  >>  > bytes, then what do you mean by "force HttpClient to fill the whole
>  >>  > buffer"?
>  >>  >
>  >>  > At a minimum, you'd want
>  >>  >
>  >>  > int bytesRead = chunkedIns.read(tmp);
>  >>  > if (bytesRead != -1) {
>  >>  >     return new String(tmp, 0, bytesRead);
>  >>  > }
>  >>  >
>  >>  > But that also uses the platform default encoding for the character
>  >>  > set, which often won't be correct.
>  >>  >
>  >>  > -- Ken
>  >>  >
>  >>  >>
>  >>  >> olegk wrote:
>  >>  >>>
>  >>  >>> On Wed, 2010-01-27 at 20:24 -0800, amoldavsky wrote:
>  >>  >>>> Hi
>  >>  >>>>
>  >>  >>>> I have coded a simple file downloader using HttpClient 4.0.
>  >>  >>>> It works fine but there is something wrong with the String encoding
>  >>  >>>> or the buffer stream. The problem is that there are long sequences
>  >>  >>>> of "NULL" (ASCII code 00) throughout the final file, like this:
>  >>  >>>> http://old.nabble.com/file/p27350930/httpclient_error01.jpg
>  >>  >>>> http://old.nabble.com/file/p27350930/httpclient_error02.jpg
>  >>  >>>>
>  >>  >>>> Here is the main code:
>  >>  >>>>
>  >>  >>>>  public String getChunk(String url, int bufferSize) throws HTTPClientException
>  >>  >>>>  {
>  >>  >>>>    if(!chunkedStarted)
>  >>  >>>>    {
>  >>  >>>>      chunkedIns = getInputStream(url);
>  >>  >>>>      chunkedStarted = true;
>  >>  >>>>    }
>  >>  >>>>
>  >>  >>>>    byte[] tmp = new byte[bufferSize];
>  >>  >>>>    try
>  >>  >>>>    {
>  >>  >>>>      if(chunkedIns.read(tmp) != -1)
>  >>  >>>>      {
>  >>  >>>
>  >>  >>> What makes you think that the entire buffer will be filled with data?
>  >>  >>>
>  >>  >>> Oleg
>  >>  >>>
>  >>  >>>
>  >>  >>>>        return new String(tmp);
>  >>  >>>>      }
>  >>  >>>>      else
>  >>  >>>>      {
>  >>  >>>>        finish();
>  >>  >>>>        return null;
>  >>  >>>>      }
>  >>  >>>>    }
>  >>  >>>>    catch(IOException e)
>  >>  >>>>    {
>  >>  >>>>      HTTPClientException e2 = new HTTPClientException(e.getMessage());
>  >>  >>>>      e2.setStackTrace(e.getStackTrace());
>  >>  >>>>      throw e2;
>  >>  >>>>    }
>  >>  >>>>  }
>  >>  >>>>
>  >>  >>>>  public void finish()
>  >>  >>>>  {
>  >>  >>>>    // do some cleaning
>  >>  >>>>  }
>  >>  >>>>
>  >>  >>>>  private InputStream getInputStream(String url) throws HTTPClientException
>  >>  >>>>  {
>  >>  >>>>    InputStream instream = null;
>  >>  >>>>
>  >>  >>>>    httpClient = new DefaultHttpClient();
>  >>  >>>>    httpClient.getParams().setParameter("http.useragent", AGENT_NAME);
>  >>  >>>>
>  >>  >>>>    HttpGet httpGet = new HttpGet(url);
>  >>  >>>>    HttpResponse response = null;
>  >>  >>>>
>  >>  >>>>    try
>  >>  >>>>    {
>  >>  >>>>      response = httpClient.execute(httpGet);
>  >>  >>>>      HttpEntity entity = response.getEntity();
>  >>  >>>>
>  >>  >>>>      if(entity != null)
>  >>  >>>>      {
>  >>  >>>>        instream = entity.getContent();
>  >>  >>>>      }
>  >>  >>>>    }
>  >>  >>>>    catch(ClientProtocolException e)
>  >>  >>>>    {
>  >>  >>>>      HTTPClientException e2 = new HTTPClientException(e.getMessage());
>  >>  >>>>      e2.setStackTrace(e.getStackTrace());
>  >>  >>>>      throw e2;
>  >>  >>>>    }
>  >>  >>>>    catch(IOException e)
>  >>  >>>>    {
>  >>  >>>>      HTTPClientException e2 = new HTTPClientException(e.getMessage());
>  >>  >>>>      e2.setStackTrace(e.getStackTrace());
>  >>  >>>>      throw e2;
>  >>  >>>>    }
>  >>  >>>>
>  >>  >>>>    return instream;
>  >>  >>>>  }
>  >>  >>>>
>  >>  >>>> getChunk and getInputStream could basically be one method, but I
>  >>  >>>> need to split them for internal convenience; that does not change
>  >>  >>>> the functionality as a whole.
>  >>  >>>>
>  >>  >>>> It seems like either the conversion from bytes to string is a
>  >>  >>>> problem:
>  >>  >>>> return new String(tmp);
>  >>  >>>>
>  >>  >>>> or the buffer is not getting filled to the end. The latter should
>  >>  >>>> not be possible because the files are ~30 MB each and the buffer
>  >>  >>>> size is 2 KB.
>  >>  >>>>
>  >>  >>>> I have attached the file, it's a CSV (shortened to ~6 KB). Note the
>  >>  >>>> long white space between some of the URLs; if you just remove it,
>  >>  >>>> the URL makes sense.
>  >>  >>>> http://old.nabble.com/file/p27350930/datafeed.csv datafeed.csv
>  >>  >>>>
>  >>  >>>> Where can this white space (null) come from?
>  >>  >>>>
>  >>  >>>> Thanks!
>  >>  >>>
>  >>  >>>
>  >>  >>>
>  >>  >>>
>  >> ---------------------------------------------------------------------
>  >>  >>> To unsubscribe, e-mail: [email protected]
>  >>  >>> For additional commands, e-mail: [email protected]
>  >>  >>>
>  >>  >>>
>  >>  >>>
>  >>  >>
>  >>  >> --
>  >>  >> View this message in context:
>  >>  >> http://old.nabble.com/HttpClient-4.0-encoding-madness-tp27350930p27366928.html
>  >>  >> Sent from the HttpClient-User mailing list archive at Nabble.com.
>  >>  >>
>  >>  >>
>  >>  >>
>  >>  >
>  >>  > --------------------------------------------
>  >>  > Ken Krugler
>  >>  > +1 530-210-6378
>  >>  > http://bixolabs.com
>  >>  > e l a s t i c   w e b   m i n i n g
>  >>  >
>  >>  >
>  >>  >
>  >>  >
>  >>  >
>  >>  >
>  >>
>  >>  --
>  >>
>  >> View this message in context:
>  >> http://old.nabble.com/HttpClient-4.0-encoding-madness-tp27350930p27377093.html
>  >>
>  >> Sent from the HttpClient-User mailing list archive at Nabble.com.
>  >>
>  >>
>  >>
>  >>
>  >
>  >
>  >
>  >
>
>  --
>
> View this message in context: 
> http://old.nabble.com/HttpClient-4.0-encoding-madness-tp27350930p27381546.html
>
> Sent from the HttpClient-User mailing list archive at Nabble.com.
>
>
>
>
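
One more note on Ken's point about the platform default encoding, since
the working version above still calls new String(tmp, 0, bytesRead) on
each chunk. Besides depending on the default charset, decoding chunk by
chunk can split a multi-byte character across two reads. A rough sketch of
the difference follows; it assumes the content is UTF-8 (in practice the
charset should come from the response's Content-Type header), and the class
and method names are invented for the illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

class CharsetChunkDemo {

    /** Decodes through a Reader, which carries partial characters across reads. */
    static String decodeWithReader(InputStream in, int bufferSize) throws IOException {
        // UTF-8 is an assumption made for this sketch.
        Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8);
        char[] buf = new char[bufferSize];
        StringBuilder out = new StringBuilder();
        int n;
        while ((n = reader.read(buf)) != -1) {
            out.append(buf, 0, n); // only the first n chars are valid
        }
        return out.toString();
    }

    /** Decodes every byte chunk on its own; characters straddling a boundary break. */
    static String decodePerChunk(InputStream in, int bufferSize) throws IOException {
        byte[] buf = new byte[bufferSize];
        StringBuilder out = new StringBuilder();
        int n;
        while ((n = in.read(buf)) != -1) {
            out.append(new String(buf, 0, n, StandardCharsets.UTF_8));
        }
        return out.toString();
    }

    public static void main(String[] args) throws IOException {
        String text = "na\u00efve caf\u00e9"; // contains two 2-byte UTF-8 characters
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);

        String viaReader = decodeWithReader(new ByteArrayInputStream(bytes), 3);
        String perChunk = decodePerChunk(new ByteArrayInputStream(bytes), 3);

        System.out.println("reader matches original:    " + text.equals(viaReader));
        System.out.println("per-chunk matches original: " + text.equals(perChunk));
    }
}
```

With a 3-byte buffer the first chunk ends in the middle of the "\u00ef"
character, so the per-chunk version produces replacement characters while
the Reader version returns the original text. For a plain file download it
may be simpler still to write the bytes straight to an OutputStream and
not convert to String at all.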

