Re: tail

2022-05-08 Thread Barry


> On 7 May 2022, at 17:29, Marco Sulla  wrote:
> 
> On Sat, 7 May 2022 at 16:08, Barry  wrote:
>> You need to handle the file in bin mode and do the handling of line endings 
>> and encodings yourself. It’s not that hard for the cases you wanted.
> 
 "\n".encode("utf-16")
> b'\xff\xfe\n\x00'
 "".encode("utf-16")
> b'\xff\xfe'
 "a\nb".encode("utf-16")
> b'\xff\xfea\x00\n\x00b\x00'
 "\n".encode("utf-16").lstrip("".encode("utf-16"))
> b'\n\x00'
> 
> Can I use the last trick to get the encoding of a LF or a CR in any encoding?

In a word no.

There are cases where you just have to know the encoding you are working with:
utf-16, for example, because you have to deal with the data in 2-byte units
and know whether it is big endian or little endian.
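
For example (on a little-endian machine):

>>> "\n".encode("utf-16-le")
b'\n\x00'
>>> "\n".encode("utf-16-be")
b'\x00\n'
>>> "\n".encode("utf-16")
b'\xff\xfe\n\x00'

The bare "utf-16" codec prepends a BOM and uses native byte order, so the
byte pattern of a newline is not even a constant for that one encoding.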

There will be other encodings that are also difficult.

But if you are working with an encoding that uses ASCII as a base,
like Unicode encoded as utf-8 or the iso-8859 series, then you can just look
for NL and CR using the ASCII values of those bytes.
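
A minimal sketch of that idea (my names; it assumes the last n lines fit in
one 64 KiB block and that the block cut never has to be decoded
mid-character):

import os

def tail_lines(path, n=10, encoding="utf-8"):
    # Only valid for ASCII-compatible encodings (utf-8, iso-8859-*),
    # where the NL/CR byte values never occur inside another character.
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)
        size = f.tell()
        f.seek(max(0, size - 64 * 1024))
        data = f.read()
    return [line.decode(encoding) for line in data.splitlines()[-n:]]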

In short: once you set your requirements, you will know which problems
you can avoid and which you must solve.

Is utf-16 important to you? If not, no need to solve its issues.

Barry





Re: tail

2022-05-08 Thread Marco Sulla
I think I've _almost_ found a simpler, general way:

import os

_lf = "\n"
_cr = "\r"

def tail(filepath, n=10, newline=None, encoding=None, chunk_size=100):
    n_chunk_size = n * chunk_size
    pos = os.stat(filepath).st_size
    chunk_line_pos = -1
    lines_not_found = n

    with open(filepath, newline=newline, encoding=encoding) as f:
        text = ""

        hard_mode = False

        if newline is None:
            newline = _lf
        elif newline == "":
            hard_mode = True

        if hard_mode:
            # newline="" means universal newlines: count \n, \r and \r\n
            while pos != 0:
                pos -= n_chunk_size

                if pos < 0:
                    pos = 0

                f.seek(pos)
                text = f.read()
                lf_after = False

                for i, char in enumerate(reversed(text)):
                    if char == _lf:
                        lf_after = True
                    elif char == _cr:
                        lines_not_found -= 1

                        # \r\n counts as one newline of two chars
                        newline_size = 2 if lf_after else 1

                        lf_after = False
                    elif lf_after:
                        lines_not_found -= 1
                        newline_size = 1
                        lf_after = False

                    if lines_not_found == 0:
                        chunk_line_pos = len(text) - 1 - i + newline_size
                        break

                if lines_not_found == 0:
                    break
        else:
            while pos != 0:
                pos -= n_chunk_size

                if pos < 0:
                    pos = 0

                f.seek(pos)
                text = f.read()

                for i, char in enumerate(reversed(text)):
                    if char == newline:
                        lines_not_found -= 1

                        if lines_not_found == 0:
                            chunk_line_pos = len(text) - 1 - i + len(newline)
                            break

                if lines_not_found == 0:
                    break

    if chunk_line_pos == -1:
        chunk_line_pos = 0

    return text[chunk_line_pos:]


In short: the file is always opened in text mode. The file is read from the
end in bigger and bigger chunks, until the file is exhausted or all the
lines are found.

Why? Because in encodings that have more than 1 byte per character, reading
a chunk of n bytes, then reading the previous chunk, can end up splitting a
character across the two chunks.
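
A tiny demonstration of the problem, in binary mode and utf-8 (my own
example):

text = "αβγ"                     # Greek letters: 2 bytes each in utf-8
data = text.encode("utf-8")      # 6 bytes
head, rest = data[:3], data[3:]  # 3 is not a character boundary
head.decode("utf-8")             # raises UnicodeDecodeError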

I think one could instead read chunk by chunk and test for the chunk-junction
problem; I suppose the code would be faster that way. Anyway, this trick
seems quite fast as it is, and it's a lot simpler.

The final result is read from the chunk, and not from the file, so there are
no problems of misalignment between bytes and text. Furthermore, the builtin
encoding parameter is used, so this should work with all the encodings
(untested).

Furthermore, a newline parameter can be specified, as in open(). If it's
equal to the empty string, things are a little more complicated, but I
suppose the code is clear. It's largely untested; I only tested with a utf-8
Linux file.

Do you think there's any chance of getting this function in as a method of
the file object in CPython? The method for a file object opened in bytes
mode is simpler, since there's no encoding and a newline is only \n in that
case.


Re: tail

2022-05-08 Thread Barry Scott



> On 7 May 2022, at 14:40, Stefan Ram  wrote:
> 
> Marco Sulla  writes:
>> So there's no way to reliably read lines in reverse in text mode using
>> seek and read, but the only option is readlines?
> 
>  I think, CPython is based on C. I don't know whether
>  Python's seek function directly calls C's fseek function,
>  but maybe the following parts of the C standard also are
>  relevant for Python?

There is the POSIX API and the C FILE API.

I expect that the oddities you mention about NUL chars are all about the FILE
API. As far as I know it's the POSIX API that CPython uses, and it does not
suffer from issues with binary files.

Barry

> 
> |Setting the file position indicator to end-of-file, as with
> |fseek(file, 0, SEEK_END), has undefined behavior for a binary
> |stream (because of possible trailing null characters) or for
> |any stream with state-dependent encoding that does not
> |assuredly end in the initial shift state.
> from a footnote in a draft of a C standard
> 
> |For a text stream, either offset shall be zero, or offset
> |shall be a value returned by an earlier successful call to
> |the ftell function on a stream associated with the same file
> |and whence shall be SEEK_SET.
> from a draft of a C standard
> 
> |A text stream is an ordered sequence of characters composed
> |into lines, each line consisting of zero or more characters
> |plus a terminating new-line character. Whether the last line
> |requires a terminating new-line character is implementation-defined.
> from a draft of a C standard
> 
>  This might mean that reading from a text stream that is not
>  ending in a new-line character might have undefined behavior
>  (depending on the C implementation). In practice, it might
>  mean that some things could go wrong near the end of such
>  a stream. 
> 
> 


Re: tail

2022-05-08 Thread Barry Scott



> On 7 May 2022, at 22:31, Chris Angelico  wrote:
> 
> On Sun, 8 May 2022 at 07:19, Stefan Ram  wrote:
>> 
>> MRAB  writes:
>>> On 2022-05-07 19:47, Stefan Ram wrote:
>> ...
>>>> import pathlib
>>>> 
>>>> def encoding( name ):
>>>>     path = pathlib.Path( name )
>>>>     for encoding in( "utf_8", "latin_1", "cp1252" ):
>>>>         try:
>>>>             with path.open( encoding=encoding, errors="strict" ) as file:
>>>>                 text = file.read()
>>>>                 return encoding
>>>>         except UnicodeDecodeError:
>>>>             pass
>>>>     return "ascii"
>>>> Yes, it's potentially slow and might be wrong.
>>>> The result "ascii" might mean it's a binary file.
>>> "latin-1" will decode any sequence of bytes, so it'll never try
>>> "cp1252", nor fall back to "ascii", and falling back to "ascii" is wrong
>>> anyway because the file could contain 0x80..0xFF, which aren't supported
>>> by that encoding.
>> 
>>  Thank you! It's working for my specific application where
>>  I'm reading from a collection of text files that should be
>>  encoded in either utf_8, latin_1, or ascii.
>> 
> 
> In that case, I'd exclude ASCII from the check, and just check UTF-8,
> and if that fails, decode as Latin-1. Any ASCII files will decode
> correctly as UTF-8, and any file will decode as Latin-1.
> 
> I've used this exact fallback system when decoding raw data from
> Unicode-naive servers - they accept and share bytes, so it's entirely
> possible to have a mix of encodings in a single stream. As long as you
> can define the span of a single "unit" (say, a line, or a chunk in
> some form), you can read as bytes and do the exact same "decode as
> UTF-8 if possible, otherwise decode as Latin-1" dance. Sure, it's not
> perfectly ideal, but it's about as good as you'll get with a lot of
> US-based servers. (Depending on context, you might use CP-1252 instead
> of Latin-1, but you might need errors="replace" there, since
> Windows-1252 has some undefined byte values.)

There is a very common error on Windows where files, and especially web
pages, that claim to be utf-8 are in fact CP-1252.

There is logic in the HTML standards to try utf-8 and, if it fails, fall
back to CP-1252.

It's usually the left and right "smart" quote chars that cause the issue,
as they encode as invalid utf-8.
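
That fallback dance is short to write. A sketch (my function name):

def decode_web_bytes(data: bytes) -> str:
    # Try strict utf-8 first; fall back to CP-1252, as browsers
    # effectively do. errors="replace" because a few CP-1252 byte
    # values (e.g. 0x81, 0x8D) are undefined.
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return data.decode("cp1252", errors="replace")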

Barry
 

> 
> ChrisA


Re: tail

2022-05-08 Thread Chris Angelico
On Mon, 9 May 2022 at 04:15, Barry Scott  wrote:
>
>
>
> > On 7 May 2022, at 22:31, Chris Angelico  wrote:
> >
> > On Sun, 8 May 2022 at 07:19, Stefan Ram  wrote:
> >>
> >> MRAB  writes:
> >>> On 2022-05-07 19:47, Stefan Ram wrote:
> >> ...
> >>>> import pathlib
> >>>> 
> >>>> def encoding( name ):
> >>>>     path = pathlib.Path( name )
> >>>>     for encoding in( "utf_8", "latin_1", "cp1252" ):
> >>>>         try:
> >>>>             with path.open( encoding=encoding, errors="strict" ) as file:
> >>>>                 text = file.read()
> >>>>                 return encoding
> >>>>         except UnicodeDecodeError:
> >>>>             pass
> >>>>     return "ascii"
> >>>> Yes, it's potentially slow and might be wrong.
> >>>> The result "ascii" might mean it's a binary file.
> >>> "latin-1" will decode any sequence of bytes, so it'll never try
> >>> "cp1252", nor fall back to "ascii", and falling back to "ascii" is wrong
> >>> anyway because the file could contain 0x80..0xFF, which aren't supported
> >>> by that encoding.
> >>
> >>  Thank you! It's working for my specific application where
> >>  I'm reading from a collection of text files that should be
> >>  encoded in either utf_8, latin_1, or ascii.
> >>
> >
> > In that case, I'd exclude ASCII from the check, and just check UTF-8,
> > and if that fails, decode as Latin-1. Any ASCII files will decode
> > correctly as UTF-8, and any file will decode as Latin-1.
> >
> > I've used this exact fallback system when decoding raw data from
> > Unicode-naive servers - they accept and share bytes, so it's entirely
> > possible to have a mix of encodings in a single stream. As long as you
> > can define the span of a single "unit" (say, a line, or a chunk in
> > some form), you can read as bytes and do the exact same "decode as
> > UTF-8 if possible, otherwise decode as Latin-1" dance. Sure, it's not
> > perfectly ideal, but it's about as good as you'll get with a lot of
> > US-based servers. (Depending on context, you might use CP-1252 instead
> > of Latin-1, but you might need errors="replace" there, since
> > Windows-1252 has some undefined byte values.)
>
> There is a very common error on Windows where files, and especially web
> pages, that claim to be utf-8 are in fact CP-1252.
>
> There is logic in the HTML standards to try utf-8 and, if it fails, fall
> back to CP-1252.
>
> It's usually the left and right "smart" quote chars that cause the issue,
> as they encode as invalid utf-8.
>

Yeah, or sometimes, there isn't *anything* in UTF-8, and it has some
sort of straight-up lie in the form of a meta tag. It's annoying. But
the same logic still applies: attempt one decode (UTF-8) and if it
fails, there's one fallback. Fairly simple.

ChrisA


Re: tail

2022-05-08 Thread Barry Scott



> On 8 May 2022, at 17:05, Marco Sulla  wrote:
> 
> I think I've _almost_ found a simpler, general way:
> 
> import os
> 
> _lf = "\n"
> _cr = "\r"
> 
> def tail(filepath, n=10, newline=None, encoding=None, chunk_size=100):
>     n_chunk_size = n * chunk_size

Why use tiny chunks? You can read 4KiB as fast as 100 bytes, as that's
typically the smallest size the file system will allocate.
I tend to read in multiples of MiB as it's near instant.

>     pos = os.stat(filepath).st_size

You cannot mix the POSIX API with text mode.
pos is in bytes from the start of the file.
Text mode works in code points. bytes != code points.
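
A two-line illustration of the mismatch:

s = "naïve"
print(len(s))                  # 5 code points
print(len(s.encode("utf-8")))  # 6 bytes: the ï takes two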

>     chunk_line_pos = -1
>     lines_not_found = n
> 
>     with open(filepath, newline=newline, encoding=encoding) as f:
>         text = ""
> 
>         hard_mode = False
> 
>         if newline is None:
>             newline = _lf
>         elif newline == "":
>             hard_mode = True
> 
>         if hard_mode:
>             while pos != 0:
>                 pos -= n_chunk_size
> 
>                 if pos < 0:
>                     pos = 0
> 
>                 f.seek(pos)

In text mode you can only seek to a value returned from f.tell(), otherwise
the behaviour is undefined.

>                 text = f.read()

You have on limit on the amount of data read.

>                 lf_after = False
> 
>                 for i, char in enumerate(reversed(text)):

Simply use text.rindex('\n') or text.rfind('\n') for speed.
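
For example, a sketch (my name for it, and naive about a trailing newline):

def nth_newline_from_end(text, n):
    # Index of the start of the n-th line from the end of `text`.
    pos = len(text)
    for _ in range(n):
        pos = text.rfind("\n", 0, pos)
        if pos < 0:
            return 0       # fewer than n lines: take the whole text
    return pos + 1         # just after that newline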

>                     if char == _lf:
>                         lf_after = True
>                     elif char == _cr:
>                         lines_not_found -= 1
> 
>                         newline_size = 2 if lf_after else 1
> 
>                         lf_after = False
>                     elif lf_after:
>                         lines_not_found -= 1
>                         newline_size = 1
>                         lf_after = False
> 
>                     if lines_not_found == 0:
>                         chunk_line_pos = len(text) - 1 - i + newline_size
>                         break
> 
>                 if lines_not_found == 0:
>                     break
>         else:
>             while pos != 0:
>                 pos -= n_chunk_size
> 
>                 if pos < 0:
>                     pos = 0
> 
>                 f.seek(pos)
>                 text = f.read()
> 
>                 for i, char in enumerate(reversed(text)):
>                     if char == newline:
>                         lines_not_found -= 1
> 
>                         if lines_not_found == 0:
>                             chunk_line_pos = len(text) - 1 - i + len(newline)
>                             break
> 
>                 if lines_not_found == 0:
>                     break
> 
> 
>     if chunk_line_pos == -1:
>         chunk_line_pos = 0
> 
>     return text[chunk_line_pos:]
> 
> 
> In short: the file is always opened in text mode. The file is read from the
> end in bigger and bigger chunks, until the file is exhausted or all the
> lines are found.

It will fail if the contents are not ASCII.

> 
> Why? Because in encodings that have more than 1 byte per character, reading
> a chunk of n bytes, then reading the previous chunk, can end up splitting a
> character across the two chunks.

No it cannot. Text mode only knows how to return code points. If you were in
binary mode it could be split, but you are not in binary mode, so it cannot.

> I think one could instead read chunk by chunk and test for the
> chunk-junction problem; I suppose the code would be faster that way.
> Anyway, this trick seems quite fast as it is, and it's a lot simpler.

> The final result is read from the chunk, and not from the file, so there are
> no problems of misalignment between bytes and text. Furthermore, the builtin
> encoding parameter is used, so this should work with all the encodings
> (untested).
> 
> Furthermore, a newline parameter can be specified, as in open(). If it's
> equal to the empty string, things are a little more complicated, but I
> suppose the code is clear. It's largely untested; I only tested with a utf-8
> Linux file.
> 
> Do you think there's any chance of getting this function in as a method of
> the file object in CPython? The method for a file object opened in bytes
> mode is simpler, since there's no encoding and a newline is only \n in that
> case.

State your requirements. Then see if your implementation meets them.

Barry



Re: tail

2022-05-08 Thread MRAB

On 2022-05-08 19:15, Barry Scott wrote:




On 7 May 2022, at 22:31, Chris Angelico  wrote:

On Sun, 8 May 2022 at 07:19, Stefan Ram  wrote:


MRAB  writes:

On 2022-05-07 19:47, Stefan Ram wrote:

...

import pathlib

def encoding( name ):
    path = pathlib.Path( name )
    for encoding in( "utf_8", "latin_1", "cp1252" ):
        try:
            with path.open( encoding=encoding, errors="strict" ) as file:
                text = file.read()
                return encoding
        except UnicodeDecodeError:
            pass
    return "ascii"
Yes, it's potentially slow and might be wrong.
The result "ascii" might mean it's a binary file.

"latin-1" will decode any sequence of bytes, so it'll never try
"cp1252", nor fall back to "ascii", and falling back to "ascii" is wrong
anyway because the file could contain 0x80..0xFF, which aren't supported
by that encoding.


 Thank you! It's working for my specific application where
 I'm reading from a collection of text files that should be
 encoded in either utf_8, latin_1, or ascii.



In that case, I'd exclude ASCII from the check, and just check UTF-8,
and if that fails, decode as Latin-1. Any ASCII files will decode
correctly as UTF-8, and any file will decode as Latin-1.

I've used this exact fallback system when decoding raw data from
Unicode-naive servers - they accept and share bytes, so it's entirely
possible to have a mix of encodings in a single stream. As long as you
can define the span of a single "unit" (say, a line, or a chunk in
some form), you can read as bytes and do the exact same "decode as
UTF-8 if possible, otherwise decode as Latin-1" dance. Sure, it's not
perfectly ideal, but it's about as good as you'll get with a lot of
US-based servers. (Depending on context, you might use CP-1252 instead
of Latin-1, but you might need errors="replace" there, since
Windows-1252 has some undefined byte values.)


There is a very common error on Windows where files, and especially web
pages, that claim to be utf-8 are in fact CP-1252.

There is logic in the HTML standards to try utf-8 and, if it fails, fall
back to CP-1252.

It's usually the left and right "smart" quote chars that cause the issue,
as they encode as invalid utf-8.


Is it CP-1252 or ISO-8859-1 (Latin-1)?


Re: tail

2022-05-08 Thread Marco Sulla
On Sun, 8 May 2022 at 20:31, Barry Scott  wrote:
>
> > On 8 May 2022, at 17:05, Marco Sulla  wrote:
> >
> > def tail(filepath, n=10, newline=None, encoding=None, chunk_size=100):
> >     n_chunk_size = n * chunk_size
>
> Why use tiny chunks? You can read 4KiB as fast as 100 bytes, as that's
> typically the smallest size the file system will allocate.
> I tend to read in multiples of MiB as it's near instant.

Well, I tested on a little file, a list of my preferred pizzas, so

> >     pos = os.stat(filepath).st_size
>
> You cannot mix the POSIX API with text mode.
> pos is in bytes from the start of the file.
> Text mode works in code points. bytes != code points.
>
> >     chunk_line_pos = -1
> >     lines_not_found = n
> >
> >     with open(filepath, newline=newline, encoding=encoding) as f:
> >         text = ""
> >
> >         hard_mode = False
> >
> >         if newline is None:
> >             newline = _lf
> >         elif newline == "":
> >             hard_mode = True
> >
> >         if hard_mode:
> >             while pos != 0:
> >                 pos -= n_chunk_size
> >
> >                 if pos < 0:
> >                     pos = 0
> >
> >                 f.seek(pos)
>
> In text mode you can only seek to a value returned from f.tell(), otherwise
> the behaviour is undefined.

Why? I don't see any recommendation about it in the docs:
https://docs.python.org/3/library/io.html#io.IOBase.seek

> >                 text = f.read()
>
> You have on limit on the amount of data read.

I explained that previously. Anyway, chunk_size is small, so it's not
a great problem.

> >                 lf_after = False
> >
> >                 for i, char in enumerate(reversed(text)):
>
> Simply use text.rindex('\n') or text.rfind('\n') for speed.

I can't use them when I have to find either \n or \r. So I preferred to
simplify the code and use the for loop every time. Keep in mind anyway
that this is a prototype for a Python C API implementation (builtin I
hope, or a C extension if not).

> > In short: the file is always opened in text mode. The file is read from
> > the end in bigger and bigger chunks, until the file is exhausted or all
> > the lines are found.
>
> It will fail if the contents are not ASCII.

Why?

> > Why? Because in encodings that have more than 1 byte per character,
> > reading a chunk of n bytes, then reading the previous chunk, can end up
> > splitting a character across the two chunks.
>
> No it cannot. Text mode only knows how to return code points. If you were
> in binary mode it could be split, but you are not in binary mode, so it
> cannot.

From the docs:

seek(offset, whence=SEEK_SET)
Change the stream position to the given byte offset.

> > Do you think there's any chance of getting this function in as a method
> > of the file object in CPython? The method for a file object opened in
> > bytes mode is simpler, since there's no encoding and a newline is only \n
> > in that case.
>
> State your requirements. Then see if your implementation meets them.

The method should return the last n lines from a file object.
If the file object is in text mode, the newline parameter must be honored.
If the file object is in binary mode, a newline is always b"\n", to be
consistent with readline.

I suppose the current implementation of tail satisfies the
requirements for text mode. The previous one satisfied binary mode.

Anyway, apart from my implementation, I'm curious whether you think a tail
method is worth adding to the builtin file objects in CPython.


Re: tail

2022-05-08 Thread Chris Angelico
On Mon, 9 May 2022 at 05:49, Marco Sulla  wrote:
> Anyway, apart from my implementation, I'm curious whether you think a tail
> method is worth adding to the builtin file objects in CPython.

Absolutely not. As has been stated multiple times in this thread, a
fully general approach is extremely complicated, horrifically
unreliable, and hopelessly inefficient. The ONLY way to make this sort
of thing any good whatsoever is to know your own use-case and code to
exactly that. Given the size of files you're working with, for
instance, a simple approach of just reading the whole file would make
far more sense than the complex seeking you're doing. For reading a
multi-gigabyte file, the choices will be different.
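
For the simple case that is just a couple of lines (a sketch, assuming the
whole file fits comfortably in memory):

def tail_small(path, n=10, encoding=None, newline=None):
    # Read everything; slice off the last n lines.
    with open(path, encoding=encoding, newline=newline) as f:
        return f.readlines()[-n:]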

No, this does NOT belong in the core language.

ChrisA


Re: tail

2022-05-08 Thread Marco Sulla
On Sun, 8 May 2022 at 22:02, Chris Angelico  wrote:
>
> Absolutely not. As has been stated multiple times in this thread, a
> fully general approach is extremely complicated, horrifically
> unreliable, and hopelessly inefficient.

Well, my implementation is quite general now. It's neither complicated nor
inefficient. About reliability, I can't say anything without a test
case.

> The ONLY way to make this sort
> of thing any good whatsoever is to know your own use-case and code to
> exactly that. Given the size of files you're working with, for
> instance, a simple approach of just reading the whole file would make
> far more sense than the complex seeking you're doing. For reading a
> multi-gigabyte file, the choices will be different.

Apart from the fact that it's very, very simple to optimize for small
files: this is, IMHO, a premature optimization. The code is quite fast
even if the file is small. Can it be faster? Of course, but it depends
on the use case. Every optimization in CPython must pass the benchmark
suite test. If there's little or no gain, the optimization is usually
rejected.

> No, this does NOT belong in the core language.

I respect your opinion, but IMHO you think the task is more complicated
than it really is. It seems to me that the method can be quite simple and
fast.


Re: tail

2022-05-08 Thread Barry


> On 8 May 2022, at 20:48, Marco Sulla  wrote:
> 
> On Sun, 8 May 2022 at 20:31, Barry Scott  wrote:
>> 
>>> On 8 May 2022, at 17:05, Marco Sulla  wrote:
>>> 
>>> def tail(filepath, n=10, newline=None, encoding=None, chunk_size=100):
>>>     n_chunk_size = n * chunk_size
>> 
>> Why use tiny chunks? You can read 4KiB as fast as 100 bytes, as that's
>> typically the smallest size the file system will allocate.
>> I tend to read in multiples of MiB as it's near instant.
> 
> Well, I tested on a little file, a list of my preferred pizzas, so

Try it on a very big file.

> 
>>>     pos = os.stat(filepath).st_size
>> 
>> You cannot mix the POSIX API with text mode.
>> pos is in bytes from the start of the file.
>> Text mode works in code points. bytes != code points.
>> 
>>>     chunk_line_pos = -1
>>>     lines_not_found = n
>>> 
>>>     with open(filepath, newline=newline, encoding=encoding) as f:
>>>         text = ""
>>> 
>>>         hard_mode = False
>>> 
>>>         if newline is None:
>>>             newline = _lf
>>>         elif newline == "":
>>>             hard_mode = True
>>> 
>>>         if hard_mode:
>>>             while pos != 0:
>>>                 pos -= n_chunk_size
>>> 
>>>                 if pos < 0:
>>>                     pos = 0
>>> 
>>>                 f.seek(pos)
>> 
>> In text mode you can only seek to a value returned from f.tell(),
>> otherwise the behaviour is undefined.
> 
> Why? I don't see any recommendation about it in the docs:
> https://docs.python.org/3/library/io.html#io.IOBase.seek

What does adding 1 to a pos mean?
If it's binary it means 1 byte further down the file, but in text mode it
may need to move the position 1, 2 or 3 bytes down the file.

> 
>>>                 text = f.read()
>> 
>> You have on limit on the amount of data read.
> 
> I explained that previously. Anyway, chunk_size is small, so it's not
> a great problem.

Typo: I meant you have no limit.

You read all the data till the end of the file, which might be megabytes of data.
> 
>>>                 lf_after = False
>>> 
>>>                 for i, char in enumerate(reversed(text)):
>> 
>> Simply use text.rindex('\n') or text.rfind('\n') for speed.
> 
> I can't use them when I have to find either \n or \r. So I preferred to
> simplify the code and use the for loop every time. Keep in mind anyway
> that this is a prototype for a Python C API implementation (builtin I
> hope, or a C extension if not).
> 
>>> In short: the file is always opened in text mode. The file is read from
>>> the end in bigger and bigger chunks, until the file is exhausted or all
>>> the lines are found.
>> 
>> It will fail if the contents are not ASCII.
> 
> Why?
> 
>>> Why? Because in encodings that have more than 1 byte per character,
>>> reading a chunk of n bytes, then reading the previous chunk, can end up
>>> splitting a character across the two chunks.
>> 
>> No it cannot. Text mode only knows how to return code points. If you were
>> in binary mode it could be split, but you are not in binary mode, so it
>> cannot.
> 
> From the docs:
> 
> seek(offset, whence=SEEK_SET)
> Change the stream position to the given byte offset.
> 
>>> Do you think there's any chance of getting this function in as a method
>>> of the file object in CPython? The method for a file object opened in
>>> bytes mode is simpler, since there's no encoding and a newline is only \n
>>> in that case.
>> 
>> State your requirements. Then see if your implementation meets them.
> 
> The method should return the last n lines from a file object.
> If the file object is in text mode, the newline parameter must be honored.
> If the file object is in binary mode, a newline is always b"\n", to be
> consistent with readline.
> 
> I suppose the current implementation of tail satisfies the
> requirements for text mode. The previous one satisfied binary mode.
> 
> Anyway, apart from my implementation, I'm curious whether you think a tail
> method is worth adding to the builtin file objects in CPython.
> 



Re: tail

2022-05-08 Thread Marco Sulla
On Sun, 8 May 2022 at 22:34, Barry  wrote:
>
> > On 8 May 2022, at 20:48, Marco Sulla  wrote:
> >
> > On Sun, 8 May 2022 at 20:31, Barry Scott  wrote:
> >>
> >>> On 8 May 2022, at 17:05, Marco Sulla  wrote:
> >>>
> >>> def tail(filepath, n=10, newline=None, encoding=None, chunk_size=100):
> >>>     n_chunk_size = n * chunk_size
> >>
> >> Why use tiny chunks? You can read 4KiB as fast as 100 bytes, as that's
> >> typically the smallest size the file system will allocate.
> >> I tend to read in multiples of MiB as it's near instant.
> >
> > Well, I tested on a little file, a list of my preferred pizzas, so
>
> Try it on a very big file.

I'm not saying it's a good idea, it's only the value that I needed for my tests.
Anyway, it's not a problem with big files. The problem is with files
with long lines.

> >> In text mode you can only seek to a value returned from f.tell(),
> >> otherwise the behaviour is undefined.
> >
> > Why? I don't see any recommendation about it in the docs:
> > https://docs.python.org/3/library/io.html#io.IOBase.seek
>
> What does adding 1 to a pos mean?
> If it's binary it means 1 byte further down the file, but in text mode it
> may need to move the position 1, 2 or 3 bytes down the file.

Ehm. I re-quote:

seek(offset, whence=SEEK_SET)
Change the stream position to the given byte offset.

And so on. No mention of differences between text and binary mode.

> >> You have on limit on the amount of data read.
> >
> > I explained that previously. Anyway, chunk_size is small, so it's not
> > a great problem.
>
> Typo: I meant you have no limit.
>
> You read all the data till the end of the file, which might be megabytes of
> data.

Yes, I already explained why and how it could be optimized. I quote myself:

In short: the file is always opened in text mode. The file is read from
the end in bigger and bigger chunks, until the file is exhausted or all
the lines are found.

Why? Because in encodings that have more than 1 byte per character,
reading a chunk of n bytes, then reading the previous chunk, can end up
splitting a character across the two chunks.

I think one could instead read chunk by chunk and test for the
chunk-junction problem; I suppose the code would be faster that way.
Anyway, this trick seems quite fast as it is, and it's a lot simpler.


Re: tail

2022-05-08 Thread Cameron Simpson
On 08May2022 22:48, Marco Sulla  wrote:
>On Sun, 8 May 2022 at 22:34, Barry  wrote:
>> >> In text mode you can only seek to a value returned from f.tell(),
>> >> otherwise the behaviour is undefined.
>> >
>> > Why? I don't see any recommendation about it in the docs:
>> > https://docs.python.org/3/library/io.html#io.IOBase.seek
>>
>> What does adding 1 to a pos mean?
>> If it's binary it means 1 byte further down the file, but in text mode it
>> may need to move the position 1, 2 or 3 bytes down the file.
>
>Ehm. I re-quote:
>
>seek(offset, whence=SEEK_SET)
>Change the stream position to the given byte offset.
>
>And so on. No mention of differences between text and binary mode.

You're looking at IOBase, the _binary_ basis of low level common file 
I/O. Compare with: https://docs.python.org/3/library/io.html#io.TextIOBase.seek
The positions are "opaque numbers", which means you should not ascribe 
any deeper meaning to them except that they represent a point in the 
file. It clearly says "offset must either be a number returned by 
TextIOBase.tell(), or zero. Any other offset value produces undefined 
behaviour."

The point here is that text is a very different thing. Because you 
cannot seek to an absolute number of characters in an encoding with 
variable sized characters. _If_ you did a seek to an arbitrary number
you could end up in the middle of some character. And there are encodings
where you cannot inspect the data to find a character boundary in the 
byte stream.
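
(utf-8 is one where you *can* resynchronise, because continuation bytes
are tagged. A sketch, with my own function name:

def utf8_char_start(data: bytes, i: int) -> int:
    # Back up while data[i] is a utf-8 continuation byte (0b10xxxxxx).
    while i > 0 and (data[i] & 0xC0) == 0x80:
        i -= 1
    return i

utf-16, by contrast, gives you no such per-byte marker to hunt for.)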

Reading text files backwards is not a well defined thing without 
additional criteria:
- knowing the text file actually ended on a character boundary
- knowing how to find a character boundary
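
The only portable text-mode pattern is to remember tell() values and seek
back to them. A sketch, with a made-up file name:

with open("example.txt", encoding="utf-16") as f:
    line_starts = []
    while True:
        pos = f.tell()            # opaque, but legal to seek back to
        line = f.readline()
        if not line:
            break
        line_starts.append(pos)
    if line_starts:               # i.e. the file was not empty
        f.seek(line_starts[-1])   # back to the start of the last line
        print(f.readline())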

Cheers,
Cameron Simpson 