Re: tail
> On 7 May 2022, at 17:29, Marco Sulla wrote:
>
> On Sat, 7 May 2022 at 16:08, Barry wrote:
>> You need to handle the file in bin mode and do the handling of line endings
>> and encodings yourself. It’s not that hard for the cases you wanted.
>
"\n".encode("utf-16")
> b'\xff\xfe\n\x00'
"".encode("utf-16")
> b'\xff\xfe'
"a\nb".encode("utf-16")
> b'\xff\xfea\x00\n\x00b\x00'
"\n".encode("utf-16").lstrip("".encode("utf-16"))
> b'\n\x00'
>
> Can I use the last trick to get the encoding of a LF or a CR in any encoding?
In a word: no.

There are cases where you just have to know the encoding you are working with.
With utf-16 you have to deal with the data in 2-byte units and know whether
it is big endian or little endian.
There will be other encodings that are also difficult.

But if you are working with encodings that use ASCII as a base,
like Unicode encoded as utf-8 or the iso-8859 series, then you can just look
for NL and CR using the ASCII values of those bytes.

In short, once you set your requirements, you can know which problems
you can avoid and which you must solve.

Is utf-16 important to you? If not, there is no need to solve its issues.
Barry
--
https://mail.python.org/mailman/listinfo/python-list
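[Editorial note: Barry's point above - that a byte scan for newlines is safe for ASCII-superset encodings but not for utf-16 - can be illustrated with a small sketch. This example is not from the thread.]

```python
# Byte-scanning for newlines is safe for ASCII-superset encodings,
# but in UTF-16 a 0x0A byte is only half of a 2-byte code unit.
text = "one\ntwo\nthree"

utf8 = text.encode("utf-8")         # '\n' is the single byte 0x0A
latin1 = text.encode("iso-8859-1")  # likewise an ASCII superset
utf16 = text.encode("utf-16-le")    # '\n' is the 2-byte unit b'\n\x00'

print(utf8.count(b"\n"))    # 2 -- a plain byte scan finds the real newlines
print(latin1.count(b"\n"))  # 2
print(utf16)                # every code point is 2 bytes wide here
```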
Re: tail
I think I've _almost_ found a simpler, general way:

import os

_lf = "\n"
_cr = "\r"

def tail(filepath, n=10, newline=None, encoding=None, chunk_size=100):
    n_chunk_size = n * chunk_size
    pos = os.stat(filepath).st_size
    chunk_line_pos = -1
    lines_not_found = n

    with open(filepath, newline=newline, encoding=encoding) as f:
        text = ""

        hard_mode = False

        if newline == None:
            newline = _lf
        elif newline == "":
            hard_mode = True

        if hard_mode:
            while pos != 0:
                pos -= n_chunk_size

                if pos < 0:
                    pos = 0

                f.seek(pos)
                text = f.read()
                lf_after = False

                for i, char in enumerate(reversed(text)):
                    if char == _lf:
                        lf_after = True
                    elif char == _cr:
                        lines_not_found -= 1

                        newline_size = 2 if lf_after else 1

                        lf_after = False
                    elif lf_after:
                        lines_not_found -= 1
                        newline_size = 1
                        lf_after = False

                    if lines_not_found == 0:
                        chunk_line_pos = len(text) - 1 - i + newline_size
                        break

                if lines_not_found == 0:
                    break
        else:
            while pos != 0:
                pos -= n_chunk_size

                if pos < 0:
                    pos = 0

                f.seek(pos)
                text = f.read()

                for i, char in enumerate(reversed(text)):
                    if char == newline:
                        lines_not_found -= 1

                    if lines_not_found == 0:
                        chunk_line_pos = len(text) - 1 - i + len(newline)
                        break

                if lines_not_found == 0:
                    break

    if chunk_line_pos == -1:
        chunk_line_pos = 0

    return text[chunk_line_pos:]

Shortly, the file is always opened in text mode. The file is read at the end
in bigger and bigger chunks, until the file is finished or all the lines are
found.

Why? Because in encodings that have more than 1 byte per character, reading a
chunk of n bytes, then reading the previous chunk, can eventually split the
character between the chunks in two distinct bytes.

I think one can read chunk by chunk and test the chunk junction problem. I
suppose the code will be faster this way. Anyway, it seems that this trick is
quite fast anyway and it's a lot simpler.

The final result is read from the chunk, and not from the file, so there's no
problem of misalignment of bytes and text. Furthermore, the builtin encoding
parameter is used, so this should work with all the encodings (untested).

Furthermore, a newline parameter can be specified, as in open(). If it's equal
to the empty string, things are a little more complicated, anyway I suppose
the code is clear. It's untested too. I only tested with a utf-8 Linux file.

Do you think there's a chance to get this function as a method of the file
object in CPython? The method for a file object opened in bytes mode is
simpler, since there's no encoding and the newline is only \n in that case.
--
https://mail.python.org/mailman/listinfo/python-list
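[Editorial note: the "simpler" bytes-mode variant mentioned at the end of the message above could look roughly like the following sketch. It is not the poster's code; `tail_bytes` is a name invented here.]

```python
import os

def tail_bytes(filepath, n=10, chunk_size=4096):
    """Return the last n lines of a file as bytes. Lines end with b'\n'."""
    with open(filepath, "rb") as f:
        f.seek(0, os.SEEK_END)
        pos = f.tell()
        buf = b""
        # Read backwards in chunks until we have more than n newlines
        # (or we reach the start of the file).
        while pos > 0 and buf.count(b"\n") <= n:
            read_size = min(chunk_size, pos)
            pos -= read_size
            f.seek(pos)
            buf = f.read(read_size) + buf
        lines = buf.splitlines(keepends=True)
        return b"".join(lines[-n:])
```

Because everything is bytes, there is no encoding or newline-translation problem to solve: a newline is always the single byte b"\n", consistent with readline in binary mode.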
Re: tail
> On 7 May 2022, at 14:40, Stefan Ram wrote:
>
> Marco Sulla writes:
>> So there's no way to reliably read lines in reverse in text mode using
>> seek and read, but the only option is readlines?
>
> I think, CPython is based on C. I don't know whether
> Python's seek function directly calls C's fseek function,
> but maybe the following parts of the C standard also are
> relevant for Python?

There is the POSIX API and the C FILE API. I expect that the oddities you
describe about NUL chars are all about the FILE API.

As far as I know it's the POSIX API that CPython uses, and it does not suffer
from issues with binary files.

Barry

> |Setting the file position indicator to end-of-file, as with
> |fseek(file, 0, SEEK_END), has undefined behavior for a binary
> |stream (because of possible trailing null characters) or for
> |any stream with state-dependent encoding that does not
> |assuredly end in the initial shift state.
> from a footnote in a draft of a C standard
>
> |For a text stream, either offset shall be zero, or offset
> |shall be a value returned by an earlier successful call to
> |the ftell function on a stream associated with the same file
> |and whence shall be SEEK_SET.
> from a draft of a C standard
>
> |A text stream is an ordered sequence of characters composed
> |into lines, each line consisting of zero or more characters
> |plus a terminating new-line character. Whether the last line
> |requires a terminating new-line character is implementation-defined.
> from a draft of a C standard
>
> This might mean that reading from a text stream that is not
> ending in a new-line character might have undefined behavior
> (depending on the C implementation). In practice, it might
> mean that some things could go wrong near the end of such
> a stream.
>
> --
> https://mail.python.org/mailman/listinfo/python-list
--
https://mail.python.org/mailman/listinfo/python-list
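[Editorial note: a quick sketch of the point above - in binary mode, Python file objects support seeking relative to the end of the file directly, which is what makes a byte-oriented tail practical. The temporary file is just for illustration.]

```python
import os
import tempfile

# In binary mode, seeking relative to the end of the file is well defined.
with tempfile.NamedTemporaryFile(delete=False) as t:
    t.write(b"hello\nworld\n")
    name = t.name

with open(name, "rb") as f:
    f.seek(-6, os.SEEK_END)  # 6 bytes back from the end
    print(f.read())          # b'world\n'

os.unlink(name)
```

The same relative seek on a text-mode file raises an error, which is exactly the text-stream restriction the C standard quotes describe.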
Re: tail
> On 7 May 2022, at 22:31, Chris Angelico wrote:
>
> On Sun, 8 May 2022 at 07:19, Stefan Ram wrote:
>>
>> MRAB writes:
>>> On 2022-05-07 19:47, Stefan Ram wrote:
>> ...
>> def encoding( name ):
>>     path = pathlib.Path( name )
>>     for encoding in( "utf_8", "latin_1", "cp1252" ):
>>         try:
>>             with path.open( encoding=encoding, errors="strict" )as file:
>>                 text = file.read()
>>             return encoding
>>         except UnicodeDecodeError:
>>             pass
>>     return "ascii"
>> Yes, it's potentially slow and might be wrong.
>> The result "ascii" might mean it's a binary file.
>>> "latin-1" will decode any sequence of bytes, so it'll never try
>>> "cp1252", nor fall back to "ascii", and falling back to "ascii" is wrong
>>> anyway because the file could contain 0x80..0xFF, which aren't supported
>>> by that encoding.
>>
>> Thank you! It's working for my specific application where
>> I'm reading from a collection of text files that should be
>> encoded in either utf_8, latin_1, or ascii.
>>
>
> In that case, I'd exclude ASCII from the check, and just check UTF-8,
> and if that fails, decode as Latin-1. Any ASCII files will decode
> correctly as UTF-8, and any file will decode as Latin-1.
>
> I've used this exact fallback system when decoding raw data from
> Unicode-naive servers - they accept and share bytes, so it's entirely
> possible to have a mix of encodings in a single stream. As long as you
> can define the span of a single "unit" (say, a line, or a chunk in
> some form), you can read as bytes and do the exact same "decode as
> UTF-8 if possible, otherwise decode as Latin-1" dance. Sure, it's not
> perfectly ideal, but it's about as good as you'll get with a lot of
> US-based servers. (Depending on context, you might use CP-1252 instead
> of Latin-1, but you might need errors="replace" there, since
> Windows-1252 has some undefined byte values.)

There is a very common error on Windows that files, and especially web pages,
that claim to be utf-8 are in fact CP-1252.

There is logic in the HTML standards to try utf-8 and, if it fails, fall back
to CP-1252.

It's usually the left and "smart" quote chars that cause the issue, as they
encode as invalid utf-8.

Barry

> ChrisA
> --
> https://mail.python.org/mailman/listinfo/python-list
--
https://mail.python.org/mailman/listinfo/python-list
Re: tail
On Mon, 9 May 2022 at 04:15, Barry Scott wrote:
>
> > On 7 May 2022, at 22:31, Chris Angelico wrote:
> >
> > On Sun, 8 May 2022 at 07:19, Stefan Ram wrote:
> >>
> >> MRAB writes:
> >>> On 2022-05-07 19:47, Stefan Ram wrote:
> >> ...
> def encoding( name ):
>     path = pathlib.Path( name )
>     for encoding in( "utf_8", "latin_1", "cp1252" ):
>         try:
>             with path.open( encoding=encoding, errors="strict" )as file:
>                 text = file.read()
>             return encoding
>         except UnicodeDecodeError:
>             pass
>     return "ascii"
> Yes, it's potentially slow and might be wrong.
> The result "ascii" might mean it's a binary file.
> >>> "latin-1" will decode any sequence of bytes, so it'll never try
> >>> "cp1252", nor fall back to "ascii", and falling back to "ascii" is wrong
> >>> anyway because the file could contain 0x80..0xFF, which aren't supported
> >>> by that encoding.
> >>
> >> Thank you! It's working for my specific application where
> >> I'm reading from a collection of text files that should be
> >> encoded in either utf_8, latin_1, or ascii.
> >>
> >
> > In that case, I'd exclude ASCII from the check, and just check UTF-8,
> > and if that fails, decode as Latin-1. Any ASCII files will decode
> > correctly as UTF-8, and any file will decode as Latin-1.
> >
> > I've used this exact fallback system when decoding raw data from
> > Unicode-naive servers - they accept and share bytes, so it's entirely
> > possible to have a mix of encodings in a single stream. As long as you
> > can define the span of a single "unit" (say, a line, or a chunk in
> > some form), you can read as bytes and do the exact same "decode as
> > UTF-8 if possible, otherwise decode as Latin-1" dance. Sure, it's not
> > perfectly ideal, but it's about as good as you'll get with a lot of
> > US-based servers. (Depending on context, you might use CP-1252 instead
> > of Latin-1, but you might need errors="replace" there, since
> > Windows-1252 has some undefined byte values.)
>
> There is a very common error on Windows that files, and especially web pages,
> that claim to be utf-8 are in fact CP-1252.
>
> There is logic in the HTML standards to try utf-8 and, if it fails, fall back
> to CP-1252.
>
> It's usually the left and "smart" quote chars that cause the issue, as they
> encode as invalid utf-8.

Yeah, or sometimes, there isn't *anything* in UTF-8, and it has some sort of
straight-up lie in the form of a meta tag. It's annoying. But the same logic
still applies: attempt one decode (UTF-8) and if it fails, there's one
fallback. Fairly simple.

ChrisA
--
https://mail.python.org/mailman/listinfo/python-list
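[Editorial note: the "decode as UTF-8 if possible, otherwise decode as Latin-1" dance discussed above can be sketched in a few lines. `decode_lenient` is a name invented here, not an API from the thread.]

```python
def decode_lenient(data: bytes) -> str:
    """Try strict UTF-8 first; Latin-1 accepts any byte sequence as a fallback."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return data.decode("latin-1")

print(decode_lenient("café".encode("utf-8")))  # café (valid UTF-8)
print(decode_lenient(b"caf\xe9"))              # café (0xE9 is invalid UTF-8, so Latin-1)
```

Swapping "latin-1" for "cp1252" gives the HTML-style fallback Barry mentions, but then errors="replace" may be needed, since CP-1252 leaves a few byte values undefined.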
Re: tail
> On 8 May 2022, at 17:05, Marco Sulla wrote:
>
> I think I've _almost_ found a simpler, general way:
>
> import os
>
> _lf = "\n"
> _cr = "\r"
>
> def tail(filepath, n=10, newline=None, encoding=None, chunk_size=100):
>    n_chunk_size = n * chunk_size
Why use tiny chunks? You can read 4KiB as fast as 100 bytes, as that is
typically the smallest size the file system will allocate.
I tend to read in multiples of MiB as it's near instant.
>    pos = os.stat(filepath).st_size
You cannot mix the POSIX API with text mode.
pos is in bytes from the start of the file.
Text mode will be in code points. bytes != code points.
>    chunk_line_pos = -1
>    lines_not_found = n
>
>    with open(filepath, newline=newline, encoding=encoding) as f:
>        text = ""
>
>        hard_mode = False
>
>        if newline == None:
>            newline = _lf
>        elif newline == "":
>            hard_mode = True
>
>        if hard_mode:
>            while pos != 0:
>                pos -= n_chunk_size
>
>                if pos < 0:
>                    pos = 0
>
>                f.seek(pos)
In text mode you can only seek to a value returned from f.tell(), otherwise
the behaviour is undefined.
>                text = f.read()
You have on limit on the amount of data read.
>                lf_after = False
>
>                for i, char in enumerate(reversed(text)):
Simply use text.rindex('\n') or text.rfind('\n') for speed.
>                    if char == _lf:
>                        lf_after = True
>                    elif char == _cr:
>                        lines_not_found -= 1
>
>                        newline_size = 2 if lf_after else 1
>
>                        lf_after = False
>                    elif lf_after:
>                        lines_not_found -= 1
>                        newline_size = 1
>                        lf_after = False
>
>                    if lines_not_found == 0:
>                        chunk_line_pos = len(text) - 1 - i + newline_size
>                        break
>
>                if lines_not_found == 0:
>                    break
>        else:
>            while pos != 0:
>                pos -= n_chunk_size
>
>                if pos < 0:
>                    pos = 0
>
>                f.seek(pos)
>                text = f.read()
>
>                for i, char in enumerate(reversed(text)):
>                    if char == newline:
>                        lines_not_found -= 1
>
>                    if lines_not_found == 0:
>                        chunk_line_pos = len(text) - 1 - i + len(newline)
>                        break
>
>                if lines_not_found == 0:
>                    break
>
>    if chunk_line_pos == -1:
>        chunk_line_pos = 0
>
>    return text[chunk_line_pos:]
>
>
> Shortly, the file is always opened in text mode. File is read at the end in
> bigger and bigger chunks, until the file is finished or all the lines are
> found.
It will fail if the contents are not ASCII.
>
> Why? Because in encodings that have more than 1 byte per character, reading
> a chunk of n bytes, then reading the previous chunk, can eventually split
> the character between the chunks in two distinct bytes.
No, it cannot. Text mode only knows how to return code points. Now if you were
in binary mode it could be split, but you are not in binary mode, so it cannot.
> I think one can read chunk by chunk and test the chunk junction problem. I
> suppose the code will be faster this way. Anyway, it seems that this trick
> is quite fast anyway and it's a lot simpler.
> The final result is read from the chunk, and not from the file, so there's
> no problems of misalignment of bytes and text. Furthermore, the builtin
> encoding parameter is used, so this should work with all the encodings
> (untested).
>
> Furthermore, a newline parameter can be specified, as in open(). If it's
> equal to the empty string, the things are a little more complicated, anyway
> I suppose the code is clear. It's untested too. I only tested with an utf8
> linux file.
>
> Do you think there are chances to get this function as a method of the file
> object in CPython? The method for a file object opened in bytes mode is
> simpler, since there's no encoding and newline is only \n in that case.
State your requirements. Then see if your implementation meets them.
Barry
> --
> https://mail.python.org/mailman/listinfo/python-list
>
--
https://mail.python.org/mailman/listinfo/python-list
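[Editorial note: Barry's rule above - in text mode, seek only to values previously returned by tell(), or to 0 - can be demonstrated with a small sketch. The temporary file is just for illustration.]

```python
import os
import tempfile

# In text mode, tell() returns an opaque cookie; seeking back to it is the
# only portable way to return to a known position.
with tempfile.NamedTemporaryFile("w", delete=False, encoding="utf-8") as t:
    t.write("àèì\nsecond line\n")
    name = t.name

with open(name, encoding="utf-8") as f:
    first = f.readline()   # "àèì\n" -- 4 code points, but 7 bytes in UTF-8
    cookie = f.tell()      # opaque position after the first line
    rest = f.read()
    f.seek(cookie)         # legal: a value that tell() returned
    print(f.read() == rest)  # True

os.unlink(name)
```

Note that the cookie is already past the first line's 4 code points precisely because multi-byte characters make "character offset" and "byte offset" different things.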
Re: tail
On 2022-05-08 19:15, Barry Scott wrote:
>> On 7 May 2022, at 22:31, Chris Angelico wrote:
>>
>> On Sun, 8 May 2022 at 07:19, Stefan Ram wrote:
>>>
>>> MRAB writes:
>>>> On 2022-05-07 19:47, Stefan Ram wrote:
>>> ...
>>> def encoding( name ):
>>>     path = pathlib.Path( name )
>>>     for encoding in( "utf_8", "latin_1", "cp1252" ):
>>>         try:
>>>             with path.open( encoding=encoding, errors="strict" )as file:
>>>                 text = file.read()
>>>             return encoding
>>>         except UnicodeDecodeError:
>>>             pass
>>>     return "ascii"
>>> Yes, it's potentially slow and might be wrong. The result "ascii" might mean it's a binary file.
>>>> "latin-1" will decode any sequence of bytes, so it'll never try "cp1252", nor fall back to "ascii", and falling back to "ascii" is wrong anyway because the file could contain 0x80..0xFF, which aren't supported by that encoding.
>>>
>>> Thank you! It's working for my specific application where I'm reading from a collection of text files that should be encoded in either utf_8, latin_1, or ascii.
>>
>> In that case, I'd exclude ASCII from the check, and just check UTF-8, and if that fails, decode as Latin-1. Any ASCII files will decode correctly as UTF-8, and any file will decode as Latin-1.
>>
>> I've used this exact fallback system when decoding raw data from Unicode-naive servers - they accept and share bytes, so it's entirely possible to have a mix of encodings in a single stream. As long as you can define the span of a single "unit" (say, a line, or a chunk in some form), you can read as bytes and do the exact same "decode as UTF-8 if possible, otherwise decode as Latin-1" dance. Sure, it's not perfectly ideal, but it's about as good as you'll get with a lot of US-based servers. (Depending on context, you might use CP-1252 instead of Latin-1, but you might need errors="replace" there, since Windows-1252 has some undefined byte values.)
>
> There is a very common error on Windows that files, and especially web pages, that claim to be utf-8 are in fact CP-1252.
>
> There is logic in the HTML standards to try utf-8 and, if it fails, fall back to CP-1252.
>
> It's usually the left and "smart" quote chars that cause the issue, as they encode as invalid utf-8.

Is it CP-1252 or ISO-8859-1 (Latin-1)?
--
https://mail.python.org/mailman/listinfo/python-list
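[Editorial note: the difference behind MRAB's question above is easy to show concretely. Byte 0x93 is a left "smart" quote in CP-1252, a C1 control character in ISO-8859-1 (Latin-1), and invalid as a UTF-8 start byte.]

```python
smart = b"\x93quoted\x94"  # CP-1252 smart quotes around "quoted"

print(smart.decode("cp1252"))   # “quoted” -- real curly quotes
print(repr(smart.decode("latin-1")))  # '\x93quoted\x94' -- C1 controls, not quotes

try:
    smart.decode("utf-8")
except UnicodeDecodeError:
    print("not valid UTF-8")   # 0x93 is a lone continuation byte
```

This is why mislabelled CP-1252 pages fail UTF-8 decoding on exactly the smart-quote characters, as noted earlier in the thread.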
Re: tail
On Sun, 8 May 2022 at 20:31, Barry Scott wrote:
>
> > On 8 May 2022, at 17:05, Marco Sulla wrote:
> >
> > def tail(filepath, n=10, newline=None, encoding=None, chunk_size=100):
> >    n_chunk_size = n * chunk_size
>
> Why use tiny chunks? You can read 4KiB as fast as 100 bytes, as that is
> typically the smallest size the file system will allocate.
> I tend to read in multiples of MiB as it's near instant.
Well, I tested on a little file, a list of my preferred pizzas, so
> >    pos = os.stat(filepath).st_size
>
> You cannot mix the POSIX API with text mode.
> pos is in bytes from the start of the file.
> Text mode will be in code points. bytes != code points.
>
> >    chunk_line_pos = -1
> >    lines_not_found = n
> >
> >    with open(filepath, newline=newline, encoding=encoding) as f:
> >        text = ""
> >
> >        hard_mode = False
> >
> >        if newline == None:
> >            newline = _lf
> >        elif newline == "":
> >            hard_mode = True
> >
> >        if hard_mode:
> >            while pos != 0:
> >                pos -= n_chunk_size
> >
> >                if pos < 0:
> >                    pos = 0
> >
> >                f.seek(pos)
>
> In text mode you can only seek to a value returned from f.tell(), otherwise
> the behaviour is undefined.
Why? I don't see any recommendation about it in the docs:
https://docs.python.org/3/library/io.html#io.IOBase.seek
> >                text = f.read()
>
> You have on limit on the amount of data read.
I explained that previously. Anyway, chunk_size is small, so it's not
a great problem.
> >                lf_after = False
> >
> >                for i, char in enumerate(reversed(text)):
>
> Simple use text.rindex('\n') or text.rfind('\n') for speed.
I can't use them when I have to find both \n and \r. So I preferred to
simplify the code and use the for loop every time. Bear in mind
anyway that this is a prototype for a Python C API implementation
(builtin I hope, or a C extension if not).
> > Shortly, the file is always opened in text mode. File is read at the end in
> > bigger and bigger chunks, until the file is finished or all the lines are
> > found.
>
> It will fail if the contents is not ASCII.
Why?
> > Why? Because in encodings that have more than 1 byte per character, reading
> > a chunk of n bytes, then reading the previous chunk, can eventually split
> > the character between the chunks in two distinct bytes.
>
> No it cannot. text mode only knows how to return code points. Now if you are
> in
> binary it could be split, but you are not in binary mode so it cannot.
From the docs:
seek(offset, whence=SEEK_SET)
Change the stream position to the given byte offset.
> > Do you think there are chances to get this function as a method of the file
> > object in CPython? The method for a file object opened in bytes mode is
> > simpler, since there's no encoding and newline is only \n in that case.
>
> State your requirements. Then see if your implementation meets them.
The method should return the last n lines from a file object.
If the file object is in text mode, the newline parameter must be honored.
If the file object is in binary mode, a newline is always b"\n", to be
consistent with readline.
I suppose the current implementation of tail satisfies the
requirements for text mode. The previous one satisfied binary mode.
Anyway, apart from my implementation, I'm curious if you think a tail
method is worth it to be a method of the builtin file objects in
CPython.
--
https://mail.python.org/mailman/listinfo/python-list
Re: tail
On Mon, 9 May 2022 at 05:49, Marco Sulla wrote:
> Anyway, apart from my implementation, I'm curious if you think a tail
> method is worth it to be a method of the builtin file objects in
> CPython.

Absolutely not. As has been stated multiple times in this thread, a fully
general approach is extremely complicated, horrifically unreliable, and
hopelessly inefficient. The ONLY way to make this sort of thing any good
whatsoever is to know your own use-case and code to exactly that.

Given the size of files you're working with, for instance, a simple approach
of just reading the whole file would make far more sense than the complex
seeking you're doing. For reading a multi-gigabyte file, the choices will be
different.

No, this does NOT belong in the core language.

ChrisA
--
https://mail.python.org/mailman/listinfo/python-list
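[Editorial note: for the small-file case Chris describes, "just read the whole file" is essentially a one-liner with collections.deque. This sketch is not from the thread; `tail_simple` is a name invented here.]

```python
from collections import deque

def tail_simple(filepath, n=10, encoding=None):
    """Stream the whole file once, keeping only the last n lines."""
    with open(filepath, encoding=encoding) as f:
        return "".join(deque(f, maxlen=n))
```

Because deque with maxlen=n discards older lines as it goes, memory stays bounded by the last n lines even though the whole file is read.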
Re: tail
On Sun, 8 May 2022 at 22:02, Chris Angelico wrote:
>
> Absolutely not. As has been stated multiple times in this thread, a
> fully general approach is extremely complicated, horrifically
> unreliable, and hopelessly inefficient.

Well, my implementation is quite general now. It's not complicated and
inefficient. About reliability, I can't say anything without a test case.

> The ONLY way to make this sort
> of thing any good whatsoever is to know your own use-case and code to
> exactly that. Given the size of files you're working with, for
> instance, a simple approach of just reading the whole file would make
> far more sense than the complex seeking you're doing. For reading a
> multi-gigabyte file, the choices will be different.

Apart from the fact that it's very, very simple to optimize for small files:
this is, IMHO, a premature optimization. The code is quite fast even if the
file is small. Can it be faster? Of course, but it depends on the use case.
Every optimization in CPython must pass the benchmark suite test. If there's
little or no gain, the optimization is usually rejected.

> No, this does NOT belong in the core language.

I respect your opinion, but IMHO you think that the task is more complicated
than the reality. It seems to me that the method can be quite simple and fast.
--
https://mail.python.org/mailman/listinfo/python-list
Re: tail
> On 8 May 2022, at 20:48, Marco Sulla wrote:
>
> On Sun, 8 May 2022 at 20:31, Barry Scott wrote:
>>
>>> On 8 May 2022, at 17:05, Marco Sulla wrote:
>>>
>>> def tail(filepath, n=10, newline=None, encoding=None, chunk_size=100):
>>>     n_chunk_size = n * chunk_size
>>
>> Why use tiny chunks? You can read 4KiB as fast as 100 bytes, as that is
>> typically the smallest size the file system will allocate.
>> I tend to read in multiples of MiB as it's near instant.
>
> Well, I tested on a little file, a list of my preferred pizzas, so
Try it on a very big file.
>
>>>     pos = os.stat(filepath).st_size
>>
>> You cannot mix the POSIX API with text mode.
>> pos is in bytes from the start of the file.
>> Text mode will be in code points. bytes != code points.
>>
>>>     chunk_line_pos = -1
>>>     lines_not_found = n
>>>
>>>     with open(filepath, newline=newline, encoding=encoding) as f:
>>>         text = ""
>>>
>>>         hard_mode = False
>>>
>>>         if newline == None:
>>>             newline = _lf
>>>         elif newline == "":
>>>             hard_mode = True
>>>
>>>         if hard_mode:
>>>             while pos != 0:
>>>                 pos -= n_chunk_size
>>>
>>>                 if pos < 0:
>>>                     pos = 0
>>>
>>>                 f.seek(pos)
>>
>> In text mode you can only seek to a value returned from f.tell(), otherwise
>> the behaviour is undefined.
>
> Why? I don't see any recommendation about it in the docs:
> https://docs.python.org/3/library/io.html#io.IOBase.seek
What does adding 1 to a pos mean?
If it's binary, it means 1 byte further down the file, but in text mode it may
need to move the position 1, 2 or 3 bytes down the file.
>
>>>                 text = f.read()
>>
>> You have on limit on the amount of data read.
>
> I explained that previously. Anyway, chunk_size is small, so it's not
> a great problem.
Typo: I meant you have no limit.
You read all the data till the end of the file, which might be megabytes of data.
>
>>>                 lf_after = False
>>>
>>>                 for i, char in enumerate(reversed(text)):
>>
>> Simply use text.rindex('\n') or text.rfind('\n') for speed.
>
> I can't use them when I have to find both \n and \r. So I preferred to
> simplify the code and use the for loop every time. Bear in mind
> anyway that this is a prototype for a Python C API implementation
> (builtin I hope, or a C extension if not)
>
>>> Shortly, the file is always opened in text mode. File is read at the end in
>>> bigger and bigger chunks, until the file is finished or all the lines are
>>> found.
>>
>> It will fail if the contents are not ASCII.
>
> Why?
>
>>> Why? Because in encodings that have more than 1 byte per character, reading
>>> a chunk of n bytes, then reading the previous chunk, can eventually split
>>> the character between the chunks in two distinct bytes.
>>
>> No, it cannot. Text mode only knows how to return code points. Now if you
>> were in binary mode it could be split, but you are not in binary mode, so
>> it cannot.
>
> From the docs:
>
> seek(offset, whence=SEEK_SET)
> Change the stream position to the given byte offset.
>
>>> Do you think there are chances to get this function as a method of the file
>>> object in CPython? The method for a file object opened in bytes mode is
>>> simpler, since there's no encoding and newline is only \n in that case.
>>
>> State your requirements. Then see if your implementation meets them.
>
> The method should return the last n lines from a file object.
> If the file object is in text mode, the newline parameter must be honored.
> If the file object is in binary mode, a newline is always b"\n", to be
> consistent with readline.
>
> I suppose the current implementation of tail satisfies the
> requirements for text mode. The previous one satisfied binary mode.
>
> Anyway, apart from my implementation, I'm curious if you think a tail
> method is worth it to be a method of the builtin file objects in
> CPython.
>
--
https://mail.python.org/mailman/listinfo/python-list
Re: tail
On Sun, 8 May 2022 at 22:34, Barry wrote:
>
> > On 8 May 2022, at 20:48, Marco Sulla wrote:
> >
> > On Sun, 8 May 2022 at 20:31, Barry Scott wrote:
> >>
> >>> On 8 May 2022, at 17:05, Marco Sulla wrote:
> >>>
> >>> def tail(filepath, n=10, newline=None, encoding=None, chunk_size=100):
> >>>     n_chunk_size = n * chunk_size
> >>
> >> Why use tiny chunks? You can read 4KiB as fast as 100 bytes, as that is
> >> typically the smallest size the file system will allocate.
> >> I tend to read in multiples of MiB as it's near instant.
> >
> > Well, I tested on a little file, a list of my preferred pizzas, so
>
> Try it on a very big file.

I'm not saying it's a good idea, it's only the value that I needed for my
tests. Anyway, it's not a problem with big files. The problem is with files
with long lines.

> >> In text mode you can only seek to a value returned from f.tell(),
> >> otherwise the behaviour is undefined.
> >
> > Why? I don't see any recommendation about it in the docs:
> > https://docs.python.org/3/library/io.html#io.IOBase.seek
>
> What does adding 1 to a pos mean?
> If it's binary, it means 1 byte further down the file, but in text mode it
> may need to move the position 1, 2 or 3 bytes down the file.

Emh. I re-quote:

seek(offset, whence=SEEK_SET)
Change the stream position to the given byte offset.

And so on. No mention of differences between text and binary mode.

> >> You have on limit on the amount of data read.
> >
> > I explained that previously. Anyway, chunk_size is small, so it's not
> > a great problem.
>
> Typo: I meant you have no limit.
>
> You read all the data till the end of the file, which might be megabytes
> of data.

Yes, I already explained why and how it could be optimized. I quote myself:

Shortly, the file is always opened in text mode. The file is read at the end
in bigger and bigger chunks, until the file is finished or all the lines are
found. Why? Because in encodings that have more than 1 byte per character,
reading a chunk of n bytes, then reading the previous chunk, can eventually
split the character between the chunks in two distinct bytes.

I think one can read chunk by chunk and test the chunk junction problem. I
suppose the code will be faster this way. Anyway, it seems that this trick is
quite fast anyway and it's a lot simpler.
--
https://mail.python.org/mailman/listinfo/python-list
Re: tail
On 08May2022 22:48, Marco Sulla wrote:
>On Sun, 8 May 2022 at 22:34, Barry wrote:
>> >> In text mode you can only seek to a value returned from f.tell(),
>> >> otherwise the behaviour is undefined.
>> >
>> > Why? I don't see any recommendation about it in the docs:
>> > https://docs.python.org/3/library/io.html#io.IOBase.seek
>>
>> What does adding 1 to a pos mean?
>> If it's binary, it means 1 byte further down the file, but in text mode it
>> may need to move the position 1, 2 or 3 bytes down the file.
>
>Emh. I re-quote
>
>seek(offset, whence=SEEK_SET)
>Change the stream position to the given byte offset.
>
>And so on. No mention of differences between text and binary mode.

You're looking at IOBase, the _binary_ basis of low level common file I/O.
Compare with:

https://docs.python.org/3/library/io.html#io.TextIOBase.seek

The positions are "opaque numbers", which means you should not ascribe any
deeper meaning to them except that they represent a point in the file. It
clearly says "offset must either be a number returned by TextIOBase.tell(),
or zero. Any other offset value produces undefined behaviour."

The point here is that text is a very different thing. Because you cannot
seek to an absolute number of characters in an encoding with variable sized
characters. _If_ you did a seek to an arbitrary number you can end up in the
middle of some character. And there are encodings where you cannot inspect
the data to find a character boundary in the byte stream.

Reading text files backwards is not a well defined thing without additional
criteria:
- knowing the text file actually ended on a character boundary
- knowing how to find a character boundary

Cheers,
Cameron Simpson
--
https://mail.python.org/mailman/listinfo/python-list
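[Editorial note: Cameron's last point - that in some encodings you cannot find a character boundary by inspecting bytes - can be seen with a one-liner. In UTF-16 a 0x0A byte may be half of an ordinary character rather than a newline.]

```python
# U+010A is LATIN CAPITAL LETTER C WITH DOT ABOVE; its UTF-16-LE bytes
# start with 0x0A, the same byte value as an ASCII newline.
data = "Ċ\n".encode("utf-16-le")
print(data)               # b'\n\x01\n\x00'
print(data.count(b"\n"))  # 2 byte-level hits, but only one real newline
```

So a byte scan for b"\n" in UTF-16 text finds false positives, and splitting the stream at such an offset lands in the middle of a code unit.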
