Re: windows utf8 & lxml

2016-12-27 Thread Steve D'Aprano
On Tue, 20 Dec 2016 10:53 pm, Sayth Renshaw wrote:

> content.read().encode('utf-8'), parser=utf8_parser)
> 
> However doing it in such a fashion returns this error:
> 
> UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0:
> invalid start byte


That tells you that the XML file you have is not actually UTF-8.

You have a file that begins with a byte 0xFF. That is invalid UTF-8. No
valid UTF-8 string contains the byte 0xFF.

https://en.wikipedia.org/wiki/UTF-8#Invalid_byte_sequences

So you need to consider:

- Are you sure that the input file is intended to be UTF-8? How was it
created? 

- Is the second byte 0xFE? If so, that suggests that you actually have
UTF-16 with a byte-order mark.
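
A quick way to check the second point is to look at the first few bytes of the file. The following is only a sketch: `sniff_bom` is a hypothetical helper name, not something from lxml or the original post, and it relies solely on the BOM constants in the standard `codecs` module. It returns None when no BOM is present, which says nothing about the actual encoding.

```python
import codecs

# Longer BOMs are listed first, because the UTF-32 LE BOM begins with
# the same two bytes (0xFF 0xFE) as the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(first_bytes):
    """Return a codec name suggested by a leading BOM, or None."""
    for bom, name in _BOMS:
        if first_bytes.startswith(bom):
            return name
    return None
```

Reading the first four bytes of the file in binary mode and passing them to the helper is enough to tell whether the 0xFF at position 0 is the start of a UTF-16 byte-order mark.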





-- 
Steve
“Cheer up,” they said, “things could be worse.” So I cheered up, and sure
enough, things got worse.

-- 
https://mail.python.org/mailman/listinfo/python-list


ctypes, memory mapped files and context manager

2016-12-27 Thread Hans-Peter Jansen
Hi,

I'm using the combination named in the subject successfully in a project for 
creating/iterating over huge binary files (> 5GB) with impressive performance, 
while resource usage stays pretty low, all with plain Python3 code. Nice!

Environment: (Python 3.4.5, Linux 4.8.14, openSUSE/x86_64, NFS4 and XFS 
filesystems)

The idea is: map a ctypes structure onto the file at a certain offset, act on 
the structure, and release the mapping. The latter is necessary for keeping 
the mmap file properly resizable and closable (due to the nature of mmaps and 
Python's posix implementation thereof). Hence, a context manager serves us 
well (in theory). 

Here's some code excerpt:

class cstructmap:
    def __init__(self, cstruct, mm, offset = 0):
        self._cstruct = cstruct
        self._mm = mm
        self._offset = offset
        self._csinst = None

    def __enter__(self):
        # resize the mmap (and backing file), if structure exceeds mmap size
        # mmap size must be aligned to mmap.PAGESIZE
        cssize = ctypes.sizeof(self._cstruct)
        if self._offset + cssize > self._mm.size():
            newsize = align(self._offset + cssize, mmap.PAGESIZE)
            self._mm.resize(newsize)
        self._csinst = self._cstruct.from_buffer(self._mm, self._offset)
        return self._csinst

    def __exit__(self, exc_type, exc_value, exc_traceback):
        # free all references into mmap
        del self._csinst
        self._csinst = None


def work():
    with cstructmap(ItemHeader, self._mm, self._offset) as ih:
        ih.identifier = ItemHeader.Identifier
        ih.length = ItemHeaderSize + datasize

    blktype = ctypes.c_char * datasize
    with cstructmap(blktype, self._mm, self._offset) as blk:
        blk.raw = data


In practice, this results in:

Traceback (most recent call last):
  File "ctypes_mmap_ctx.py", line 146, in <module>
    mf.add_data(data)
  File "ctypes_mmap_ctx.py", line 113, in add_data
    with cstructmap(blktype, self._mm, self._offset) as blk:
  File "ctypes_mmap_ctx.py", line 42, in __enter__
    self._mm.resize(newsize)
BufferError: mmap can't resize with extant buffers exported.

The issue: when creating a mapping via context manager, we assign a local 
variable (with ..), that keeps existing in the local context, even when the 
manager context was left. This keeps a reference on the ctypes mapped area 
alive, even if we try everything to destroy it in __exit__. We have to del the 
with var manually.

Now, I want to get rid of the ugly and error-prone del statements.

What is needed, is a ctypes operation, that removes the mapping actively, and 
that could be added to the __exit__ part of the context manager.
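
The lifetime problem can be shown without ctypes or mmap at all. The sketch below is an illustration under assumptions: `Payload` and `Manager` are hypothetical stand-ins for the mapped structure and for cstructmap, and the result relies on CPython's reference counting freeing objects immediately.

```python
import weakref


class Payload:
    """Stand-in for the ctypes structure mapped onto the mmap."""


class Manager:
    def __enter__(self):
        self._obj = Payload()
        return self._obj

    def __exit__(self, exc_type, exc_value, exc_traceback):
        # Drops only the manager's own reference to the payload.
        self._obj = None
        return False


def demo():
    with Manager() as p:
        watcher = weakref.ref(p)
    # The name bound by "as" is still alive after the block has ended.
    alive_after_with = watcher() is not None
    del p
    # Only now has the last strong reference gone.
    alive_after_del = watcher() is not None
    return alive_after_with, alive_after_del
```

Under CPython, demo() reports that the object outlives the with block and disappears only after the explicit del, which is the situation that keeps the buffer exported.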

Full working code example: 
https://gist.github.com/frispete/97c27e24a0aae1bcaf1375e2e463d239

The script creates a memory mapped file in the current directory named 
"mapfile". When started without arguments, it copies itself into this file, 
until 10 * mmap.PAGESIZE growth is reached (or it errored out before..).

If you change NOPROB to True, it will actively destruct the context manager 
vars, and should work as advertised.

Any ideas are much appreciated.

Thanks in advance,
Pete



encoding="utf8" ignored when parsing XML

2016-12-27 Thread Skip Montanaro
I am trying to parse some XML which doesn't specify an encoding (Python 2.7.12 
via Anaconda on RH Linux), so it barfs when it encounters non-ASCII data. No 
great surprise there, but I'm having trouble getting it to use another 
encoding. First, I tried specifying the encoding when opening the file:

f = io.open(fname, encoding="utf8")
root = xml.etree.ElementTree.parse(f).getroot()

but that had no effect. Then, when calling xml.etree.ElementTree.parse I 
included an XMLParser object:

parser = xml.etree.ElementTree.XMLParser(encoding="utf8")
root = xml.etree.ElementTree.parse(f, parser=parser).getroot()

Same-o, same-o:

unicode error 'ascii' codec can't encode characters in position 1113-1116: 
ordinal not in range(128)

So, why does it continue to insist on using an ASCII codec? My locale's 
preferred encoding is:

>>> locale.getpreferredencoding()
'ANSI_X3.4-1968'

which I presume is the official way to spell "ascii".
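
That presumption can be checked with the standard `codecs` module, which resolves encoding aliases to a canonical codec name. A minimal sketch (`canonical_codec_name` is a hypothetical helper name used here for illustration only):

```python
import codecs


def canonical_codec_name(label):
    """Resolve an encoding label or alias to Python's canonical codec name."""
    return codecs.lookup(label).name


if __name__ == "__main__":
    print(canonical_codec_name("ANSI_X3.4-1968"))  # prints: ascii
```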

The chardetect command (part of the chardet package) tells me it looks like 
utf8 with high confidence:

% chardetect < ~/tmp/trash
: utf-8 with confidence 0.99

I took a look at the code, and tracked the encoding I specified all the way 
down to the creation of the expat parser. What am I missing?

Skip


Re: encoding="utf8" ignored when parsing XML

2016-12-27 Thread Peter Otten
Skip Montanaro wrote:

> I am trying to parse some XML which doesn't specify an encoding (Python
> 2.7.12 via Anaconda on RH Linux), so it barfs when it encounters non-ASCII
> data. No great surprise there, but I'm having trouble getting it to use
> another encoding. First, I tried specifying the encoding when opening the
> file:
> 
> f = io.open(fname, encoding="utf8")
> root = xml.etree.ElementTree.parse(f).getroot()
> 
> but that had no effect. 

Isn't UTF-8 the default?

Try opening the file in binary mode then:

with io.open(fname, "rb") as f:
    root = xml.etree.ElementTree.parse(f).getroot()




Re: encoding="utf8" ignored when parsing XML

2016-12-27 Thread Skip Montanaro
Peter> Isn't UTF-8 the default?

Apparently not. I believe in my reading it said that it used whatever
locale.getpreferredencoding() returned. That's problematic when you
live in a country that thinks ASCII is everything. Personally, I think
UTF-8 should be the default, but that train's long left the station,
at least for Python 2.x.

> Try opening the file in binary mode then:
>
> with io.open(fname, "rb") as f:
>     root = xml.etree.ElementTree.parse(f).getroot()

Thanks, that worked. Would appreciate an explanation of why binary
mode was necessary. It would seem that since the file contents are
text, just in a non-ASCII encoding, that specifying the encoding when
opening the file should do the trick.

Skip


Re: encoding="utf8" ignored when parsing XML

2016-12-27 Thread Peter Otten
Skip Montanaro wrote:

> Peter> Isn't UTF-8 the default?
> 
> Apparently not. 

Sorry, I meant the default for XML.

> I believe in my reading it said that it used whatever
> locale.getpreferredencoding() returned. That's problematic when you
> live in a country that thinks ASCII is everything. Personally, I think
> UTF-8 should be the default, but that train's long left the station,
> at least for Python 2.x.
> 
>> Try opening the file in binary mode then:
>>
>> with io.open(fname, "rb") as f:
>>     root = xml.etree.ElementTree.parse(f).getroot()
> 
> Thanks, that worked. Would appreciate an explanation of why binary
> mode was necessary. It would seem that since the file contents are
> text, just in a non-ASCII encoding, that specifying the encoding when
> opening the file should do the trick.
> 
> Skip

My tentative explanation would be: If you open the file as text it will be 
successfully decoded, i.e.

io.open(fname, encoding="UTF-8").read()

works, but to go back to the bytes that the XML parser needs, the "preferred 
encoding", in your case ASCII, will be used. 

Since there are non-ASCII characters you get a UnicodeEncodeError.




Re: encoding="utf8" ignored when parsing XML

2016-12-27 Thread Peter Otten
Peter Otten wrote:

> works, but to go back to the bytes that the XML parser needs the
> "preferred encoding", in your case ASCII, will be used.

Correction: it's probably sys.getdefaultencoding() rather than 
locale.getpreferredencoding(). So all systems with a sane configuration will 
behave the same way as yours.



Re: Parse a Wireshark pcap file

2016-12-27 Thread 1991manish . kumar

I have a pcap file, and I want to parse it and fetch some information such as 
timestamp, packet size, source/dest IP address, source/dest port, and 
source/dest MAC address.

I am trying this in Django.

Other than the source/dest port details, I am able to fetch everything.
Please tell me how I can get the port details from the pcap file.

This is my python code: 
https://github.com/manishkk/pcap-parser/blob/master/webapp/views.py

Thanking you in advance

On Wednesday, January 23, 2013 at 2:32:00 AM UTC+1, KDawg44 wrote:
> Is there a way to parse out a wireshark pcap file and extract key value pairs 
> from the data?  I am illustrating a sniff of some traffic and why it needs to 
> utilize HTTPS instead of HTTP, but I was hoping to run the pcap through a 
> python script and just output some interesting key value pairs.
> 
> 
> 
> Thanks for your help.
> 
> 
> Kevin



Mock object bug with assert_not_called

2016-12-27 Thread Diego Vela
Dear all,

From reading the documentation it seemed like this is the place to post a
bug.  If not please let me know where is the proper place to do so.

Bug:
For mock objects created using the @patch decorator the following methods
are inconsistent.
assert_not_called,
assert_called_with

Steps:
Create a mock object using the @patch decorator for a function.
Call a function that calls the mocked function.
Check assert_not_called and assert_called_with.

Thank you for your time.

-- 
Diego Vela


Re: ctypes, memory mapped files and context manager

2016-12-27 Thread Peter Otten
Hans-Peter Jansen wrote:

> Hi,
> 
> I'm using $subjects combination successfully in a project for
> creating/iterating over huge binary files (> 5GB) with impressive
> performance, while resource usage keeps pretty low, all with plain Python3
> code. Nice!
> 
> Environment: (Python 3.4.5, Linux 4.8.14, openSUSE/x86_64, NFS4 and XFS
> filesystems)
> 
> The idea is: map a ctypes structure onto the file at a certain offset, act
> on the structure, and release the mapping. The latter is necessary for
> keeping the mmap file properly resizable and closable (due to the nature
> of mmaps and Python's posix implementation thereof). Hence, a context
> manager serves us well (in theory).
> 
> Here's some code excerpt:
> 
> class cstructmap:
>     def __init__(self, cstruct, mm, offset = 0):
>         self._cstruct = cstruct
>         self._mm = mm
>         self._offset = offset
>         self._csinst = None
> 
>     def __enter__(self):
>         # resize the mmap (and backing file), if structure exceeds mmap
>         # size; mmap size must be aligned to mmap.PAGESIZE
>         cssize = ctypes.sizeof(self._cstruct)
>         if self._offset + cssize > self._mm.size():
>             newsize = align(self._offset + cssize, mmap.PAGESIZE)
>             self._mm.resize(newsize)
>         self._csinst = self._cstruct.from_buffer(self._mm, self._offset)
>         return self._csinst

Here you give away a reference to the ctypes.BigEndianStructure. That means 
you no longer control the lifetime of self._csinst which in turn holds a 
reference to the underlying mmap or whatever it's called. 

There might be a way to release the mmap reference while the wrapper 
structure is still alive, but the cleaner way is probably to not give it 
away in the first place, and create a proxy instead with

  return weakref.proxy(self._csinst)
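
A minimal sketch of the proxy idea, under stated assumptions: `Header`, `StructMap` and `demo` are hypothetical names, the structure layout is invented for illustration, and the resize step assumes a platform where mmap.resize() is available (such as Linux, as in the original post). It also relies on CPython freeing the structure as soon as the last strong reference is dropped.

```python
import ctypes
import mmap
import tempfile
import weakref


class Header(ctypes.Structure):
    # Invented layout, for illustration only.
    _fields_ = [("ident", ctypes.c_uint32), ("length", ctypes.c_uint32)]


class StructMap:
    def __init__(self, cstruct, mm, offset=0):
        self._cstruct = cstruct
        self._mm = mm
        self._offset = offset
        self._inst = None

    def __enter__(self):
        self._inst = self._cstruct.from_buffer(self._mm, self._offset)
        # Hand out a weak proxy so the manager keeps the only strong reference.
        return weakref.proxy(self._inst)

    def __exit__(self, exc_type, exc_value, exc_traceback):
        # Dropping the only strong reference releases the exported buffer.
        self._inst = None
        return False


def demo():
    with tempfile.TemporaryFile() as f:
        f.write(b"\0" * mmap.PAGESIZE)
        f.flush()
        m = mmap.mmap(f.fileno(), mmap.PAGESIZE)
        with StructMap(Header, m) as h:
            h.ident = 0x1234
            h.length = 8
        # "h" is now a dead proxy, so no buffer export blocks the resize.
        m.resize(2 * mmap.PAGESIZE)
        with StructMap(Header, m) as h2:
            result = (h2.ident, h2.length)
        size = m.size()
        m.close()
        return result, size
```

The trade-off is that using the proxy after the with block raises ReferenceError instead of silently keeping the mapping alive.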


> 
>     def __exit__(self, exc_type, exc_value, exc_traceback):
>         # free all references into mmap
>         del self._csinst

The line above is redundant. It removes the attribute from the instance 
__dict__ and implicitly decreases its refcount. It does not actually 
physically delete the referenced object. If you remove the del statement the 
line below will still decrease the refcount. 

Make sure you understand this to avoid littering your code with cargo cult 
del-s ;)
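
The point about del can be checked with a small sketch. `Thing` and `Holder` are hypothetical names, and the immediate release shown here relies on CPython's reference counting.

```python
import weakref


class Thing:
    """Placeholder for the object whose lifetime is being observed."""


class Holder:
    def __init__(self):
        self.item = Thing()


def released_by_rebinding():
    h = Holder()
    watcher = weakref.ref(h.item)
    h.item = None  # rebinding alone drops the old reference
    return watcher() is None


def released_by_del():
    h = Holder()
    watcher = weakref.ref(h.item)
    del h.item  # removes the attribute, which drops the same reference
    return watcher() is None and not hasattr(h, "item")
```

Both functions return True under CPython: either statement releases the object when it holds the last reference, so doing both in a row gains nothing.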

>         self._csinst = None
> 
> 
> def work():
>     with cstructmap(ItemHeader, self._mm, self._offset) as ih:
>         ih.identifier = ItemHeader.Identifier
>         ih.length = ItemHeaderSize + datasize
> 
>     blktype = ctypes.c_char * datasize
>     with cstructmap(blktype, self._mm, self._offset) as blk:
>         blk.raw = data
> 
> 
> In practice, this results in:
> 
> Traceback (most recent call last):
>   File "ctypes_mmap_ctx.py", line 146, in <module>
>     mf.add_data(data)
>   File "ctypes_mmap_ctx.py", line 113, in add_data
>     with cstructmap(blktype, self._mm, self._offset) as blk:
>   File "ctypes_mmap_ctx.py", line 42, in __enter__
>     self._mm.resize(newsize)
> BufferError: mmap can't resize with extant buffers exported.
> 
> The issue: when creating a mapping via context manager, we assign a local
> variable (with ..), that keeps existing in the local context, even when the
> manager context was left. This keeps a reference on the ctypes mapped area
> alive, even if we try everything to destroy it in __exit__. We have to del
> the with var manually.
> 
> Now, I want to get rid of the ugly and error-prone del statements.
> 
> What is needed, is a ctypes operation, that removes the mapping actively,
> and that could be added to the __exit__ part of the context manager.
> 
> Full working code example:
> https://gist.github.com/frispete/97c27e24a0aae1bcaf1375e2e463d239
> 
> The script creates a memory mapped file in the current directory named
> "mapfile". When started without arguments, it copies itself into this
> file, until 10 * mmap.PAGESIZE growth is reached (or it errored out
> before..).
> 
> If you change NOPROB to True, it will actively destruct the context
> manager vars, and should work as advertised.
> 
> Any ideas are much appreciated.

You might put some more effort into composing example scripts. Something 
like the script below would have saved me some time...

import ctypes
import mmap

from contextlib import contextmanager

class T(ctypes.Structure):
    _fields_ = [("foo", ctypes.c_uint32)]


@contextmanager
def map_struct(m, n):
    m.resize(n * mmap.PAGESIZE)
    yield T.from_buffer(m)

SIZE = mmap.PAGESIZE * 2
f = open("tmp.dat", "w+b")
f.write(b"\0" * SIZE)
f.seek(0)
m = mmap.mmap(f.fileno(), mmap.PAGESIZE)

with map_struct(m, 1) as a:
    a.foo = 1
# "a" is still bound here, so the resize below raises BufferError
with map_struct(m, 2) as b:
    b.foo = 2


> 
> Thanks in advance,
> Pete




Re: encoding="utf8" ignored when parsing XML

2016-12-27 Thread Steve D'Aprano
On Wed, 28 Dec 2016 02:05 am, Skip Montanaro wrote:

> I am trying to parse some XML which doesn't specify an encoding (Python
> 2.7.12 via Anaconda on RH Linux), so it barfs when it encounters non-ASCII
> data. No great surprise there, but I'm having trouble getting it to use
> another encoding. First, I tried specifying the encoding when opening the
> file:
> 
> f = io.open(fname, encoding="utf8")
> root = xml.etree.ElementTree.parse(f).getroot()

The documentation for ET.parse is pretty sparse

https://docs.python.org/2/library/xml.etree.elementtree.html#xml.etree.ElementTree.parse


but we can infer that it should take bytes as argument, not Unicode, since
it does its own charset processing. (The optional parser argument takes an
encoding argument which defaults to UTF-8.)

So that means using the built-in open(), or io.open() in binary mode.

You open the file and read in bytes from disk, *decoding* those bytes from
UTF-8 into a Unicode string. Then the ET parser tries to decode that Unicode
string into Unicode, which it does by first *encoding* it back to bytes using
the default encoding (namely ASCII), and that's where it blows up.

This particular error is a Python2-ism, since Python2 tries hard to let you
mix byte strings and unicode strings together, hence it will try implicitly
encoding/decoding strings to try to get them to fit together. Python3 does
not do this.

You can easily simulate this error at the REPL:



py> u"µ".decode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xb5' in position
0: ordinal not in range(128)


The give-away is that you're intending to do a *decode* operation but get an
*encode* error. That tells you that Python2 is trying to be helpful :-)

(Remember: Unicode strings encode to bytes, and bytes decode back to
strings.)
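
The two directions can be sketched as follows. This is written for Python 3, where there is no implicit conversion between the two types, unlike the Python 2 session above; `round_trip` and `ascii_can_encode` are hypothetical helper names.

```python
def round_trip(text, encoding="utf-8"):
    """Encode text to bytes, then decode those bytes back to text."""
    data = text.encode(encoding)   # str -> bytes
    return data, data.decode(encoding)  # bytes -> str


def ascii_can_encode(text):
    """Report whether text survives the ASCII step that fails in the thread."""
    try:
        text.encode("ascii")
    except UnicodeEncodeError:
        return False
    return True
```

With the micro sign from the example, round_trip("\u00b5") yields the two bytes 0xC2 0xB5 plus the original string, while ascii_can_encode("\u00b5") is False, which is the same failure as the UnicodeEncodeError above.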


You're trying to read bytes from a file on disk and get Unicode strings out:

bytes in file --> XML parser --> Unicode

so that counts as a decoding operation. But you're getting an encoding
error -- that's the smoking gun that suggests a dubious Unicode->bytes
step, using the default encoding (ASCII):

bytes in file --> io.open().read() --> Unicode --> XML Parser --> encode to
bytes using ASCII --> decode back to Unicode using UTF-8

And that suggests that the fix is to open the file without any charset
processing, i.e. use the builtin open() instead of io.open().

bytes in file --> builtin open().read() --> bytes --> XML Parser --> Unicode


I think you can even skip the 'rb' mode part: the real problem is that you
must not feed a Unicode string to the XML parser.
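
A minimal sketch of the bytes-only pipeline, assuming UTF-8 input without an XML declaration. Here io.BytesIO stands in for a file opened in binary mode, and `parse_xml_bytes` is a hypothetical helper name.

```python
import io
import xml.etree.ElementTree as ET


def parse_xml_bytes(raw):
    """Hand undecoded bytes to the parser and let it do the charset handling."""
    return ET.parse(io.BytesIO(raw)).getroot()


if __name__ == "__main__":
    # No encoding declaration, so the parser falls back to XML's default, UTF-8.
    raw = "<root>caf\u00e9</root>".encode("utf-8")
    root = parse_xml_bytes(raw)
    print(root.text == "caf\u00e9")
```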



> but that had no effect. Then, when calling xml.etree.ElementTree.parse I
> included an XMLParser object:
> 
> parser = xml.etree.ElementTree.XMLParser(encoding="utf8")
> root = xml.etree.ElementTree.parse(f, parser=parser).getroot()

That's the default, so there's no functional change here. Hence, the same
error.



-- 
Steve
“Cheer up,” they said, “things could be worse.” So I cheered up, and sure
enough, things got worse.



Re: Mock object bug with assert_not_called

2016-12-27 Thread Steve D'Aprano
On Wed, 28 Dec 2016 07:15 am, Diego Vela wrote:

> Dear all,
> 
> From reading the documentation it seemed like this is the place to post a
> bug.  If not please let me know where is the proper place to do so.

Once you are sure that it truly is a bug, then the proper place is the bug
tracker:

http://bugs.python.org/

This is a good place to ask if you are unsure whether it is or isn't a bug.


> Bug:
> For mock objects created using the @patch decorator the following methods
> are inconsistent.
> assert_not_called,
> assert_called_with
> 
> Steps
> Create a mock object using the @patch decorator for a function.
> call a function that calls the mocked function.
> check assert_not_called and assert_called_with.

Thank you, but this bug report can be improved.

(1) What results do you expect? What results do you get?

(2) What version of Python are you using?

(3) Code is better than English. Can you supply a *minimal* (as small and
simple as possible) example that shows this bug?
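
As a sketch of what such a minimal example could look like (this is not the reporter's code, which was not posted): it assumes Python 3.5 or later, where assert_not_called exists, and it patches os.getcwd purely for illustration. `where_am_i` and `check` are hypothetical names.

```python
import os
from unittest import mock


def where_am_i():
    # Function under test; it calls the function that gets mocked.
    return os.getcwd()


@mock.patch("os.getcwd")
def check(mock_getcwd):
    mock_getcwd.return_value = "/somewhere"
    result = where_am_i()
    # The mock was called once with no arguments, so this passes.
    mock_getcwd.assert_called_with()
    # The mock has been called, so assert_not_called should raise.
    try:
        mock_getcwd.assert_not_called()
    except AssertionError:
        not_called_raised = True
    else:
        not_called_raised = False
    return result, not_called_raised
```

A report that states the expected and actual values of such a return pair, together with the Python version, answers all three questions at once.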




-- 
Steve
“Cheer up,” they said, “things could be worse.” So I cheered up, and sure
enough, things got worse.
