Re: multiple JSON documents in one file, change proposal

2018-12-01 Thread Marko Rauhamaa
Paul Rubin :

> Marko Rauhamaa  writes:
>> Having rejected different options (https://en.wikipedia.org/wiki/JSON_streaming), I settled on
>> terminating each JSON value with an ASCII NUL character, which is
>> illegal in JSON proper.
>
> Thanks, that Wikipedia article is helpful.  I'd prefer to not use stuff
> like NUL or RS because I like keeping the file human readable.  I might
> use netstring format (http://cr.yp.to/proto/netstrings.txt) but I'm even
> more convinced now that adding a streaming feature to the existing json
> module is the right way to do it.

We all have our preferences.

In my case, I need an explicit terminator marker to know when a JSON
value is complete. For example, if I read this from a socket:

   123

I can't yet parse it because there might be another digit coming. On the
other hand, the peer might not see any reason to send any further bytes
because "123" is all they wanted to send at the moment.

As for NUL, a control character that is illegal in all JSON contexts is
practical so the JSON chunks don't need to be escaped. An ASCII-esque
solution would be to pick ETX (= end of text). Unfortunately, a human
operator typing ETX (= ctrl-C) to terminate a JSON value will cause a
KeyboardInterrupt in many modern command-line interfaces.

It happens that NUL (= ctrl-SPC = ctrl-@) is pretty easy to generate and
manipulate in editors and on the command line.

The need for the format to be "typable" (and editable) is essential for
ad-hoc manual testing of components. That precludes all framing formats
that would necessitate a length prefix. HTTP would be horrible to have
to type even without the content-length problem, but BEEP (RFC 3080)
would suffer from the content-length (and CRLF!) issue as well.

Finally, couldn't any whitespace character work as a terminator? Yes, it
could, but it would force you to use a special JSON parser that is
prepared to handle the self-delineation. A NUL gives you many more
degrees of freedom in choosing your JSON tools.


Marko
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: multiple JSON documents in one file, change proposal

2018-12-01 Thread Chris Angelico
On Sat, Dec 1, 2018 at 9:16 PM Marko Rauhamaa  wrote:
>
> Paul Rubin :
>
> > Marko Rauhamaa  writes:
> >> Having rejected different options (https://en.wikipedia.org/wiki/JSON_streaming), I settled on
> >> terminating each JSON value with an ASCII NUL character, which is
> >> illegal in JSON proper.
> >
> > Thanks, that Wikipedia article is helpful.  I'd prefer to not use stuff
> > like NUL or RS because I like keeping the file human readable.  I might
> > use netstring format (http://cr.yp.to/proto/netstrings.txt) but I'm even
> > more convinced now that adding a streaming feature to the existing json
> > module is the right way to do it.
>
> We all have our preferences.
>
> In my case, I need an explicit terminator marker to know when a JSON
> value is complete. For example, if I read this from a socket:
>
>    123
>
> I can't yet parse it because there might be another digit coming. On the
> other hand, the peer might not see any reason to send any further bytes
> because "123" is all they wanted to send at the moment.

This is actually the only special case. Every other JSON value has a
clear end. So the only thing you need to say is that, if the sender
wishes to transmit a bare number, it must append a space. Seriously,
how often do you ACTUALLY send a bare number? I've sometimes sent a
string on its own, but even that is incredibly rare. Having to send a
simple space after a bare number is unlikely to cause much trouble.
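
In sender terms that rule is tiny -- a hypothetical sketch, not
anything the json module does for you:

import json

def frame(value):
    text = json.dumps(value)
    # Bare numbers are the only values without a self-evident end;
    # pad them with a space so the receiver knows they're complete.
    # (bool is excluded because it serializes as true/false.)
    if isinstance(value, (int, float)) and not isinstance(value, bool):
        text += " "
    return text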

> As for NUL, a control character that is illegal in all JSON contexts is
> practical so the JSON chunks don't need to be escaped. An ASCII-esque
> solution would be to pick ETX (= end of text). Unfortunately, a human
> operator typing ETX (= ctrl-C) to terminate a JSON value will cause a
> KeyboardInterrupt in many modern command-line interfaces.
>
> It happens that NUL (= ctrl-SPC = ctrl-@) is pretty easy to generate
> and manipulate in editors and on the command line.

I have no idea which editors YOU use, but if you poll across platforms
and systems, I'm pretty sure you'll find that not everyone can type
it. Furthermore, many tools use the presence of an 0x00 byte as
evidence that a file is binary, not text. (For instance, git does
this.) That might be a good choice for your personal use-case, but not
the general case, whereas the much simpler options listed on the
Wikipedia page are far more general, and actually wouldn't be THAT
hard for you to use.

> The need for the format to be "typable" (and editable) is essential for
> ad-hoc manual testing of components. That precludes all framing formats
> that would necessitate a length prefix. HTTP would be horrible to have
> to type even without the content-length problem, but BEEP (RFC 3080)
> would suffer from the content-length (and CRLF!) issue as well.

I dunno, I type HTTP manually often enough that it can't be all *that* horrible.

> Finally, couldn't any whitespace character work as a terminator? Yes, it
> could, but it would force you to use a special JSON parser that is
> prepared to handle the self-delineation. A NUL gives you many more
> degrees of freedom in choosing your JSON tools.

Either non-delimited or newline-delimited JSON is supported in a lot
of tools. I'm quite at a loss here as to how an unprintable character
gives you more freedom.

I get it: you have a bizarre set of tools and the normal solutions
don't work for you. But you can't complain about the tools not
supporting your use-cases. Just code up your own styles of doing
things that are unique to you.

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: multiple JSON documents in one file, change proposal

2018-12-01 Thread Marko Rauhamaa
Chris Angelico :
> On Sat, Dec 1, 2018 at 9:16 PM Marko Rauhamaa  wrote:
>> The need for the format to be "typable" (and editable) is essential
>> for ad-hoc manual testing of components. That precludes all framing
>> formats that would necessitate a length prefix. HTTP would be
>> horrible to have to type even without the content-length problem, but
>> BEEP (RFC 3080) would suffer from the content-length (and CRLF!)
>> issue as well.
>
> I dunno, I type HTTP manually often enough that it can't be all *that*
> horrible.

Say I want to send this piece of JSON:

   {
   "msgtype": "echo-req",
   "opid": 3487547843
   }

and the framing format is HTTP. I will need to type something like this:

   POST / HTTP/1.1^M
   Host: localhost^M
   Content-type: application/json^M
   Content-length: 54^M
   ^M
   {
   "msgtype": "echo-req",
   "opid": 3487547843
   }

That's almost impossible to type without a syntax error.
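
Computing that Content-length is exactly the part you need a program
for. A rough sketch of what the program would have to do (nothing I'd
care to imitate at a keyboard):

   import json

   def http_frame(value):
       body = json.dumps(value, indent=1).encode("utf-8")
       headers = ("POST / HTTP/1.1\r\n"
                  "Host: localhost\r\n"
                  "Content-type: application/json\r\n"
                  "Content-length: {}\r\n"
                  "\r\n").format(len(body))
       return headers.encode("ascii") + body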

>> Finally, couldn't any whitespace character work as a terminator? Yes,
>> it could, but it would force you to use a special JSON parser that is
>> prepared to handle the self-delineation. A NUL gives you many more
>> degrees of freedom in choosing your JSON tools.
>
> Either non-delimited or newline-delimited JSON is supported in a lot
> of tools. I'm quite at a loss here as to how an unprintable character
> gives you more freedom.

As stated by Paul in another context, newline-delimited is a no-go
because it forces you to restrict JSON to a subset that doesn't contain
newlines (see the JSON example above).

Of course, you could say that the terminating newline is only
interpreted as a terminator after a complete JSON value, but that's not
the format "supported in a lot of tools".

If you use any legal JSON character as a terminator, you have to make it
contextual or add an escaping syntax.

> I get it: you have a bizarre set of tools and the normal solutions
> don't work for you. But you can't complain about the tools not
> supporting your use-cases. Just code up your own styles of doing
> things that are unique to you.

There are numerous tools that parse complete JSON documents fine.
Framing JSON values with NUL-termination is trivial to add in any
programming environment. For example:

   import json

   def json_docs(path):
       with open(path) as f:
           # the slice drops the empty string after the final NUL
           for doc in f.read().split("\0")[:-1]:
               yield json.loads(doc)
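
And the writing side is just as trivial (a sketch along the same
lines, reusing the json import above):

   def write_json_docs(path, values):
       with open(path, "w") as f:
           for value in values:
               f.write(json.dumps(value))
               f.write("\0")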


Marko
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: multiple JSON documents in one file, change proposal

2018-12-01 Thread Chris Angelico
On Sat, Dec 1, 2018 at 10:16 PM Marko Rauhamaa  wrote:
>
> Chris Angelico :
> > On Sat, Dec 1, 2018 at 9:16 PM Marko Rauhamaa  wrote:
> >> The need for the format to be "typable" (and editable) is essential
> >> for ad-hoc manual testing of components. That precludes all framing
> >> formats that would necessitate a length prefix. HTTP would be
> >> horrible to have to type even without the content-length problem, but
> >> BEEP (RFC 3080) would suffer from the content-length (and CRLF!)
> >> issue as well.
> >
> > I dunno, I type HTTP manually often enough that it can't be all *that*
> > horrible.
>
> Say I want to send this piece of JSON:
>
>{
>"msgtype": "echo-req",
>"opid": 3487547843
>}
>
> and the framing format is HTTP. I will need to type something like this:
>
>POST / HTTP/1.1^M
>Host: localhost^M
>Content-type: application/json^M
>Content-length: 54^M
>^M
>{
>"msgtype": "echo-req",
>"opid": 3487547843
>}
>
> That's almost impossible to type without a syntax error.

1) Set your Enter key to send CR-LF, at least for this operation.
That's half your problem solved.
2) Send the request like this:

POST / HTTP/1.0
Content-type: application/json

{"msgtype": "echo-req", "opid": 3487547843}

Then shut down your end of the connection, probably with Ctrl-D. I'm
fairly sure I can type that without bugs, and any compliant HTTP
server should be fine with it.

> >> Finally, couldn't any whitespace character work as a terminator? Yes,
> >> it could, but it would force you to use a special JSON parser that is
> >> prepared to handle the self-delineation. A NUL gives you many more
> >> degrees of freedom in choosing your JSON tools.
> >
> > Either non-delimited or newline-delimited JSON is supported in a lot
> > of tools. I'm quite at a loss here as to how an unprintable character
> > gives you more freedom.
>
> As stated by Paul in another context, newline-delimited is a no-go
> because it forces you to restrict JSON to a subset that doesn't contain
> newlines (see the JSON example above).
>
> Of course, you could say that the terminating newline is only
> interpreted as a terminator after a complete JSON value, but that's not
> the format "supported in a lot of tools".

The subset in question is simply "JSON without any newlines between
tokens", which has the exact meaning as it would have *with* those
newlines. So what you lose is the human-readability of being able to
break an object over multiple lines. Is that a problem? Use
non-delimited instead.

> If you use any legal JSON character as a terminator, you have to make it
> contextual or add an escaping syntax.

Or just use non-delimited, strip all whitespace between objects, and
then special-case the one otherwise-ambiguous situation of two Numbers
back to back. Anything that sends newline-delimited JSON will work
with that.
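
In Python terms that's just json.JSONDecoder().raw_decode in a loop --
a quick sketch (untested):

import json

def parse_stream(text):
    # Yield consecutive JSON values from one string, no delimiters needed.
    decoder = json.JSONDecoder()
    pos, end = 0, len(text)
    while pos < end:
        # skip any whitespace (including newlines) between values
        while pos < end and text[pos].isspace():
            pos += 1
        if pos == end:
            break
        value, pos = decoder.raw_decode(text, pos)
        yield value

The one thing it can't untangle is two bare numbers butted together
with nothing in between, which is exactly where the space rule above
comes in.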

> > I get it: you have a bizarre set of tools and the normal solutions
> > don't work for you. But you can't complain about the tools not
> > supporting your use-cases. Just code up your own styles of doing
> > things that are unique to you.
>
> There are numerous tools that parse complete JSON documents fine.
> Framing JSON values with NUL-termination is trivial to add in any
> programming environment. For example:
>
>    import json
>
>    def json_docs(path):
>        with open(path) as f:
>            # the slice drops the empty string after the final NUL
>            for doc in f.read().split("\0")[:-1]:
>                yield json.loads(doc)

Yes, but many text-processing tools don't let you manually insert
NULs. Of *course* you can put anything you like in there when you
control both ends and everything in between; that's kinda the point of
coding. But I'm going to use newlines, and parse as non-delimited,
since that can be done just as easily (see my example code earlier -
it could be converted into the same style of generator as you have
here and would be about as many lines) and will behave as text in
most applications.
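
Converted into that generator style it comes out about this long,
reusing the parse_stream sketch from above:

def json_docs(path):
    with open(path) as f:
        yield from parse_stream(f.read())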

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: multiple JSON documents in one file, change proposal

2018-12-01 Thread Marko Rauhamaa
Chris Angelico :

> On Sat, Dec 1, 2018 at 10:16 PM Marko Rauhamaa  wrote:
>> and the framing format is HTTP. I will need to type something like this:
>>
>>POST / HTTP/1.1^M
>>Host: localhost^M
>>Content-type: application/json^M
>>Content-length: 54^M
>>^M
>>{
>>"msgtype": "echo-req",
>>"opid": 3487547843
>>}
>>
>> That's almost impossible to type without a syntax error.
>
> 1) Set your Enter key to send CR-LF, at least for this operation.
> That's half your problem solved.

That can be much harder than typing ctrl-SPC. It *is* supported by
netcat, for example, but then you have to carefully recompute the
content-length field.

> 2) Send the request like this:
>
> POST / HTTP/1.0
> Content-type: application/json
>
> {"msgtype": "echo-req", "opid": 3487547843}
>
> Then shut down your end of the connection, probably with Ctrl-D. I'm
> fairly sure I can type that without bugs, and any compliant HTTP
> server should be fine with it.

If I terminated the input, I wouldn't need any framing. The point is to
carry a number of JSON messages/documents over a single bytestream or in
a single file. That means the HTTP content-length would be mandatory.


Marko
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: multiple JSON documents in one file, change proposal

2018-12-01 Thread Ben Bacarisse
Marko Rauhamaa  writes:

> Chris Angelico :
>
>> On Sat, Dec 1, 2018 at 10:16 PM Marko Rauhamaa  wrote:
>>> and the framing format is HTTP. I will need to type something like this:
>>>
>>>POST / HTTP/1.1^M
>>>Host: localhost^M
>>>Content-type: application/json^M
>>>Content-length: 54^M
>>>^M
>>>{
>>>"msgtype": "echo-req",
>>>"opid": 3487547843
>>>}
>>>
>>> That's almost impossible to type without a syntax error.
>>
>> 1) Set your Enter key to send CR-LF, at least for this operation.
>> That's half your problem solved.
>
> That can be much harder than typing ctrl-SPC. It *is* supported by
> netcat, for example, but then you have to carefully recompute the
> content-length field.
>
>> 2) Send the request like this:
>>
>> POST / HTTP/1.0
>> Content-type: application/json
>>
>> {"msgtype": "echo-req", "opid": 3487547843}
>>
>> Then shut down your end of the connection, probably with Ctrl-D. I'm
>> fairly sure I can type that without bugs, and any compliant HTTP
>> server should be fine with it.
>
> If I terminated the input, I wouldn't need any framing. The point is to
> carry a number of JSON messages/documents over a single bytestream or in
> a single file. That means the HTTP content-length would be mandatory.

I haven't been following the details of the thread, but I wonder if a
multi-part form-data POST would be useful.  Way more complex than a
simple separator, of course, but more "standard" (for some rather vague
meaning of the term).
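
Something along these lines, say with the requests library (purely
illustrative -- the URL and field names are invented):

import json
import requests

docs = [{"msgtype": "echo-req", "opid": 1},
        {"msgtype": "echo-req", "opid": 2}]
files = [("doc%d" % i, ("doc%d.json" % i, json.dumps(doc), "application/json"))
         for i, doc in enumerate(docs)]
requests.post("http://localhost:8080/ingest", files=files)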

-- 
Ben.
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: multiple JSON documents in one file, change proposal

2018-12-01 Thread Grant Edwards
On 2018-11-30, Marko Rauhamaa  wrote:
> Paul Rubin :
>> Maybe someone can convince me I'm misusing JSON but I often want to
>> write out a file containing multiple records, and it's convenient to
>> use JSON to represent the record data.
>>
>> The obvious way to read a JSON doc from a file is with "json.load(f)"
>> where f is a file handle. Unfortunately, this throws an exception
>
> I have this "multi-JSON" need quite often. In particular, I exchange
> JSON-encoded messages over byte stream connections. There are many ways
> of doing it. Having rejected different options (https://en.wikipedia.org/wiki/JSON_streaming), I settled on
> terminating each JSON value with an ASCII NUL character, which is
> illegal in JSON proper.

This is what "archive" file formats are for.  Just use tar, zip, ar,
cpio, or some other file format designed to store multiple "files" of
arbitrary data -- there are plenty of "standard" formats to choose
from and plenty of libraries to deal with them.
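
For example, with the zipfile module from the standard library (a
quick sketch, untested):

import json
import zipfile

def write_docs(path, docs):
    with zipfile.ZipFile(path, "w") as zf:
        for i, doc in enumerate(docs):
            zf.writestr("doc%05d.json" % i, json.dumps(doc, indent=1))

def read_docs(path):
    with zipfile.ZipFile(path) as zf:
        for name in zf.namelist():
            yield json.loads(zf.read(name))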

-- 
Grant

-- 
https://mail.python.org/mailman/listinfo/python-list


Re: ValueError vs IndexError, unpacking arguments with string.split

2018-12-01 Thread Morten W. Petersen
On Sat, Dec 1, 2018 at 1:11 AM Chris Angelico  wrote:

> On Sat, Dec 1, 2018 at 10:55 AM Morten W. Petersen 
> wrote:
> > But this raises the question of how to write Python code,
> > short and sweet, that could handle infinite iterators in
> > such an unpack with multiple variables to assign to.
> >
> > Which I guess is mostly theoretical, as there are other
> > ways of designing code to avoid needing to unpack
> > an infinite iterator using the * prefix.
>
> It could only be done with the cooperation of the iterable in
> question. For instance, a range object could implement a "remove
> first" operation that does this, and several itertools types wouldn't
> need to change at all. But it can't be done generically other than the
> way it now is (pump the iterator the rest of the way).
>


I wasn't able to follow this, could you elaborate?

Regards,

Morten


-- 
Videos at https://www.youtube.com/user/TheBlogologue
Twittering at http://twitter.com/blogologue
Blogging at http://blogologue.com
Playing music at https://soundcloud.com/morten-w-petersen
Also playing music and podcasting here:
http://www.mixcloud.com/morten-w-petersen/
On Google+ here https://plus.google.com/107781930037068750156
On Instagram at https://instagram.com/morphexx/
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: ValueError vs IndexError, unpacking arguments with string.split

2018-12-01 Thread Chris Angelico
On Sun, Dec 2, 2018 at 11:55 AM Morten W. Petersen  wrote:
>
> On Sat, Dec 1, 2018 at 1:11 AM Chris Angelico  wrote:
>>
>> On Sat, Dec 1, 2018 at 10:55 AM Morten W. Petersen  wrote:
>> > But this raises the question of how to write Python code,
>> > short and sweet, that could handle infinite iterators in
>> > such an unpack with multiple variables to assign to.
>> >
>> > Which I guess is mostly theoretical, as there are other
>> > ways of designing code to avoid needing to unpack
>> > an infinite iterator using the * prefix.
>>
>> It could only be done with the cooperation of the iterable in
>> question. For instance, a range object could implement a "remove
>> first" operation that does this, and several itertools types wouldn't
>> need to change at all. But it can't be done generically other than the
>> way it now is (pump the iterator the rest of the way).
>
> I wasn't able to follow this, could you elaborate?
>

The way *x works in unpacking is that the entire iterable gets
unpacked, and everything gets put into a list that is then assigned to
x. This is generic and works on any iterable, but it can't take
advantage of anything the iterable knows about its own structure. Consider:

>>> x = range(3, 20, 5)
>>> first, *rest = x
>>> first
3
>>> rest
[8, 13, 18]

If a range object were asked to yield its first-and-rest in this way,
it could instead return range(8, 20, 5) - add the step onto the start,
job done. Same if you ask for some off the beginning and some off the
end. And with a number of the itertools iterables/iterators, the "rest"
wouldn't actually need to change anything, since the iterator will get
consumed. This would need explicit support from the iterable, though,
as Python can't know how to do this generically; so there would need
to be a protocol for "unpack".
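
The generic workaround today is to keep the "rest" lazy yourself
instead of star-unpacking it -- a quick sketch:

>>> import itertools
>>> def first_and_rest(iterable):
...     # like "first, *rest = iterable", but rest stays a lazy
...     # iterator, so this works on infinite iterators too
...     it = iter(iterable)
...     return next(it), it
...
>>> first, rest = first_and_rest(itertools.count(3))
>>> first
3
>>> list(itertools.islice(rest, 4))
[4, 5, 6, 7]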

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: multiple JSON documents in one file, change proposal

2018-12-01 Thread Akkana Peck
Grant Edwards  writes:
> This is what "archive" file formats are for.  Just use tar, zip, ar,
> cpio, or some other file format designed to store multiple "files" of
> arbitrary data -- there are plenty of "standard" formats to choose
> from and plenty of libraries to deal with them.

Then the data files wouldn't be human readable, making debugging a
lot more of a hassle.

Cameron Simpson writes:
> There's a common format called Newline Delimited JSON (NDJSON) for just this
> need.
> 
> Just format the outbound records as JSON with no newlines (i.e. make the
> separator a space or the empty string), put a newline at the end of each.
> 
> On ingest, read lines of text, and JSON parse each line separately.

I'll second that. It's very easy to deal with. A shorter name for it
is JSONL -- I use .jsonl for filenames.
https://en.wikipedia.org/wiki/JSON_streaming#Line-delimited_JSON
discusses that as well as several other solutions.
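
The whole round trip is only a few lines (a minimal sketch):

import json

def write_jsonl(path, docs):
    with open(path, "w") as f:
        for doc in docs:
            # json.dumps emits no newlines unless asked to indent,
            # so each record is guaranteed to stay on one line
            f.write(json.dumps(doc) + "\n")

def read_jsonl(path):
    with open(path) as f:
        for line in f:
            if line.strip():
                yield json.loads(line)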

...Akkana
-- 
https://mail.python.org/mailman/listinfo/python-list