[issue1521950] shlex.split() does not tokenize like the shell
Dan Christian added the comment:

Of course, that's how it's used. That's all it can do right now.

I was splitting and combining commands (using ;, &&, and ||) and then running the resulting (mega) one-liners over ssh. It still gets run by a shell, but I was specifying the control flow. It's kind of like a makefile command block. You want to be able to specify whether a failure aborts the sequence or is ignored (&& vs ;). Sometimes there are fallback commands (via ||). Of course, you can also group using (). Once things are split properly, understanding the shell control characters is straightforward.

In my mind, shlex.split() should either be as close to shell syntax as possible, or have a clear explanation of what is different (and why). I ended up doing my own parsing. I'm not actually at that company anymore, so I can't pull up the code. I'll see if I can come up with a reference case and maybe a unittest this weekend (that's really the only time I'll have to dig into it).

-Dan

On Thu, Nov 24, 2011 at 9:20 AM, Éric Araujo wrote:
>
> Éric Araujo added the comment:
>
> Thanks for the comments.
>
>> There are really two cases in one bug. The first part is that the shell will
>> split tokens at characters that shlex doesn't. The handling of &, |, ;, >, and <
>> could be done by adjusting the definition of shlex.wordchars. The shell may
>> also understand things like: &&, ||, |&, and >&. The exact definition of these
>> depends on the shell, so maybe it's best to just split them out as separate
>> tokens and let the user figure out the compound meanings.
> Yes. I think that the main use of shlex is really to parse a line into
> chunks with a way to embed spaces; it’s intended to parse a program command
> line (“prog --blah "value stillthesamevalue" "arg samearg"”), but not
> necessarily a full shell line (with & and | and whatnot). When people have a
> line containing & and |, then they need a shell to execute it, so they would
> not call shlex.split but just pass the full line to os.system or
> subprocess.Popen. Do you remember what use cases you had when you opened
> this report?
>
>> The proper handling of quotes/escapes requires some kind of new interface.
>> You need to distinguish between tokens that were modified by the quote/escape
>> rules and those that were not.
> I don’t see why I would care about quotes in the result of shlex.split.
>
> See also #7611.
>
> --
>
> ___
> Python tracker
> <http://bugs.python.org/issue1521950>
> ___

--

___
Python tracker
<http://bugs.python.org/issue1521950>
___
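A quick illustration of the mismatch being described, assuming a current CPython shlex (the shell tokenization shown in the comment follows the POSIX grammar):

    import shlex

    # shlex.split() only breaks on whitespace (outside quotes), so control
    # operators that are not surrounded by spaces stay glued to the words:
    shlex.split("true&&echo ok;echo done")
    # -> ['true&&echo', 'ok;echo', 'done']
    # whereas a POSIX shell tokenizes the same line as:
    #     'true' '&&' 'echo' 'ok' ';' 'echo' 'done'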
[issue1521950] shlex.split() does not tokenize like the shell
Dan Christian added the comment:

I've attached a diff to test_shlex.py and a script that I used to verify what the shells actually do. Both are relative to Python-3.2.2/Lib/test.

I'm completely ignoring the quotes issue for now. That should probably be an enhancement. I don't think it really matters until the parsing issues are resolved.

ref_shlex is Python 2 syntax; python -3 shows that it should convert cleanly.
  ./ref_shlex.py
It will run by default against /bin/*sh. If you don't want that, do something like: export SHELLS='/bin/sh,/bin/csh'
It runs as a unittest, so you will only see dots if all shells do what it expects. Some shells are flaky (e.g. zsh, tcsh), so you may need to run it multiple times.

Getting this into the mainline will be interesting. I would think it would take some community discussion. I may be able to convince people that the current behaviour is wrong, but I can't tell you what will break if it is "fixed". And should the fix be the default? As you mentioned, it depends on what people expect it to do and how it is currently being used. I see the first step as presenting a clear case of how it should work.

-Dan

On Fri, Nov 25, 2011 at 10:01 AM, Éric Araujo wrote:
>
> Éric Araujo added the comment:
>
>> Of course, that's how it's used. That's all it can do right now.
> :) What I meant is that it is *meant* to be used in this way.
>
>> I was splitting and combining commands (using ;, &&, and ||) and then
>> running the resulting (mega) one-liners over ssh. It still gets run by a
>> shell, but I was specifying the control flow.
> Thank you for the reply. It is indeed a valuable use case to pass a command
> line as one string to ssh, and the split/quote combo should round-trip and be
> useful for this usage.
>
>> I'll see if I can come up with a reference case and maybe a unittest this
>> weekend
> Great! A new argument (with a default value which gets us the previous
> behavior) will probably be needed, to preserve backward compatibility.
>
> --
> nosy: +niemeyer
> versions: +Python 3.3 -Python 3.2
>
> ___
> Python tracker
> <http://bugs.python.org/issue1521950>
> ___

--
keywords: +patch
Added file: http://bugs.python.org/file23778/ref_shlex.py
Added file: http://bugs.python.org/file23779/test_shlex.diff

___
Python tracker
<http://bugs.python.org/issue1521950>
___

Attached file ref_shlex.py:

#!/usr/bin/env python
"""Test how various shells parse syntax.

This is only expected to work on Unix based systems.
We use the unittest infrastructure, but this isn't a normal test.

Usage:  ref_shlex.py [options] shells...
"""
# Written by: Dan Christian for issue1521950

import glob
import re
import os, sys
import optparse
import subprocess
import unittest

TempDir = '/tmp'                      # where we will write temp files
Shells = ['/bin/sh', '/bin/bash']     # list of shells to test against


class ShellTest(unittest.TestCase):
    bgRe = re.compile(r'\[\d+\]\s+(\d+|\+ Done)$')  # backgrounded command output

    def Run(self,
            shell,           # shell to use
            command,         # command to run
            filepath=None):  # any files that are expected
        """Carefully run a shell command.  Capture stdout, stderr, and exit status.

        Returns: (ret, out, err)
          ret is the return status
          out is the list of lines to stdout
          err is the list of lines to stderr
        """
        start_cwd = os.getcwd()
        call = [shell, '-c', command]
        #print "Running: %s -c '%s'" % (shell, command)
        outpath = 'stdout.txt'
        errpath = 'stderr.txt'
        ret = -1
        out = None
        err = None
        fileout = None
        try:
            os.chdir(TempDir)
            outfp = open(outpath, 'w')
            errfp = open(errpath, 'w')
            if filepath and os.path.isfile(filepath):
                os.remove(filepath)
            ret = subprocess.call(call, stdout=outfp, stderr=errfp)
            #print "Returned: %d" % ret
            outfp = open(outpath, 'r')
            out = outfp.readlines()
            os.remove(outpath)
            errfp = open(errpath, 'r')
            err = errfp.readlines()
            os.remove(errpath)
            if filepath:
                ffp = open(filepath)
                fileout = ffp.readlines()
                os.remove(filepath)
        except OSError as msg:
            print "Exception!", msg
[issue1521950] shlex.split() does not tokenize like the shell
Dan Christian added the comment:

I just realized that I left out a major case: the shell will also split (). I think this is now complete. If you do "man bash" and skip down to DEFINITIONS, it lists all the control characters.

I've attached updated versions of ref_shlex.py and test_shlex.diff. They replace the previous ones.

-Dan

On Fri, Nov 25, 2011 at 12:25 PM, Dan Christian wrote:
>
> Dan Christian added the comment:
>
> I've attached a diff to test_shlex.py and a script that I used to
> verify what the shells actually do.
> Both are relative to Python-3.2.2/Lib/test
>
> I'm completely ignoring the quotes issue for now. That should
> probably be an enhancement. I don't think it really matters until the
> parsing issues are resolved.
>
> ref_shlex is Python 2 syntax; python -3 shows that it should convert cleanly.
>   ./ref_shlex.py
> It will run by default against /bin/*sh.
> If you don't want that, do something like: export SHELLS='/bin/sh,/bin/csh'
> It runs as a unittest, so you will only see dots if all shells do
> what it expects. Some shells are flaky (e.g. zsh, tcsh), so you may
> need to run it multiple times.
>
> Getting this into the mainline will be interesting. I would think it
> would take some community discussion. I may be able to convince
> people that the current behaviour is wrong, but I can't tell you what
> will break if it is "fixed". And should the fix be the default? As
> you mentioned, it depends on what people expect it to do and how it is
> currently being used. I see the first step as presenting a clear case
> of how it should work.
>
> -Dan

--
Added file: http://bugs.python.org/file23780/ref_shlex.py
Added file: http://bugs.python.org/file23781/test_shlex.diff

___
Python tracker
<http://bugs.python.org/issue1521950>
___

Attached file ref_shlex.py (updated):

#!/usr/bin/env python
"""Test how various shells parse syntax.

This is only expected to work on Unix based systems.
We use the unittest infrastructure, but this isn't a normal test.

Usage:  ref_shlex.py [options] shells...
"""
# Written by: Dan Christian for issue1521950
# References: man bash
#   look at DEFINITIONS and SHELL GRAMMAR

import glob
import re
import os, sys
import subprocess
import unittest

TempDir = '/tmp'                      # where we will write temp files
Shells = ['/bin/sh', '/bin/bash']     # list of shells to test against


class ShellTest(unittest.TestCase):
    bgRe = re.compile(r'\[\d+\]\s+(\d+|\+ Done)$')  # backgrounded command output

    def Run(self,
            shell,           # shell to use
            command,         # command to run
            filepath=None):  # any files that are expected
        """Carefully run a shell command.  Capture stdout, stderr, and exit status.

        Returns: (ret, out, err)
          ret is the return status
          out is the list of lines to stdout
          err is the list of lines to stderr
        """
        start_cwd = os.getcwd()
        call = [shell, '-c', command]
        #print "Running: %s -c '%s'" % (shell, command)
        outpath = 'stdout.txt'
        errpath = 'stderr.txt'
        ret = -1
        out = None
        err = None
        fileout = None
        try:
            os.chdir(TempDir)
            outfp = open(outpath, 'w')
            errfp = open(errpath, 'w')
            if filepath and os.path.isfile(filepath):
                os.remove(filepath)
            ret = subprocess.call(call, stdout=outfp, stderr=errfp)
            #print "Returned: %d" % ret
            outfp = open(outpath, 'r')
            out = outfp.readlines()
            os.remove(outpath)
            errfp = open(errpath, 'r')
            err = errfp.readlines()
            os.remove(errpath)
            if filepath:
                ffp = open(filepath)
                fileout = ffp.readlines()
                os.remove(filepath)
        except OSError as msg:
            print "Exception!", msg
            os.chdir(start_cwd)  # leave files behind for debugging
            self.assertTrue(0, "Hit an exception running: %s" % ' '.join(call))
        return (ret, out, err, fileout)

    def testTrue(self):
        """Trivial case to test execution."""
        for shell in Shells:
            cmd = '/bin/true'
            (ret, out, err, fout) = self.Run(shell, cmd)
            self.assertEquals(
                0, ret,
                "Expected %s -c '%s' to return 0, not %d" % (shell, cmd, ret))
            self.assertEq
[issue1521950] shlex.split() does not tokenize like the shell
Dan Christian added the comment:

On Sat, Nov 26, 2011 at 7:12 AM, Éric Araujo wrote:
> Your script passes with dash, which is probably the most POSIX-compliant
> shell we can find. (bash has extensions, zsh/csh don’t use the POSIX shell
> language, so I think the behavior of dash should be our reference, not the
> bash man page.)

I was just looking for a reference where I didn't have to sift through tons of documentation. Most systems have bash. Before that I was just working from experience (I've done a lot of shell scripting).

> there is code out there that depends on the current behavior of shlex and
> does not need to support && || ; ( ), if we add support for these tokens we
> should not break the existing code.

Here's a thought on how that might work (just brainstorming).

shlex uses a series of character strings to drive its parsing: whitespace, escape, quotes. Add another one: control = '();<>|&'. If it is unset (by default?), then the behavior is as before. If it is set, then shlex will output any character in control as a separate token.

There might be a shell-specific script (or maybe it's left to the user) that decides that certain tokens can be recombined: '&&', '||', '|&', '>>', etc. This code is pretty simple: walk the token sequence, and if you see a two-token pair, pop the second and combine it into the first.

-Dan

--

___
Python tracker
<http://bugs.python.org/issue1521950>
___
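A minimal sketch of that recombination pass. The names and the pair table are illustrative only; a real table would depend on the target shell dialect (sh vs csh), and a quoted literal '&' whose quotes were already stripped would get merged too, which is exactly the quoting caveat raised elsewhere in this thread.

    # Hypothetical helper, not part of shlex: merge adjacent one-character
    # control tokens into the compound operators the shell understands.
    COMPOUND = {('&', '&'): '&&', ('|', '|'): '||', ('|', '&'): '|&',
                ('>', '>'): '>>', ('>', '&'): '>&', ('<', '<'): '<<'}

    def recombine(tokens, compound=COMPOUND):
        out = []
        for tok in tokens:
            if out and (out[-1], tok) in compound:
                out[-1] = compound[(out[-1], tok)]   # pop the second, fold into the first
            else:
                out.append(tok)
        return out

    # recombine(['echo', 'a', '&', '&', 'echo', 'b'])
    # -> ['echo', 'a', '&&', 'echo', 'b']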
[issue1521950] shlex.split() does not tokenize like the shell
Dan Christian added the comment:

> Sure :) That’s why I suggest using dash for quick tests and rely on the work
> of other people who did read the POSIX spec. I’ll have to check it too
> before committing a patch.

The point of ref_shlex.py is that all shells act the same for common cases and shlex doesn't match any of them. The only real split is that csh-based shells do some things differently than sh-based shells ('2>' vs '&>').

>> shlex uses a series of character strings to drive its parsing: whitespace,
>> escape, quotes. Add another one: control = '();<>|&'. If it is unset (by
>> default?), then the behavior is as before.
> So we would need to add a Shlex subclass to the module to provide the new
> behavior. I think I prefer a new argument, because we can just extend the
> existing class and functions instead of adding subtly differing duplicates.

You don't have to do a subclass (although that might have some advantages). You could do something like:

    def shlex(s, comments=False, posix=True, control=False):
        ...
        if control:
            if control is True:
                self.control = '();<>|&'
            else:
                self.control = control  # let user specify their own control set

>> If it is set, then shlex will output any character in control as a separate
>> token.
> Unless it is part of a quoted segment, right? (See #7611 for 'foo#bar' vs.
> 'foo #bar').

Correct, quotes wouldn't change.

>> There might be a shell specific script (or maybe it's left to the user)
>> that decides that certain tokens can be recombined:
> Seems too much complexity. I really prefer if we agree on one command parsing
> behavior (POSIX, i.e. dash) and improve shlex to support that. People
> wanting zsh rules can write their own subclass.

shlex is a pretty simple lexer (as lexers go), and I wouldn't want it to get complicated. It's easier in the current code structure to split everything and then re-join as needed. This also allows you to select sh vs csh joining rules (e.g. '|&' means different things in sh vs csh). Every shell that I've seen follows one of those two flavors for syntax.

>> '&&', '||', '|&', '>>', etc.
> Wouldn’t it be more correct to consider them different tokens? I don’t have
> a formal training in CS or programming, so I’m not sure that my definition is
> correct at all, but in my mind a token is a unit, and thus & and && are two
> different things.

Ideally, the final tokens have exact meanings. It's easier to write handler code for '&&' than for ('&', '&'). This is just a case of whether the parse joins them together or it's done in a second step. The current code doesn't do much look-ahead, so it's hard for the lexer to produce things like '&&' directly.

-Dan

--

___
Python tracker
<http://bugs.python.org/issue1521950>
___
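For illustration, a rough sketch of how close plain shlex already gets to the "split everything, then re-join" idea when whitespace_split is left off. This is only a sketch under the assumption noted in the comments, not the patch under discussion.

    import shlex

    def split_keeping_control(line):
        # Sketch only: in POSIX mode with whitespace_split left at False,
        # shlex returns characters outside wordchars -- including
        # ( ) ; < > | & -- as individual one-character tokens, while text
        # inside quotes stays intact.
        # Caveat: unquoted words are also broken at '/', '.', '-', etc.,
        # which is why a real fix needs a dedicated control-character set
        # inside the lexer itself.
        lex = shlex.shlex(line, posix=True)
        return list(lex)

    # split_keeping_control('echo "a b"&&echo c')
    # -> ['echo', 'a b', '&', '&', 'echo', 'c']

Feeding the result through the recombination pass sketched earlier then yields the compound operators ('&&', etc.).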
[issue1521950] shlex.split() does not tokenize like the shell
Dan Christian added the comment:

It's been a while since I looked at this. I'm not really in a position to contribute code/tests right now, but I can comment.

I don't think POSIX mode existed when I first reported this, but that's where it makes sense. I think all POSIX shells (Bourne, C, Korn) will behave the same way for the issues mentioned.

There are really two cases in one bug.

The first part is that the shell will split tokens at characters that shlex doesn't. The handling of &, |, ;, >, and < could be done by adjusting the definition of shlex.wordchars. The shell may also understand things like: &&, ||, |&, and >&. The exact definition of these depends on the shell, so maybe it's best to just split them out as separate tokens and let the user figure out the compound meanings.

The proper handling of quotes/escapes requires some kind of new interface. You need to distinguish between tokens that were modified by the quote/escape rules and those that were not. One suggestion is to add a new method as such:

    shlex.get_token2()
        Return a tuple of the token and the original text of the token
        (including quotes and escapes). Otherwise, this is the same as
        shlex.get_token().

Comparing the two values for equality (or maybe identity) would tell you if something special was going on. You can always pass the second value to a reconstructed command line without losing any of the original parsing information.

-Dan

On Fri, Sep 3, 2010 at 10:27 AM, Éric Araujo wrote:
>
> Éric Araujo added the comment:
>
> Thanks for the report. Would you like to work on a patch, or translate your
> examples into unit tests?
>
> The docs do not mention “&” at all, and platform discrepancies have to be
> taken into account too, so I really don’t know if this is a bug fix for the
> normal mode, the POSIX mode, or a feature request requiring a new argument to
> the shlex function to preserve compatibility.
>
> --
> nosy: +eric.araujo, eric.smith
>
> ___
> Python tracker
> <http://bugs.python.org/issue1521950>
> ___

--

___
Python tracker
<http://bugs.python.org/issue1521950>
___
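A hedged sketch of what such a get_token2() interface could look like. The class name and the position-tracking trick are assumptions for illustration only; a robust implementation would have to cope with pushback, line continuations, and file input.

    import shlex

    class RawTrackingLexer(shlex.shlex):
        """Illustrative only: pair each token with the raw source text it
        came from, so callers can tell whether quoting/escaping was applied
        and can rebuild the original command line."""

        def __init__(self, s):
            shlex.shlex.__init__(self, s, posix=True)
            self._source = s

        def get_token2(self):
            # Works only for simple string input: slice the source between
            # the stream positions before and after reading one token.
            start = self.instream.tell()
            token = self.get_token()
            end = self.instream.tell()
            raw = self._source[start:end].strip()
            return token, raw

    # lex = RawTrackingLexer('say "hello world"')
    # lex.get_token2() -> ('say', 'say')
    # lex.get_token2() -> ('hello world', '"hello world"')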
[issue1521950] shlex.split() does not tokenize like the shell
Dan Christian added the comment:

I haven't been following this much. Sorry. My day job isn't in this area any more (and I'm stuck using 2.4 :-().

Looking at the docs, I notice the "old" behaviour is different from what it used to be. Notably: 'e;' gets split into two tokens, and ">'abc';" gets split into 3. I'm pretty sure that baseline code doesn't split those at all. So there is a question of whether "old" is fully backward compatible.

The "new" functionality looks great. That's what I was looking for when I filed the bug. Thank you!

-Dan

--

___
Python tracker
<http://bugs.python.org/issue1521950>
___
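The "new" functionality discussed here corresponds to the punctuation_chars argument that later Python versions (3.6+) provide; a small example, assuming such a version:

    import shlex

    # punctuation_chars=True makes the lexer return runs of ();<>|& as
    # their own tokens instead of leaving them glued to adjacent words.
    lex = shlex.shlex('true && echo ok;echo done', punctuation_chars=True)
    list(lex)
    # -> ['true', '&&', 'echo', 'ok', ';', 'echo', 'done']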