Re: [Tutor] My problem in simple terms

2019-03-22 Thread Edward Kanja
Greetings,
I'm referring to the question I sent earlier. If you have a hint on how I
can solve my problem, I would really appreciate it. After running regular
expressions in Python, my output has a lot of square brackets, i.e.
[][][][][][][][][]. How do I substitute these with an empty string so as to
have clean output, which I will later export to an Excel file?
Thanks a lot.



Regards,
Kanja Edward.
P.O.BOX 1203-00300,
NAIROBI.
*+254720724793*
www.linkedin.com/in/edward-kanja-bba16a106 


On Mon, Mar 4, 2019 at 2:44 PM Edward Kanja wrote:

> Hi there,
> Earlier I sent an email asking how to use the re.sub function to eliminate
> square brackets. I have simplified the statements. The attached text file
> named unon.Txt has the data I'm extracting from. The file named code.txt has
> the code I'm using to extract the data. The regular expression works fine,
> but my output has too many square brackets. How do I do away with them?
> Thanks.
>
>
>
> Regards,
> Kanja Edward.
> P.O.BOX 1203-00300,
> NAIROBI.
> *+254720724793*
> www.linkedin.com/in/edward-kanja-bba16a106 
>
___
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
https://mail.python.org/mailman/listinfo/tutor


Re: [Tutor] My problem in simple terms

2019-03-22 Thread Mats Wichmann
On 3/21/19 11:54 PM, Edward Kanja wrote:
> Greetings,
> I'm referring to the question I sent earlier. If you have a hint on how I
> can solve my problem, I would really appreciate it. After running regular
> expressions in Python, my output has a lot of square brackets, i.e.
> [][][][][][][][][]. How do I substitute these with an empty string so as to
> have clean output, which I will later export to an Excel file?
> Thanks a lot.

I think you got the key part of the answer already: you're getting empty
lists as matches, which when printed, look like []. Let's try to be more
explicit:

$ python3
Python 3.7.2 (default, Jan 16 2019, 19:49:22)
[GCC 8.2.1 20181215 (Red Hat 8.2.1-6)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> help(re.findall)

Help on function findall in module re:

findall(pattern, string, flags=0)
    Return a list of all non-overlapping matches in the string.

    If one or more capturing groups are present in the pattern, return
    a list of groups; this will be a list of tuples if the pattern
    has more than one group.

    Empty matches are included in the result.


re.findall *always* returns a list, even if there is no match.  If we
add more debug prints in your code so it looks like this:


import re

with open('unon.txt') as csvfile:
    for line in csvfile:
        print("line=", line)
        index_no = re.findall(r'(\|\s\d{5,8}\s)', line)
        print("index_no (type %s)" % type(index_no), index_no)

        names = re.findall(r'(\|[A-Za-z]\w*\s\w*\s\w*\s\w*\s)', line)
        print("names (type %s)" % type(names), names)
        #Address=re.findall(r'\|\s([A-Z0-9-,/]\w*\s\w*\s)',line)

        duty_station = re.findall(r'\|\s[A-Z]*\d{2}\-\w\w\w\w\w\w\w\s', line)
        print("duty_station (type %s)" % type(duty_station), duty_station)


You can easily see what happens as your data is processed. I ran this
on your data file and the first few passes through the loop look like this:

line=
--

index_no (type <class 'list'>) []
names (type <class 'list'>) []
duty_station (type <class 'list'>) []
line= |Rawzeea NLKPP | VE11-Nairobi
   | 20002254-MADIZ| 00   |
00   |Regular Scheme B | 15-JAN-2019 To 31-DEC-2019 | No   |

index_no (type <class 'list'>) []
names (type <class 'list'>) ['|Rawzeea NLKPP   ']
duty_station (type <class 'list'>) ['| VE11-Nairobi ']
line=
||

index_no (type <class 'list'>) []
names (type <class 'list'>) []
duty_station (type <class 'list'>) []


You see each result of re.findall has given you a list, and most are
empty.  The first and third lines are separators, containing no useful
data, and you get no matches at all. The second line provided you with a
match for "names" and for "duty_station", but not for "index_no".  Your
code will need to be prepared for those sorts of outcomes.
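
One way to prepare for those outcomes, as a minimal sketch: treat an empty
list as "no value" and substitute an empty string, so no [] ever reaches the
output. The helper name first_or_empty and the sample lines are made up for
illustration; the two patterns are the ones from the code above.

```python
import re

def first_or_empty(pattern, line):
    """Return the first match of pattern in line, or '' if there is none."""
    matches = re.findall(pattern, line)   # always a list, possibly empty
    # Trim the leading "|" and surrounding whitespace from the match.
    return matches[0].strip("| \t\n") if matches else ""

# Hypothetical sample lines, loosely modelled on the data shown above.
sample_lines = [
    "------------------------------\n",
    "|Rawzeea NLKPP   | VE11-Nairobi   | 20002254-MADIZ| 00   |\n",
]

rows = []
for line in sample_lines:
    name = first_or_empty(r'(\|[A-Za-z]\w*\s\w*\s\w*\s\w*\s)', line)
    duty_station = first_or_empty(r'\|\s[A-Z]*\d{2}\-\w\w\w\w\w\w\w\s', line)
    if not (name or duty_station):
        continue   # separator line: nothing matched, so skip it
    rows.append((name, duty_station))

print(rows)
```

Rows built this way hold plain strings, which are easier to hand to a CSV or
Excel writer than lists.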

Just looking at the data, it's table data, presumably from a
spreadsheet, but does not really present in a format that is easy to
process, because individual lines are not complete.   A separator line
with all dashes seems to be the marker between complete entries, which
then take up 14 lines, including additional marker lines which follow
slightly different patterns - they may contain | marks or leading spaces.
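
If the all-dash line really is the marker between entries, the lines could
be grouped into one record per entry before any fields are extracted. This
is only a sketch under that assumption; the function name split_records and
the sample data are hypothetical, and the real file may need a looser test
for what counts as a separator.

```python
def split_records(lines):
    """Group lines into records, using all-dash lines as separators."""
    records, current = [], []
    for line in lines:
        stripped = line.strip()
        if stripped and set(stripped) == {"-"}:   # line made only of dashes
            if current:
                records.append(current)
            current = []
        elif stripped:                            # ignore blank lines
            current.append(line.rstrip("\n"))
    if current:
        records.append(current)
    return records

# Hypothetical miniature input with two entries.
sample = [
    "-----\n",
    "|A | B |\n",
    "|  |\n",
    "-----\n",
    "|C | D |\n",
]
print(split_records(sample))
```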

You will need to decide how regular your table data is and how to work
with it. Most examples of handling table data assume that one row is a
complete entry, so you probably won't find a lot of information on this.
In your case I'm looking at line 2 containing 8 fields, line 4
containing 9 fields, line 6 containing 10 fields, and then lines 8-14
being relatively free-form, consisting of multiple lines.

Is there any chance you can generate your data file in a different way
to make it easier to process?





[Tutor] (no subject)

2019-03-22 Thread Matthew Herzog
I have a Python3 script that reads the first eight characters of every
filename in a directory in order to determine whether the file was created
before or after 180 days ago based on each file's name. The file names all
begin with YYYYMMDD or erased_YYYYMMDD_etc.xls. I can collect all these
filenames already.
I need to tell my script to ignore any filename that does not conform to
the standard eight leading numerical characters, example: 20180922 or
erased_20171207_1oIkZf.so.
Here is my code.

if name.startswith('scrubbed_'):
    fileDate = datetime.strptime(name[9:17], DATEFMT).date()
else:
    fileDate = datetime.strptime(name[0:8], DATEFMT).date()

I need logic to prevent the script from 'choking' on files that don't fit
these patterns. The script needs to carry on with its work and forget about
non-conformant filenames. Do I need to add code that causes an exception or
just add an elif block?

Thanks.


-- 
The plural of anecdote is not "data."


Re: [Tutor] (no subject)

2019-03-22 Thread Alan Gauld via Tutor

On 22/03/19 21:45, Matthew Herzog wrote:


> I need to tell my script to ignore any filename that does not conform to
> the standard eight leading numerical characters, example: 20180922 or
> erased_20171207_1oIkZf.so.


Normally we try to dissuade people from using regex when string methods 
will do but in this case a regex sounds like it might be the best option.


A single if statement should suffice:

if re.match("[0-9]{8}|erased_", fname):
    ...  # your code here

There are other ways to write the regex but the above should
be clear...
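
As one of those other ways, here is a sketch of a slightly stricter pattern
that also requires eight digits after the prefix and captures them for later
use. The pattern, the name DATE_RE and the helper leading_date are
illustrative, not from the thread, and the "erased_" prefix is assumed from
the question text (the posted code tests for 'scrubbed_' instead).

```python
import re

# Optional "erased_" prefix, then eight digits captured as group 1.
# re.match anchors at the start of the string, so no "^" is needed.
DATE_RE = re.compile(r"(?:erased_)?(\d{8})")

def leading_date(fname):
    """Return the eight-digit date string at the start of fname, or None."""
    m = DATE_RE.match(fname)
    return m.group(1) if m else None

# Hypothetical file names for illustration.
for fname in ["20180922", "erased_20171207_1oIkZf.so",
              "erased_notes.txt", "README"]:
    print(fname, "->", leading_date(fname))
```

Returning None for non-conforming names lets the caller test the result
with a plain if statement and skip the file.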

HTH,

Alan G.



Re: [Tutor] (no subject)

2019-03-22 Thread Cameron Simpson

On 22Mar2019 17:45, Matthew Herzog wrote:

> I have a Python3 script that reads the first eight characters of every
> filename in a directory in order to determine whether the file was created
> before or after 180 days ago based on each file's name. The file names all
> begin with YYYYMMDD or erased_YYYYMMDD_etc.xls. I can collect all these
> filenames already.
> I need to tell my script to ignore any filename that does not conform to
> the standard eight leading numerical characters, example: 20180922 or
> erased_20171207_1oIkZf.so.
> Here is my code.
>
> if name.startswith('scrubbed_'):
>     fileDate = datetime.strptime(name[9:17], DATEFMT).date()
> else:
>     fileDate = datetime.strptime(name[0:8], DATEFMT).date()
>
> I need logic to prevent the script from 'choking' on files that don't fit
> these patterns. The script needs to carry on with its work and forget about
> non-conformant filenames. Do I need to add code that causes an exception or
> just add an elif block?


Just an elif. Untested example:

for name in all_the_filenames:
    if name.startswith('erased_') and has 8 digits after that:
        extract date after "erased" ...
    elif name starts with 8 digits:
        extract date at the start
    else:
        print("skipping", repr(name))
        continue
    ... other stuff using the extracted date ...

The "continue" causes execution to go straight to the next loop 
iteration, effectively skipping the rest of the loop body. An exception 
won't really do what you want (well, not neatly).
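
A runnable version of that outline might look like the sketch below. It
uses string methods rather than a regexp. The value of DATEFMT is an
assumption (the thread never shows it), the prefix follows the question text
rather than the posted code, and the helper eight_digits and the sample file
names are invented for illustration.

```python
from datetime import datetime

DATEFMT = "%Y%m%d"   # assumed; the thread never shows the value of DATEFMT
PREFIX = "erased_"   # prefix as written in the question text

def eight_digits(s):
    """True if s is exactly eight ASCII digits."""
    return len(s) == 8 and s.isascii() and s.isdigit()

# Hypothetical file names for illustration.
all_the_filenames = ["20180922", "erased_20171207_1oIkZf.so", "notes.txt"]

file_dates = {}
for name in all_the_filenames:
    after_prefix = name[len(PREFIX):len(PREFIX) + 8]
    if name.startswith(PREFIX) and eight_digits(after_prefix):
        digits = after_prefix
    elif eight_digits(name[:8]):
        digits = name[:8]
    else:
        print("skipping", repr(name))
        continue          # go straight to the next file name
    # Only reached for conforming names.
    file_dates[name] = datetime.strptime(digits, DATEFMT).date()

print(file_dates)
```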


Alan's suggestion of a regexp may be a sensible way to test filenames
for conformance to your rules. \d{8} matches 8 digits. It doesn't do any
tighter validation such as sane year, month or day values: any run of
eight digits would be accepted. Which may be ok, and it should certainly
be ok for your first attempt: tighten things up later.
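
When the time comes to tighten things up, one option is to let
datetime.strptime do the validation, since it raises ValueError for digit
strings that are not real dates. A minimal sketch, assuming DATEFMT is
"%Y%m%d" (the thread does not show its value); the helper name parse_date is
hypothetical.

```python
from datetime import datetime

DATEFMT = "%Y%m%d"   # assumed format; not shown in the thread

def parse_date(digits):
    """Return a date for an eight-digit string, or None if it is not a real date."""
    try:
        return datetime.strptime(digits, DATEFMT).date()
    except ValueError:
        # strptime rejects impossible values such as month 13 or 30 February.
        return None

print(parse_date("20180922"))   # a real date
print(parse_date("20180230"))   # 30 February does not exist
```

A None result can then be handled the same way as a non-conforming name:
report it and move on to the next file.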


Cheers,
Cameron Simpson 