Re: [Tutor] how to extract data only after a certain condition is met

bob gailer Sun, 10 Oct 2010 14:30:18 -0700

 Emile beat me to it, but here goes anyway...

On 10/10/2010 3:35 PM, Josep M. Fontana wrote:

Hi,
First let me apologize for taking so long to acknowledge your answers and to thank you (Eduardo, Peter, Greg, Emile, Joel and Alan, sorry if I left anyone) for your help and your time.
One of the reasons I took so long in responding (besides having gotten busy with some urgent matters related to my work) is that I was a bit embarrassed at realizing how poorly I had defined my problem. As Alan said, I should at least have told you which operations were giving me a headache. So I went back to my Python reference books to try to write some code and thus be able to define my problems more precisely. Only after I did that, I said to myself, I would come back to the list with more specific questions.
The only problem is that doing this made me painfully aware of how little Python I know. Well, actually my problem is not so much that I don't know Python as that I have very little experience programming in general. Some years ago I learned a little Perl and basically I used it to do some text manipulation using regular expressions but that's all my experience. In order to learn Python, I read a book called "Beginning Python: From Novice to Professional" and I was hoping that just by starting to use the knowledge I had supposedly acquired by reading that book to solve real problems related to my project I would learn. But this turned out to be much more difficult than I had expected. Perhaps if I had worked through the excellent book/tutorial Alan has written (of which I was not aware when I started), I would be better prepared to confront this problem.
Anyway (sorry for the long intro), since Emile laid out the problem very clearly, I will use his outline to point out the problems I'm having:
Emile says:
--------------
Conceptually, you'll need to:

  -a- get the list of file names to change then for each
  -b- determine the new name
  -c- rename the file

For -a- you'll need glob. For -c- use os.rename.  -b- is a bit more
involved.  To break -b- down:

  -b1- break out the x-xx portion of the file name
  -b2- look up the corresponding year in the other file
  -b3- convert the year to the century-half structure
  -b4- put the pieces together to form the new file name

For -b2- I'd suggest building a dictionary from your second files
contents as a first step to facilitate the subsequent lookups.

---------------------
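Steps -a- and -c- of the outline above might be sketched roughly as follows. This is only an illustration written for Python 3, not code from the thread: the function name rename_all and the new_name_for callback are hypothetical, and the callback simply stands in for step -b-.

```python
import glob
import os


def rename_all(directory, new_name_for):
    """Rename every .txt file in directory (hypothetical helper).

    new_name_for stands in for step -b-: it takes the old base name
    and returns the new one, or None to leave the file alone.
    """
    renamed = []
    # -a- get the list of file names to change
    for old_path in sorted(glob.glob(os.path.join(directory, '*.txt'))):
        old_name = os.path.basename(old_path)
        # -b- determine the new name
        new_name = new_name_for(old_name)
        if new_name is None or new_name == old_name:
            continue
        # -c- rename the file
        os.rename(old_path, os.path.join(directory, new_name))
        renamed.append((old_name, new_name))
    return renamed
```

Because glob.glob returns a complete list before the loop starts, renaming files inside the loop does not disturb the iteration.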
OK. Let's start with -b- . My first problem is that I don't really know how to go about building a dictionary from the file with the comma separated values. I've discovered that if I use a file method called 'readlines' I can create a list whose elements would be each of the lines contained in the document with all the codes followed by comma followed by the year. Thus if I do:
fileNameCentury = open(r'/Volumes/DATA/Documents/workspace/GCA/CORPUS_TEXT_LATIN_1/FileNamesYears.txt').readlines()
Where 'FileNamesYears.txt' is the document with the following info:

A-01, 1278
A-02, 1501
...
N-09, 1384
I get a list of the form ['A-01,1374\rA-02,1499\rA-05,1449\rA-06,1374\rA-09, ...]

I'm guessing that you are running on a Linux system and that the file came from a Mac. This is based on the fact that \r appears in the string instead of acting as a line separator.


Regardless -
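dct = {}

# readlines() returned a one-element list here, so split its only string on '\r'
fileNameCentury = fileNameCentury[0].split('\r') # gives you ['A-01,1374','A-02,1499', 'A-05,1449', 'A-06,1374', 'A-09, ...]

for pair in fileNameCentury:
  if not pair.strip():
    continue # skip a blank entry, e.g. after a trailing '\r'
  key,value = pair.split(',')
  dct[key.strip()] = value.strip() # strip the blank that may follow the comma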

Greg mentioned the csv module. I checked the references but I could not see any way in which I could create a dictionary using that module.

True - the csv reader is just another way to get the list of pairs.
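To illustrate that point, the pairs that csv.reader yields can be put straight into a dictionary. The following is a sketch for Python 3 and assumes the two-column layout shown in the sample (code, then year); the function name load_years is made up for the example.

```python
import csv


def load_years(path):
    """Build {code: year} from a two-column file such as 'A-01, 1278'."""
    years = {}
    # newline='' leaves line-ending handling to the csv module
    with open(path, newline='') as handle:
        # skipinitialspace drops the blank that follows each comma
        for row in csv.reader(handle, skipinitialspace=True):
            if len(row) != 2:
                continue  # skip blank or malformed rows
            code, year = row
            years[code.strip()] = int(year)
    return years
```

Opening the file this way should also cope with the old Mac-style \r line endings discussed above, since Python recognizes \r, \n and \r\n as line ends when reading.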

Once I have the dictionary built, what I would have to do is use the os module (or would it be the glob module?) to get a list of the file names I want to change and build another loop that would iterate over those file names and, if the first part of the name (possibly represented by a regular expression of the form r'[A-Z]-[0-9]+') matches one of the keys in the dictionary, then a) it would get the value for that key, b) would do the numerical calculation to determine whether it is the first part of the century or the second part and c) would insert the string representing this result right before the extension .txt.
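Steps a) and b) of that plan could look roughly like the sketch below. It follows the rule given in the original posting (first two digits + 1 for the century, last two digits above 50 for the second half). The names century_half and label_for are hypothetical, and treating a year ending in exactly 50 as the first half is an assumption, since the posting does not cover that case.

```python
import re

# Pattern quoted in the message above; it is matched at the start of the name.
CODE_RE = re.compile(r'[A-Z]-[0-9]+')


def century_half(year):
    """Turn a year such as 1278 into a label such as '13-2'."""
    century = year // 100 + 1            # first two digits + 1
    half = 2 if year % 100 > 50 else 1   # assumption: exactly 50 -> first half
    return '%d-%d' % (century, half)


def label_for(file_name, years):
    """Return the label for a file name, or None if its code is unknown."""
    match = CODE_RE.match(file_name)
    if match is None or match.group() not in years:
        return None
    # int() accepts both the string values built above and plain integers
    return century_half(int(years[match.group()]))
```

With the sample data, century_half(1278) gives '13-2' and century_half(1501) gives '16-1'.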
In the abstract it sounds easy, but I don't even know how to start. Doing some testing with glob I see that it returns a list of strings representing the whole paths to all the files whose names I want to manipulate. But in the reference documents that I have consulted, I see no way to change those names. How do I go about inserting the information about the century right before the substring '.txt'?

Suppose fn = "blah.txt"
fn2 = f
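The line above is cut off in the archive, so the following is not necessarily what was originally written. One common way to place a label just before the extension is os.path.splitext; the helper name add_label is made up for this sketch.

```python
import os


def add_label(path, label):
    """Insert '_<label>' before the extension: blah.txt -> blah_13-2.txt."""
    root, ext = os.path.splitext(path)
    return '%s_%s%s' % (root, label, ext)


fn = "blah.txt"
fn2 = add_label(fn, "13-2")
print(fn2)  # blah_13-2.txt
```

The new name can then be handed to os.rename(fn, fn2). Since splitext only splits off the final extension, the same call works on the full paths that glob returns.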

As you see, I am very green. My embarrassment at realizing how basic my problems were made me delay writing another message but I decided that if I don't do it, I will never learn.


Again, thanks so much for all your help.

Josep M.

    Message: 2
    Date: Sat, 2 Oct 2010 17:56:53 +0200
    From: "Josep M. Fontana" <josep.m.font...@gmail.com>
    To: tutor@python.org
    Subject: [Tutor] Using contents of a document to change file names
    Message-ID:
    <aanlktikjofyhiel70e=-bae_pedc0ng+igy3j+qo+...@mail.gmail.com>
    Content-Type: text/plain; charset="iso-8859-1"

    Hi,

    This is my first posting to this list. Perhaps this has a very easy
    answer but before deciding to post this message I consulted a bunch of
    Python manuals and on-line reference documents to no avail. I would be
    very grateful if someone could lend me a hand with this.

    Here's the problem I want to solve. I have a lot of files with the
    following name structure:

    A-01-namex.txt
    A-02-namey.txt
    ...
    N-09-namez.txt

    These are different text documents that I want to process for an NLP
    project I'm starting. Each one of the texts belongs to a different
    century and it is important to be able to include the information
    about the century in the name of the file as well as inside the text.

    Then I have another text file containing information about the century
    each one of the texts was written. This document has the following
    structure:

    A-01, 1278
    A-02, 1501
    ...
    N-09, 1384

    What I would like to do is to write a little script that would do the
    following:

    . Read each row of the text containing information about the centuries
      each one of the texts was written
    . Change the name of the file whose name starts with the code in the
      first column in the following way

           A-01-namex.txt --> A-01-namex_13-2.txt

       Where 13-2 means: 13th century, 2nd half. Obviously this information
    would come from the second column in the text: 1278 (the first two
    digits + 1 = century; if the 3rd and 4th digits > 50, then 2; if < 50
    then 1)

    Then in the same script or in a new one, I would need to open each one
    of the texts and add information about the century they were written
    on the first line preceded by some symbol (e.g. @13-2)

    I've found a lot of information about changing file names (so I know
    that I should be importing the os module), but none of the examples
    that were cited involved getting the information for the file changing
    operation from the contents of a document.

    As you can imagine, I'm pretty green in Python programming and I was
    hoping the learn by doing method would work.  I need to get on with
    this project, though, and I'm kind of stuck. Any help you guys can
    give me will be very helpful.

    Josep M.




_______________________________________________
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor



--
Bob Gailer
919-636-4239
Chapel Hill NC

