Re: [R] Text Input from a Non Delimited File

Duncan Murdoch Sun, 09 Feb 2014 15:47:06 -0800

On 14-02-09 5:56 PM, Burhan ul haq wrote:

Hi,


Minor Additions:

The original file was as follows:

##  -------------------------------------------------------------------
GunPos RaceNo Name Gender Cat Club GunTime ChipPos ChipTime
1 10038 Carl Allwood M Sutton & Ashﬁeld Harriers 02:38:40 1 02:38:40
2 10098 Adam Holland M Votwo/USN 02:41:25 2 02:41:25
3 13007 Pumlani Bangani M 02:43:23 3 02:43:23
4 10028 Anthony Jackson M Sittingbourne Striders 02:44:39 4 02:44:39
5 10187 Peter Stockdale M 02:45:26 5 02:45:25
6 10064 Jared Bethell M Harlow RC 02:46:43 6 02:46:40
7 13003 Sarah Harris F 35 Long Eaton RC 02:47:47 7 02:47:44
8 13009 Rod Harris M 02:47:47 8 02:47:45
9 10033 Carl Sommer M Huncote Harriers 02:47:59 9 02:47:58
10 10037 Peter Swaine M Charnwood AC 02:49:28 10 02:49:27
11 10048 Pavel Toropov M 02:50:41 11 02:50:41
12 10008 Derek Dunne M 45 Treasury Running Club 02:51:42 12 02:51:40
13 10044 Matthew Nutt M Scunthorpe 02:52:20 13 02:52:15
14 10380 Ludovic Renou M 02:53:37 14 02:53:34
15 10056 Alex Keenan M 02:53:48 15 02:53:47
##  -------------------------------------------------------------------

Available here:
http://www.coltishalljaguars.co.uk/wp-content/uploads/2011/09/Robin-hood2011.pdf

I am able to match a single entry with the regular expression:
^(\d+),(\d+),( )(.)*(M |F )(.)*(\d{2}):(\d{2}):(\d{2})( )(\d{1,})( )(\d{2}):(\d{2}):(\d{2})

But I am unable to use the back-reference mechanism properly to put
commas in as delimiters.
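
For comparison, a whole-line match like the one above can also be applied inside R, where regexec() and regmatches() return every capture group at once. This is only a sketch under assumptions: the pattern is a simplified variant of the expression above (not the same one), gender is taken to be a standalone "M" or "F", and Cat and Club are kept together in one group. The sample lines are copied from the comma-prefixed listing in this thread.

```r
# Two sample lines in the "1,10038, Name ..." layout from the post.
lines <- c(
  "2,10098, Adam Holland M Votwo/USN 02:41:25 2 02:41:25",
  "7,13003, Sarah Harris F 35 Long Eaton RC 02:47:47 7 02:47:44"
)

time_re <- "\\d{2}:\\d{2}:\\d{2}"

# Groups: GunPos, RaceNo, Name, Gender, Cat+Club, GunTime, ChipPos, ChipTime.
# The quantifiers sit inside the groups, (.*?), so each group keeps its full text.
pattern <- paste0(
  "^(\\d+),(\\d+), (.*?) ([MF]) ?(.*?) ?(", time_re, ") (\\d+) (", time_re, ")$"
)

# regexec() reports the whole match plus one entry per capture group.
m <- regmatches(lines, regexec(pattern, lines, perl = TRUE))

# Drop the whole-match entry and stack the groups into a character matrix.
fields <- do.call(rbind, lapply(m, function(x) x[-1]))

# Re-join each row with commas to get CSV-style lines.
csv_lines <- apply(fields, 1, paste, collapse = ",")
print(csv_lines)
```

A line that does not match the pattern yields an empty result, so real data would need a check for that before the rows are stacked.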

I believe "regular expressions" pertain to R as much as they do to
Sublime, but please let me know, if I should be posting this to
"sublime" forum.

I would do the field extraction in R. Read the file using readLines(), then use regular expressions to extract the fields one at a time. You could identify them all in one RE, but why not break it down into simpler problems?


By field extraction, I mean things like this:

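lines <- readLines(...)
# First field: drop everything from the first comma onward
field1 <- sub(",.*", "", lines)
# Second field: keep the run of digits enclosed by commas
field2 <- sub(".*,(\\d+),.*", "\\1", lines)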

etc.
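
A fuller sketch of the same one-field-at-a-time idea, applied to the original space-separated lines, might look like the following. It is an illustration under assumptions rather than a tested solution for the whole file: the variable names are made up for this example, gender is assumed to be the first standalone "M" or "F" token, and Cat is assumed to be a purely numeric token directly after the gender. The numeric fields are peeled off the front, the two times and ChipPos are peeled off the end, and the remainder is split around the gender.

```r
# Sample lines copied from the original listing in this thread.
lines <- c(
  "2 10098 Adam Holland M Votwo/USN 02:41:25 2 02:41:25",
  "3 13007 Pumlani Bangani M 02:43:23 3 02:43:23",
  "7 13003 Sarah Harris F 35 Long Eaton RC 02:47:47 7 02:47:44",
  "12 10008 Derek Dunne M 45 Treasury Running Club 02:51:42 12 02:51:40"
)

time_re <- "\\d{2}:\\d{2}:\\d{2}"

# Leading fields: anchor at the start of the line.
gun_pos <- sub("^(\\d+) .*", "\\1", lines)
race_no <- sub("^\\d+ (\\d+) .*", "\\1", lines)

# Trailing fields: anchor at the end of the line.
tail_re   <- paste0("^.* (", time_re, ") (\\d+) (", time_re, ")$")
gun_time  <- sub(tail_re, "\\1", lines)
chip_pos  <- sub(tail_re, "\\2", lines)
chip_time <- sub(tail_re, "\\3", lines)

# Strip both ends; what remains is "Name Gender [Cat] [Club]".
middle <- sub("^\\d+ \\d+ ", "", lines)
middle <- sub(paste0(" ", time_re, " \\d+ ", time_re, "$"), "", middle)

# Name: everything before the first standalone M or F token (assumption).
name   <- sub(" [MF]( .*)?$", "", middle)
rest   <- substring(middle, nchar(name) + 2)   # starts at the gender letter
gender <- substr(rest, 1, 1)
after_gender <- substring(rest, 3)             # "" when nothing follows

# Cat: an optional leading all-digit token (assumption); Club: whatever is left.
has_cat  <- grepl("^\\d+( |$)", after_gender)
category <- ifelse(has_cat, sub("^(\\d+).*$", "\\1", after_gender), "")
club     <- sub("^\\d+( |$)", "", after_gender)

result <- data.frame(
  GunPos = as.integer(gun_pos), RaceNo = as.integer(race_no),
  Name = name, Gender = gender, Cat = category, Club = club,
  GunTime = gun_time, ChipPos = as.integer(chip_pos), ChipTime = chip_time,
  stringsAsFactors = FALSE
)
print(result)
```

A name that itself contains a standalone "M" or "F" (a middle initial, say) would be split too early, so the result is worth checking against the source before it is trusted.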

Duncan Murdoch




\\Cheers


On Mon, Feb 10, 2014 at 3:48 AM, Burhan ul haq <ulh...@gmail.com> wrote:

Hi,

I am trying to read in a file, which is not delimited by any specific
characters.

Something as follows:
##  -------------------------------------------------------------------
GunPos RaceNo Name Gender Cat Club GunTime ChipPos ChipTime
1,10038, Carl Allwood M Sutton & Ashﬁeld Harriers 02:38:40 1 02:38:40
2,10098, Adam Holland M Votwo/USN 02:41:25 2 02:41:25
3,13007, Pumlani Bangani M 02:43:23 3 02:43:23
4,10028, Anthony Jackson M Sittingbourne Striders 02:44:39 4 02:44:39
5,10187, Peter Stockdale M 02:45:26 5 02:45:25
6,10064, Jared Bethell M Harlow RC 02:46:43 6 02:46:40
7,13003, Sarah Harris F 35 Long Eaton RC 02:47:47 7 02:47:44
8,13009, Rod Harris M 02:47:47 8 02:47:45
9,10033, Carl Sommer M Huncote Harriers 02:47:59 9 02:47:58
10,10037, Peter Swaine M Charnwood AC 02:49:28 10 02:49:27
11,10048, Pavel Toropov M 02:50:41 11 02:50:41
12,10008, Derek Dunne M 45 Treasury Running Club 02:51:42 12 02:51:40
13,10044, Matthew Nutt M Scunthorpe 02:52:20 13 02:52:15
14,10380, Ludovic Renou M 02:53:37 14 02:53:34
15,10056, Alex Keenan M 02:53:48 15 02:53:47
##  -------------------------------------------------------------------


As I failed to read it in via R or Excel, I used a text editor with
regular expressions, Sublime to be exact. I was trying to convert it
to CSV format, and succeeded in putting commas in for the first two
entries, as follows:

##  -------------------------------------------------------------------
GunPos RaceNo Name Gender Cat Club GunTime ChipPos ChipTime
1,10038, Carl Allwood ,M ,Sutton & Ashﬁeld Harriers 02:38:40 1 02:38:40
2,10098, Adam Holland ,M ,Votwo/USN 02:41:25 2 02:41:25
3,13007, Pumlani Bangani ,M ,02:43:23 3 02:43:23
4,10028, Anthony Jackson ,M ,Sittingbourne Striders 02:44:39 4 02:44:39
5,10187, Peter Stockdale ,M ,02:45:26 5 02:45:25
6,10064, Jared Bethell ,M ,Harlow RC 02:46:43 6 02:46:40
7,13003, Sarah Harris ,F ,35 Long Eaton RC 02:47:47 7 02:47:44
8,13009, Rod Harris ,M ,02:47:47 8 02:47:45
9,10033, Carl Sommer ,M ,Huncote Harriers 02:47:59 9 02:47:58
10,10037, Peter Swaine ,M ,Charnwood AC 02:49:28 10 02:49:27
11,10048, Pavel Toropov ,M ,02:50:41 11 02:50:41
12,10008, Derek Dunne ,M ,45 Treasury Running Club 02:51:42 12 02:51:40
13,10044, Matthew Nutt ,M ,Scunthorpe 02:52:20 13 02:52:15
14,10380, Ludovic Renou ,M ,02:53:37 14 02:53:34
15,10056, Alex Keenan ,M ,02:53:48 15 02:53:47
##  -------------------------------------------------------------------

After that I am stuck. I searched for the expression:
(.)*(\d{2}:\d{2}:\d{2})( )
and replaced it with \1,\2,\3, which gave:

##  -------------------------------------------------------------------
GunPos RaceNo Name Gender Cat Club GunTime ChipPos ChipTime
,02:38:40, 1 02:38:40
  ,02:41:25, 2 02:41:25
##  -------------------------------------------------------------------

How do I fix the regular expression here? If you examine the later
entries, some names contain a hyphen or have three parts, so other
approaches do not work well.
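
One likely reason for the lost text: in (.)* the star repeats a one-character group, so \1 ends up holding only the last character matched, not the whole run. Putting the star inside the group, (.*), keeps everything. The sketch below reproduces both behaviours in R on one line from the listing above; it uses perl = TRUE so that the back-reference semantics are PCRE's, which is an assumption about how closely the editor's engine behaves.

```r
line <- "2,10098, Adam Holland ,M ,Votwo/USN 02:41:25 2 02:41:25"
time_re <- "\\d{2}:\\d{2}:\\d{2}"

# Original attempt: the group is repeated, so \1 keeps only its last iteration
# and the name and club disappear from the output.
broken <- sub(paste0("(.)*(", time_re, ")( )"), "\\1,\\2,\\3", line, perl = TRUE)

# Star inside the group: \1 now holds everything before GunTime.
# Only GunTime is followed by a space, so the greedy match stops there.
step1 <- sub(paste0("(.*) (", time_re, ") "), "\\1,\\2,", line, perl = TRUE)

# Separate ChipPos from ChipTime at the end of the line.
fixed <- sub(paste0(",(\\d+) (", time_re, ")$"), ",\\1,\\2", step1, perl = TRUE)

print(broken)
print(fixed)
```

The same change, (.*) in place of (.)*, should carry over to the search field of the editor, with \1 in the replacement then referring to the full prefix.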

Secondly, is there a better way to handle this problem? The original
input file is in PDF format. I copied the text and made a txt file out
of it.

The input txt file is attached.

Thanks in advance for any suggestions.

\\Cheers


______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

