Hello J.R.M. Hosking, charToRaw() works perfectly, thank you:
> charToRaw(as.character(moose[1, "V3"])) [1] 24 38 38 30 2c 33 37 30 c2 a0 gsub("[[:space:]]", "", ...) did not remove them, but now I know what they are (hex: c2 a0) I can remove them with gsub() by: > gsub("[$,\xc2\xa0]", "", as.character(moose[1, "V3"])) [1] "880370" Kind regards, -Mark- 2010/8/24 J. R. M. Hosking <jrmh...@gmail.com> > On 2010-08-23 11:03, Mark Breman wrote: > >> Hello everyone, >> >> I am reading a HTML table from a website with readHTMLTable() from the XML >> package: >> >> library(XML) >>> moose = readHTMLTable("http://www.decisionmoose.com/Moosistory.html", >>> >> header=FALSE, skip.rows=c(1,2), trim=TRUE)[[1]] >> >>> moose >>> >> V1 V2 V3 >> 1 07.02.2010 SWITCH to Long Bonds\n (BTTRX) $880,370 >> 2 05.07.2010 Switch to Gold (GLD) $878,736 >> 3 03.05.2010 Switch to US Small-cap Equities (IWM) $895,676 >> 4 01.22.2010 Switch to Cash (3moT) $895,572 >> ..... truncated by me! >> >> I am interested in the values in the third column: >> >> as.character(moose$V3) >>> >> [1] "$880,370 " "$878,736 " "$895,676 " "$895,572 " "$932,139 " >> "$932,131 " "$1,013,505 " "$817,451 " "$817,082 " "$848,133" >> [11] "$904,527 " " $903,981 " "$902,582 " "$896,170 " "$809,853 " >> " >> $808,852 " " $807,409 " "$802,658 " "$747,629 " "$672,465 " >> [21] " $671,826 " "$645,352 " "$615,174 " "$609,415 " " $590,664 " >> " >> $586,785 " "$561,056 " "$537,307 " " $535,744 " " $552,712 " >> [31] "$551,615 " " $508,790 " "$501,161 " "$499,023 " " $446,568 " >> "$423,727 " "$421,967 " "$396,007 " "$395,943 " " $270,011 " >> [41] "$264,386 " "$278,513 " "$251,855 " "$251,685 " " $129,198 " >> "$127,541 " "$117,381 " "$100,000 " " " " $275,417" >> [51] "$266,459" " $214,552" "$207,312" "$173,557" "$167,647" >> "$150,516" "$135,842" "$126,667" "$131,642" "$113,804" >> [61] "$107,364" "$108,242" " $102,881" " $100,000" >> >> Notice the spaces leading and lagging some of the values. >> >> I want to get the values as numeric values, so I try to get rid of the >> $-character and comma's with gsub() and a regular expression: >> >> gsub("[$,]", "", as.character(moose$V3)) >>> >> [1] "880370 " "878736 " "895676 " "895572 " "932139 " "932131 " >> "1013505 " "817451 " "817082 " "848133 " "904527 " " 903981 " "902582 >> " >> [14] "896170 " "809853 " " 808852 " " 807409 " "802658 " "747629 " >> "672465 " " 671826 " "645352 " "615174 " "609415 " " 590664 " " >> 586785 >> " >> [27] "561056 " "537307 " " 535744 " " 552712 " "551615 " " 508790 " >> "501161 " "499023 " " 446568 " "423727 " "421967 " "396007 " "395943" >> [40] " 270011 " "264386 " "278513 " "251855 " "251685 " " 129198 " >> "127541 " "117381 " "100000 " " " " 275417" "266459" " >> 214552" >> [53] "207312" "173557" "167647" "150516" "135842" "126667" >> "131642" "113804" "107364" "108242" " 102881" " 100000" >> >> Looks fine to me. Now I can use as.numeric() to convert to numbers >> (leading >> and lagging spaces should not be a problem): >> >> as.numeric(gsub("[$,]", "", as.character(moose$V3))) >>> >> [1] NA NA NA NA NA NA NA NA NA NA >> NA NA NA NA NA NA NA NA NA NA >> [21] NA NA NA NA NA NA NA NA NA NA >> NA NA NA NA NA NA NA NA NA NA >> [41] NA NA NA NA NA NA NA NA NA NA >> 266459 NA 207312 173557 167647 150516 135842 126667 131642 113804 >> [61] 107364 108242 NA NA >> Warning message: >> NAs introduced by coercion >> >> Something is wrong here! Let's have a look at one specific value: >> >> gsub("[$,]", "", as.character(moose$V3))[1] >>> >> [1] "880370 " >> >>> as.numeric(gsub("[$,]", "", as.character(moose$V3))[1]) >>> >> [1] NA >> Warning message: >> NAs introduced by coercion >> >> If the last character in the string would be a regular space it would not >> be >> a problem for as.numeric(): >> >> as.numeric("880370 ") >>> >> [1] 880370 >> >> But it looks like it's not a regular space character: >> >> substr(gsub("[$,]", "", as.character(moose$V3))[1], 7, 7) == " " >>> >> [1] FALSE >> >> It looks to me the spaces in some of the cells are not regular spaces. In >> the original HTML table they are defined as "non breaking spaces" i.e. >> >> >> So my question is WHAT ARE THEY? >> Is there a way to show the binary (hex) values of these characters? >> > > charToRaw(...) will show them > > gsub("[[:space:]]", "", ...) may remove them > > > J. R. M. Hosking > > >> Here is my environment: >> >> sessionInfo() >>> >> R version 2.11.1 (2010-05-31) >> i486-pc-linux-gnu >> >> locale: >> [1] LC_CTYPE=en_US.utf8 LC_NUMERIC=C >> LC_TIME=en_US.utf8 >> LC_COLLATE=en_US.utf8 LC_MONETARY=C >> [6] LC_MESSAGES=en_US.utf8 LC_PAPER=en_US.utf8 LC_NAME=C >> LC_ADDRESS=C LC_TELEPHONE=C >> [11] LC_MEASUREMENT=en_US.utf8 LC_IDENTIFICATION=C >> >> attached base packages: >> [1] stats graphics grDevices utils datasets methods base >> >> other attached packages: >> [1] XML_3.1-0 >> >> loaded via a namespace (and not attached): >> [1] tools_2.11.1 >> >> Thanks, >> >> -Mark- >> >> [[alternative HTML version deleted]] >> >> > ______________________________________________ > R-help@r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide > http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. > [[alternative HTML version deleted]] ______________________________________________ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.