Hello J.R.M. Hosking,
charToRaw() works perfectly, thank you:
> charToRaw(as.character(moose[1, "V3"]))
[1] 24 38 38 30 2c 33 37 30 c2 a0
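For anyone reading along: the trailing bytes c2 a0 are the UTF-8 encoding of U+00A0 (NO-BREAK SPACE). Below is a small sketch that checks this on a made-up string; `x` is a hypothetical stand-in for moose[1, "V3"], not the real table value, and it assumes the string is UTF-8 encoded.

```r
# Hypothetical stand-in for moose[1, "V3"]: "$880,370" followed by a
# no-break space. enc2utf8() makes sure the string is stored as UTF-8.
x <- enc2utf8("$880,370\u00a0")

# Raw bytes of the string; a UTF-8 no-break space shows up as c2 a0
bytes <- charToRaw(x)
print(bytes)

# Unicode code points of the string; the no-break space is 160 (0xA0)
code_points <- utf8ToInt(x)
last_cp <- tail(code_points, 1)
print(last_cp == 0xA0)
```

utf8ToInt() is sometimes easier to read than charToRaw() because it gives one number per character instead of one per byte.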
gsub("[[:space:]]", "", ...) did not remove the trailing characters, but
now that I know what they are (hex: c2 a0, the UTF-8 bytes of a
non-breaking space) I can remove them with gsub():
> gsub("[$,\xc2\xa0]", "", as.character(moose[1, "V3"]))
[1] "880370"
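A possible variation, in case the raw-byte escape feels fragile: the same character can be written as the Unicode escape \u00a0, which does not depend on spelling out the UTF-8 bytes. Whether [[:space:]] matches U+00A0 appears to depend on the platform and locale, so naming the character explicitly is probably the safer route. The sketch below uses a few made-up values (`v3` is hypothetical, standing in for as.character(moose$V3)).

```r
# Hypothetical sample values standing in for as.character(moose$V3);
# some carry no-break spaces (U+00A0), one has an ordinary leading space.
v3 <- enc2utf8(c("$880,370\u00a0", "\u00a0$903,981\u00a0",
                 "$266,459", " $100,000"))

# Strip "$", "," and the no-break space in one pass
cleaned <- gsub("[$,\u00a0]", "", v3)

# as.numeric() tolerates ordinary leading and trailing spaces,
# so the remaining regular space is not a problem
values <- as.numeric(cleaned)
print(values)
```

Applied to the whole column this would be as.numeric(gsub("[$,\u00a0]", "", as.character(moose$V3))), assuming the strings coming back from readHTMLTable() are UTF-8.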
Kind regards,
-Mark-
2010/8/24 J. R. M. Hosking <[email protected]>
> On 2010-08-23 11:03, Mark Breman wrote:
>
>> Hello everyone,
>>
>> I am reading an HTML table from a website with readHTMLTable() from the
>> XML package:
>>
>> > library(XML)
>> > moose = readHTMLTable("http://www.decisionmoose.com/Moosistory.html",
>> +     header=FALSE, skip.rows=c(1,2), trim=TRUE)[[1]]
>>
>> > moose
>> V1 V2 V3
>> 1 07.02.2010 SWITCH to Long Bonds\n (BTTRX) $880,370
>> 2 05.07.2010 Switch to Gold (GLD) $878,736
>> 3 03.05.2010 Switch to US Small-cap Equities (IWM) $895,676
>> 4 01.22.2010 Switch to Cash (3moT) $895,572
>> ..... truncated by me!
>>
>> I am interested in the values in the third column:
>>
>> > as.character(moose$V3)
>> [1] "$880,370 " "$878,736 " "$895,676 " "$895,572 " "$932,139 "
>> "$932,131 " "$1,013,505 " "$817,451 " "$817,082 " "$848,133"
>> [11] "$904,527 " " $903,981 " "$902,582 " "$896,170 " "$809,853 "
>> " $808,852 " " $807,409 " "$802,658 " "$747,629 " "$672,465 "
>> [21] " $671,826 " "$645,352 " "$615,174 " "$609,415 " " $590,664 "
>> " $586,785 " "$561,056 " "$537,307 " " $535,744 " " $552,712 "
>> [31] "$551,615 " " $508,790 " "$501,161 " "$499,023 " " $446,568 "
>> "$423,727 " "$421,967 " "$396,007 " "$395,943 " " $270,011 "
>> [41] "$264,386 " "$278,513 " "$251,855 " "$251,685 " " $129,198 "
>> "$127,541 " "$117,381 " "$100,000 " " " " $275,417"
>> [51] "$266,459" " $214,552" "$207,312" "$173,557" "$167,647"
>> "$150,516" "$135,842" "$126,667" "$131,642" "$113,804"
>> [61] "$107,364" "$108,242" " $102,881" " $100,000"
>>
>> Notice the leading and trailing spaces on some of the values.
>>
>> I want to get the values as numeric values, so I try to get rid of the
>> $-character and commas with gsub() and a regular expression:
>>
>> > gsub("[$,]", "", as.character(moose$V3))
>> [1] "880370 " "878736 " "895676 " "895572 " "932139 " "932131 "
>> "1013505 " "817451 " "817082 " "848133 " "904527 " " 903981 " "902582 "
>> [14] "896170 " "809853 " " 808852 " " 807409 " "802658 " "747629 "
>> "672465 " " 671826 " "645352 " "615174 " "609415 " " 590664 " " 586785 "
>> [27] "561056 " "537307 " " 535744 " " 552712 " "551615 " " 508790 "
>> "501161 " "499023 " " 446568 " "423727 " "421967 " "396007 " "395943"
>> [40] " 270011 " "264386 " "278513 " "251855 " "251685 " " 129198 "
>> "127541 " "117381 " "100000 " " " " 275417" "266459" " 214552"
>> [53] "207312" "173557" "167647" "150516" "135842" "126667"
>> "131642" "113804" "107364" "108242" " 102881" " 100000"
>>
>> Looks fine to me. Now I can use as.numeric() to convert to numbers
>> (leading and trailing spaces should not be a problem):
>>
>> > as.numeric(gsub("[$,]", "", as.character(moose$V3)))
>> [1] NA NA NA NA NA NA NA NA NA NA
>> NA NA NA NA NA NA NA NA NA NA
>> [21] NA NA NA NA NA NA NA NA NA NA
>> NA NA NA NA NA NA NA NA NA NA
>> [41] NA NA NA NA NA NA NA NA NA NA
>> 266459 NA 207312 173557 167647 150516 135842 126667 131642 113804
>> [61] 107364 108242 NA NA
>> Warning message:
>> NAs introduced by coercion
>>
>> Something is wrong here! Let's have a look at one specific value:
>>
>> > gsub("[$,]", "", as.character(moose$V3))[1]
>> [1] "880370 "
>>
>> > as.numeric(gsub("[$,]", "", as.character(moose$V3))[1])
>> [1] NA
>> Warning message:
>> NAs introduced by coercion
>>
>> If the last character in the string were a regular space, it would not
>> be a problem for as.numeric():
>>
>> > as.numeric("880370 ")
>> [1] 880370
>>
>> But it looks like it's not a regular space character:
>>
>> > substr(gsub("[$,]", "", as.character(moose$V3))[1], 7, 7) == " "
>> [1] FALSE
>>
>> It looks to me as though the spaces in some of the cells are not regular
>> spaces. In the original HTML table they are defined as "non breaking
>> spaces", i.e. &nbsp;
>>
>> So my question is: WHAT ARE THEY?
>> Is there a way to show the binary (hex) values of these characters?
>>
>
> charToRaw(...) will show them
>
> gsub("[[:space:]]", "", ...) may remove them
>
>
> J. R. M. Hosking
>
>
>> Here is my environment:
>>
>> > sessionInfo()
>> R version 2.11.1 (2010-05-31)
>> i486-pc-linux-gnu
>>
>> locale:
>> [1] LC_CTYPE=en_US.utf8 LC_NUMERIC=C
>> LC_TIME=en_US.utf8
>> LC_COLLATE=en_US.utf8 LC_MONETARY=C
>> [6] LC_MESSAGES=en_US.utf8 LC_PAPER=en_US.utf8 LC_NAME=C
>> LC_ADDRESS=C LC_TELEPHONE=C
>> [11] LC_MEASUREMENT=en_US.utf8 LC_IDENTIFICATION=C
>>
>> attached base packages:
>> [1] stats graphics grDevices utils datasets methods base
>>
>> other attached packages:
>> [1] XML_3.1-0
>>
>> loaded via a namespace (and not attached):
>> [1] tools_2.11.1
>>
>> Thanks,
>>
>> -Mark-
>>
>> [[alternative HTML version deleted]]
>>
>>
> ______________________________________________
> [email protected] mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>