Hello J.R.M. Hosking,

charToRaw() works perfectly, thank you:

> charToRaw(as.character(moose[1, "V3"]))
 [1] 24 38 38 30 2c 33 37 30 c2 a0
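
The last two bytes of that output, c2 a0, are the UTF-8 encoding of U+00A0 (NO-BREAK SPACE), which is what &nbsp; becomes once the HTML is parsed. A minimal sketch to confirm this with base R only; the names nbsp_bytes and code_point are just for illustration:

```r
# The two trailing bytes from the charToRaw() output above
nbsp_bytes <- as.raw(c(0xc2, 0xa0))

# Decode the UTF-8 byte pair to its Unicode code point
code_point <- utf8ToInt(rawToChar(nbsp_bytes))
code_point                      # 160, i.e. U+00A0 (NO-BREAK SPACE)

# Going the other way: encode U+00A0 as UTF-8 and inspect the bytes
charToRaw(intToUtf8(160))       # c2 a0
```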

gsub("[[:space:]]", "", ...) did not remove them, but now that I know what
they are (hex: c2 a0), I can remove them with gsub():

> gsub("[$,\xc2\xa0]", "", as.character(moose[1, "V3"]))
[1] "880370"
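
The same cleanup can also be written with the Unicode escape "\u00a0" instead of the raw bytes, and then passed straight to as.numeric(). This is only a sketch on made-up sample strings: the vector x is hypothetical and stands in for as.character(moose$V3).

```r
# Hypothetical sample standing in for as.character(moose$V3);
# "\u00a0" is the non-breaking space that &nbsp; turns into.
x <- c("$880,370\u00a0", "\u00a0$903,981\u00a0", "$266,459")

# Drop dollar signs, commas and non-breaking spaces in one pass
cleaned <- gsub("[$,\u00a0]", "", x)

# With the non-breaking spaces gone, the coercion no longer yields NA
values <- as.numeric(cleaned)
values
```

Whether "[[:space:]]" matches U+00A0 seems to depend on the platform and the regex engine, so naming the character explicitly is probably the safer choice.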

Kind regards,

-Mark-


2010/8/24 J. R. M. Hosking <jrmh...@gmail.com>

> On 2010-08-23 11:03, Mark Breman wrote:
>
>> Hello everyone,
>>
>> I am reading an HTML table from a website with readHTMLTable() from the XML
>> package:
>>
>> > library(XML)
>> > moose = readHTMLTable("http://www.decisionmoose.com/Moosistory.html",
>> +                       header=FALSE, skip.rows=c(1,2), trim=TRUE)[[1]]
>>
>> > moose
>>             V1                                         V2          V3
>> 1   07.02.2010  SWITCH to Long Bonds\n            (BTTRX)   $880,370
>> 2   05.07.2010                       Switch to Gold (GLD)   $878,736
>> 3   03.05.2010      Switch to US Small-cap Equities (IWM)   $895,676
>> 4   01.22.2010                      Switch to Cash (3moT)   $895,572
>> ..... truncated by me!
>>
>> I am interested in the values in the third column:
>>
>> > as.character(moose$V3)
>>  [1] "$880,370 "   "$878,736 "   "$895,676 "   "$895,572 "   "$932,139 "
>> "$932,131 "   "$1,013,505 " "$817,451 "   "$817,082 "   "$848,133"
>> [11] "$904,527 "   " $903,981 "  "$902,582 "   "$896,170 "   "$809,853 "
>> " $808,852 "  " $807,409 "  "$802,658 "   "$747,629 "   "$672,465 "
>> [21] " $671,826 "  "$645,352 "   "$615,174 "   "$609,415 "   " $590,664 "
>>  " $586,785 "  "$561,056 "   "$537,307 "   " $535,744 "  " $552,712 "
>> [31] "$551,615 "   " $508,790 "  "$501,161 "   "$499,023 "   " $446,568 "
>>  "$423,727 "   "$421,967 "   "$396,007 "   "$395,943 "   " $270,011 "
>> [41] "$264,386 "   "$278,513 "   "$251,855 "   "$251,685 "   " $129,198 "
>>  "$127,541 "   "$117,381 "   "$100,000 "   " "           " $275,417"
>> [51] "$266,459"    " $214,552"   "$207,312"    "$173,557"    "$167,647"
>>  "$150,516"    "$135,842"    "$126,667"    "$131,642"    "$113,804"
>> [61] "$107,364"    "$108,242"    " $102,881"   " $100,000"
>>
>> Notice the leading and trailing spaces on some of the values.
>>
>> I want the values as numbers, so I try to strip the $ character and the
>> commas with gsub() and a regular expression:
>>
>> > gsub("[$,]", "", as.character(moose$V3))
>>  [1] "880370 "  "878736 "  "895676 "  "895572 "  "932139 "  "932131 "
>>  "1013505 " "817451 "  "817082 "  "848133 "  "904527 "  " 903981 " "902582 "
>> [14] "896170 "  "809853 "  " 808852 " " 807409 " "802658 "  "747629 "
>>  "672465 "  " 671826 " "645352 "  "615174 "  "609415 "  " 590664 " " 586785 "
>> [27] "561056 "  "537307 "  " 535744 " " 552712 " "551615 "  " 508790 "
>> "501161 "  "499023 "  " 446568 " "423727 "  "421967 "  "396007 "  "395943"
>> [40] " 270011 " "264386 "  "278513 "  "251855 "  "251685 "  " 129198 "
>> "127541 "  "117381 "  "100000 "  " "        " 275417"  "266459"   " 214552"
>> [53] "207312"   "173557"   "167647"   "150516"   "135842"   "126667"
>> "131642"   "113804"   "107364"   "108242"   " 102881"  " 100000"
>>
>> Looks fine to me. Now I can use as.numeric() to convert to numbers
>> (leading and trailing spaces should not be a problem):
>>
>> > as.numeric(gsub("[$,]", "", as.character(moose$V3)))
>>  [1]     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
>>   NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
>> [21]     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
>>   NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
>> [41]     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
>> 266459     NA 207312 173557 167647 150516 135842 126667 131642 113804
>> [61] 107364 108242     NA     NA
>> Warning message:
>> NAs introduced by coercion
>>
>> Something is wrong here! Let's have a look at one specific value:
>>
>> > gsub("[$,]", "", as.character(moose$V3))[1]
>> [1] "880370 "
>>
>> > as.numeric(gsub("[$,]", "", as.character(moose$V3))[1])
>> [1] NA
>> Warning message:
>> NAs introduced by coercion
>>
>> If the last character in the string were a regular space, it would not
>> be a problem for as.numeric():
>>
>> > as.numeric("880370 ")
>> [1] 880370
>>
>> But it looks like it's not a regular space character:
>>
>> > substr(gsub("[$,]", "", as.character(moose$V3))[1], 7, 7) == " "
>> [1] FALSE
>>
>> It looks to me as if the spaces in some of the cells are not regular
>> spaces. In the original HTML table they are defined as non-breaking
>> spaces, i.e. &nbsp;
>>
>> So my question is WHAT ARE THEY?
>> Is there a way to show the binary (hex) values of these characters?
>>
>
> charToRaw(...)  will show them
>
> gsub("[[:space:]]", "", ...)  may remove them
>
>
> J. R. M. Hosking
>
>
>> Here is my environment:
>>
>> > sessionInfo()
>> R version 2.11.1 (2010-05-31)
>> i486-pc-linux-gnu
>>
>> locale:
>>  [1] LC_CTYPE=en_US.utf8       LC_NUMERIC=C
>>  LC_TIME=en_US.utf8
>>        LC_COLLATE=en_US.utf8     LC_MONETARY=C
>>  [6] LC_MESSAGES=en_US.utf8    LC_PAPER=en_US.utf8       LC_NAME=C
>>       LC_ADDRESS=C              LC_TELEPHONE=C
>> [11] LC_MEASUREMENT=en_US.utf8 LC_IDENTIFICATION=C
>>
>> attached base packages:
>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>
>> other attached packages:
>> [1] XML_3.1-0
>>
>> loaded via a namespace (and not attached):
>> [1] tools_2.11.1
>>
>> Thanks,
>>
>> -Mark-
>>
>>        [[alternative HTML version deleted]]
>>
>>
> ______________________________________________
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>

