Hello,
To count the number of values with fewer than 5 characters, use nchar
together with table, aggregate or tapply.
Since nchar needs a character vector and you have a factor, first
convert with as.character.
edt1a$ProcedureCode <- as.character(edt1a$ProcedureCode)
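A small check of why the conversion is needed, on a made-up factor (the values below are hypothetical stand-ins for edt1a$ProcedureCode). It also shows that as.character gives the same result as the levels(f)[f] idiom used in the original question.

```r
# Toy factor standing in for edt1a$ProcedureCode (hypothetical values)
f <- factor(c("44207", "0110", "171", NA, "99478", "0110"))

# nchar() refuses factors, so this call signals an error that we catch here
res <- tryCatch(nchar(f), error = function(e) "error")

# Two equivalent ways of recovering the labels as a character vector;
# the NA element stays NA in both
x1 <- as.character(f)
x2 <- levels(f)[f]
identical(x1, x2)
```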
1.
Now any of the next 3 instructions will tabulate the vector by number of
characters.
table(nchar(edt1a$ProcedureCode))
aggregate(ProcedureCode ~ nchar(ProcedureCode), edt1a, length)
tapply(edt1a$ProcedureCode, nchar(edt1a$ProcedureCode), length)
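To keep only the lengths below 5, as in the Var1/Freq layout asked for, the table can be restricted before counting. This is a minimal sketch on made-up codes; the vector codes and the object names are hypothetical. Note that nchar returns NA for NA strings under its default type, and table drops NA by default, so the NAs are counted separately.

```r
# Hypothetical toy data standing in for edt1a$ProcedureCode (already character)
codes <- c("44207", "0110", "171", NA, "99478", "0112", "17", "0110")

len <- nchar(codes)                  # NA entries give NA with the default type = "chars"
short <- table(len[which(len < 5)])  # which() discards the NA comparisons
short_df <- as.data.frame(short)     # data frame with columns Var1 and Freq

n_missing <- sum(is.na(codes))       # NAs are counted separately

short_df
n_missing
```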
2.
If you want to change the values with fewer than 5 characters, as well as
all NAs, to "99999", a vectorized logical index is a good way of doing it.
n <- nchar(edt1a$ProcedureCode) < 5
na <- is.na(edt1a$ProcedureCode)
edt1a$ProcedureCode[n | na] <- "99999"
Now convert back to a factor, which will include the new level "99999".
edt1a$ProcedureCode <- factor(edt1a$ProcedureCode)
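As a possible alternative that avoids the round trip through character, the recoding can be done on the factor levels, since there are far fewer levels than values. This is only a sketch on a made-up factor (f and lv are hypothetical names), and it relies on the documented behaviour that assigning duplicated labels to levels() merges those levels.

```r
# Toy factor standing in for edt1a$ProcedureCode (hypothetical values)
f <- factor(c("44207", "0110", "171", NA, "99478", "0110"))

# Recode the short labels; duplicated labels are merged into a single level
lv <- levels(f)
lv[nchar(lv) < 5] <- "99999"
levels(f) <- lv

# Make sure "99999" is a level before assigning it to the NA positions
if (!"99999" %in% levels(f)) levels(f) <- c(levels(f), "99999")
f[is.na(f)] <- "99999"

table(f)
```

The level order differs from what factor() would give on the converted character vector (there the levels are sorted), but the values themselves are the same.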
Hope this helps,
Rui Barradas
At 13:24 on 16/12/19, Bill Poling wrote:
#RStudio Version 1.2.5019
sessionInfo()
# R version 3.6.1 (2019-07-05)
# Platform: x86_64-w64-mingw32/x64 (64-bit)
# Running under: Windows 10 x64 (build 17134)
Good morning. I have a factor that contains 1,418,303 Clinical Procedure Codes
(CPT).
A CPT code is 5 characters long. However, my data contain many values that are
shorter (2, 3 or 4 characters), as well as NAs.
I get the count of NA's from the str() function = 58,481
Using the nchar function (I converted the Factor to a character column first) I
get the first 1K values.
(Perhaps this is not necessary with an alternative function?)
# edt1a$ProcedureCode1 <- levels(edt1a$ProcedureCode)[edt1a$ProcedureCode]
#https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/nchar
[989] 5 5 5 5 5 5 5 5 5 5 5 5
[ reached getOption("max.print") -- omitted 1417303 entries ]
What I would like to do is:
1. Identify the number (Count) of values that are less than 5 char (i.e. 2 char
= 150, 3 char = 925, 4 char = 1002)
Probably look something like this:
|Var1 | Freq|
|:------|-----:|
|2 | 150 |
|3 | 925 |
|4 | 1002|
2. Replace those short values, as well as the NAs, with 99999
head(edt1a$ProcedureCode1, n= 50) #Not apparent in top 50 but they are there
[1] "44207" "99478" "99478" "99479" "98927" "01610" "99396" "81025" "64645" "99478" "99479" "99479" "99479" "99479"
"99479" "97110" "J1885" "19081" "99479"
[20] "99478" "99479" "99479" "99479" "99213" "99213" "98927" "96372" "92507" "99479" "99478" "99478" "99478" "99479"
"77065" "19083" "95874" "99244" "A7034"
[39] "A7046" "71275" "J1170" "90471" "87591" "80053" "98926" "A4649" "A7033" "43644"
"85025" "73080"
str(edt1a$ProcedureCode) #Factor w/ 6244
Factor w/ 6244 levels "0003M","00100",..: 1775 4732 4732 4733 4586 147 4708
3108 2400 4732 ...
str(edt1a$ProcedureCode1)
chr [1:1418303] "44207" "99478" "99478" "99479" "98927" "01610" "99396" "81025" "64645" "99478" "99479" "99479"
"99479" "99479" "99479" "97110" "J1885" ...
#Some examples from using sink and knitr
sink("ProcCodeV2.txt")
knitr::kable(table(edt1a$ProcedureCode1))
closeAllConnections()
|Var1 | Freq|
|:------|-----:|
|0003M | 1|
|0110 | 4|<--
|0111 | 5|<--
|01112 | 11|
|0112 | 14|<--
|01120 | 3|
|0113 | 2|<--
|01130 | 1|
|0114 | 1|<--
|01160 | 3|
|01170 | 4|
|0120 | 7|<--
|01200 | 8|
|01202 | 26|
|0121 | 7|<--
|01210 | 19|
|01214 | 125|
|01215 | 5|
|0122 | 2|<--
|01220 | 2|
|01230 | 11|
|0124 | 5|<--
|171 | 1|<--
|17106 | 6|
Thank you for any help.
WHP
______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.