Dear Bill,
When you replace lower-case l with upper-case L, the level l still stays
in the factor even though it is now empty. If that is a nuisance,
x <- factor(x) will drop the unused levels. There are other ways of
doing this.
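A minimal sketch of the behaviour described above, using a made-up toy factor in place of r1$B1 (the values and the names x_refactored and x_dropped are purely illustrative). It shows the empty level surviving a value replacement, and two base R calls, factor() and droplevels(), that remove it:

```r
# Toy data standing in for r1$B1 (values are made up for illustration)
x <- factor(c("A", "l", "L", "Z", "L"))

# Reassigning the values leaves the level "l" in place with a count of zero
x[x == "l"] <- "L"
table(x)
levels(x)            # still includes "l"

# Either call rebuilds the factor with only the levels actually in use
x_refactored <- factor(x)
x_dropped    <- droplevels(x)
levels(x_dropped)    # "l" is gone
```

Both calls give the same set of levels here; droplevels() also accepts a whole data frame, which may be convenient when several factor columns are affected.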
Michael
On 16/11/2018 15:38, Bill Poling wrote:
Hello:
I am running Windows 10 -- R 3.5.1 -- RStudio Version 1.1.456
I would like to know why, after I replace a column value, the old value still
appears in subsequent routines:
My example:
r1$B1 is a Factor: It is created from the first character of a list of CPT
codes, r1$CPT.
head(r1$CPT, N= 25)
[1] A4649 A4649 C9359 C1713 A0394 A0398
903 Levels: 00000 00001 00140 00160 00670 00810 00940 01400 01470 01961 01968
10160 11000 11012 11042 11043 11044 11045 11100 11101 11200 11201 11401 11402
... l8699
str(r1$CPT)
Factor w/ 903 levels "00000","00001",..: 773 773 816 783 739 741 743 739 739
741 ...
I want only those CPTs with a leading alpha character in this column, so I set
every leading numeric character to Z
library(stringr) # provides str_sub()
r1$B1 <- str_sub(r1$CPT,1,1)
r1$B1 <- as.factor(r1$B1) #Redundant
levels(r1$B1)[levels(r1$B1) %in% c('1','2','3','4','5','6','7','8','9','0')] <- 'Z'
When I check what I have done I find l & L
unique(r1$B1)
#[1] A C Z L G Q U J V E S l D P
#Levels: Z A C D E G J l L P Q S U V
So I change l to L
r1$B1[r1$B1 == 'l'] <- 'L'
When I check again I still have both l and L, but the count for l is 0
table(r1$B1)
#     Z     A     C     D     E     G     J     l     L     P     Q     S     U     V
# 19639  1673   546     2     8   147   281     0   664     1    64    36   114    14
When I try to select those rows as if they existed, none are found:
tmp <- subset(r1, B1 == "l")
print(tmp)
Empty data.table (0 rows) of 9 cols:
SavingsReversed,productID,ProviderID,PatientGender,ModCnt,Editnumber2...
And I have actually visually inspected the whole darn column, sheesh!
So I ignore it temporarily.
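One possible alternative to replacing the values and then dropping the empty level is to rename the level itself, which merges it into the existing L in a single step. This is only a sketch on a made-up toy factor, not on the real r1$B1:

```r
# Toy data standing in for r1$B1 (values are made up for illustration)
x <- factor(c("A", "l", "L", "Z", "L"))

# Renaming the level "l" to the already existing "L" merges the two levels,
# so no empty "l" level is left behind
levels(x)[levels(x) == "l"] <- "L"

levels(x)   # "l" no longer appears
table(x)    # the former "l" observation is now counted under "L"
```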
Now later on it resurfaces in a tutorial I am following for caret pkg.
preProcess(r1b, method = c("center", "scale"),
thresh = 0.95, pcaComp = NULL, na.remove = TRUE, k = 5,
knnSummary = mean, outcome = NULL, fudge = 0.2, numUnique = 3,
verbose = FALSE, freqCut = 95/5, uniqueCut = 10, cutoff = 0.9,
rangeBounds = c(0, 1))
# Warning in preProcess.default(r1b, method = c("center", "scale"), thresh = 0.95, :
#  These variables have zero variances: B1l   <-- yes, this is a remnant of the r1$B1 clean-up
# Created from 23141 samples and 22 variables
#
# Pre-processing:
# - centered (22)
# - ignored (0)
# - scaled (22)
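The column name B1l in the warning suggests that r1b was built by dummy-coding B1, although the post does not show that step, so this is an assumption. Under that assumption, the sketch below uses base R's model.matrix() on a made-up data frame d to show how an empty factor level turns into an all-zero column, and how dropping the level first removes it:

```r
# Toy data frame: the level "l" is declared but never observed
# (the data and the names d, mm, mm_clean are illustrative only)
d <- data.frame(
  B1 = factor(c("Z", "A", "L", "L"), levels = c("Z", "A", "l", "L"))
)

# Dummy coding keeps one column per non-baseline level, used or not
mm <- model.matrix(~ B1, data = d)
colnames(mm)        # includes "B1l"
mm[, "B1l"]         # all zeros, hence zero variance

# Dropping unused levels before dummy coding removes the empty column
mm_clean <- model.matrix(~ B1, data = droplevels(d))
colnames(mm_clean)  # no "B1l"
```

A zero-variance column carries no information for a regression, so dropping the unused level before building the design matrix seems the cleaner choice.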
So my questions are, in consideration of regression modelling accuracy:
Why is this happening?
How do I remove it?
Or is it irrelevant and leave it be?
As always, thank you for your support.
WHP
______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
--
Michael
http://www.dewey.myzen.co.uk/home.html