My suggestion is to avoid converting the column to a factor until it is cleaned up the way you want it. There is also the forcats package, but I still prefer to work with character data for cleaning. The stringsAsFactors=FALSE argument to read.table and friends helps with this.
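A minimal sketch of that workflow follows, using base R only. The vector cpt is a made-up stand-in for r1$CPT (the real data are not shown in this thread), and toupper() is just one assumed way to fold l into L; note that it would upper-case any other lower-case letters as well. The first half reproduces the leftover empty level and drops it; the second half cleans the data as character and converts to a factor once at the end.

```r
# Hypothetical stand-in for r1$CPT; the real codes are not available here.
cpt <- c("A4649", "C9359", "00140", "l8699", "L8699", "11042")

# Recoding values inside an existing factor leaves the old level behind.
b1_factor <- factor(substr(cpt, 1, 1))
b1_factor[b1_factor == "l"] <- "L"
table(b1_factor)                      # "l" is still listed, with a count of 0

# Dropping the unused level afterwards (factor(b1_factor) does the same).
b1_dropped <- droplevels(b1_factor)
levels(b1_dropped)                    # "l" is no longer a level

# Alternative: clean as character first, convert to factor last.
b1 <- substr(cpt, 1, 1)               # first character of each code
b1[grepl("^[0-9]$", b1)] <- "Z"       # leading digits become "Z"
b1 <- toupper(b1)                     # folds "l" into "L" (assumed cleaning rule)
b1 <- factor(b1)                      # levels come from the cleaned values only
levels(b1)
```

With the second approach there is never an empty level to drop, because the factor is built only after the values are final. In R 3.5.1 read.table() converts strings to factors by default, so stringsAsFactors=FALSE is needed there to keep the column as character on import; from R 4.0.0 onward FALSE is the default.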
On November 16, 2018 8:16:22 AM PST, Michael Dewey <li...@dewey.myzen.co.uk> wrote:
>Dear Bill
>
>When you do your step of replacing lower case l with upper case L the
>level still stays in the factor even though it is empty. If that is a
>nuisance x <- factor(x) will drop the unused levels. There are other
>ways of doing this.
>
>Michael
>
>On 16/11/2018 15:38, Bill Poling wrote:
>> Hello:
>>
>> I am running windows 10 -- R3.5.1 -- RStudio Version 1.1.456
>>
>> I would like to know why when I replace a column value it still appears in subsequent routines:
>>
>> My example:
>>
>> r1$B1 is a Factor: It is created from the first character of a list of CPT codes, r1$CPT.
>>
>> head(r1$CPT, N= 25)
>> [1] A4649 A4649 C9359 C1713 A0394 A0398
>> 903 Levels: 00000 00001 00140 00160 00670 00810 00940 01400 01470 01961 01968 10160 11000 11012 11042 11043 11044 11045 11100 11101 11200 11201 11401 11402 ... l8699
>>
>> str(r1$CPT)
>> Factor w/ 903 levels "00000","00001",..: 773 773 816 783 739 741 743 739 739 741 ...
>>
>>
>> And I want only those CPT's with leading alpha char in this column so I set the numeric leading char to Z
>>
>> r1$B1 <- str_sub(r1$CPT,1,1)
>>
>> r1$B1 <- as.factor(r1$B1) #Redundant
>> levels(r1$B1)[levels(r1$B1) %in% c('1','2','3','4','5','6','7','8','9','0')] <- 'Z'
>>
>> When I check what I have done I find l & L
>>
>> unique(r1$B1)
>> #[1] A C Z L G Q U J V E S l D P
>> #Levels: Z A C D E G J l L P Q S U V
>>
>> So I change l to L
>> r1$B1[r1$B1 == 'l'] <- 'L'
>>
>> When I check again I have l & L but l = 0
>> table(r1$B1)
>> #    Z     A     C     D     E     G     J     l     L     P     Q     S     U     V
>> #19639  1673   546     2     8   147   281     0   664     1    64    36   114    14
>>
>> When I go to find those rows as if they existed, they are not accounted for?
>>
>> tmp <- subset(r1, B1 == "l")
>> print(tmp)
>> Empty data.table (0 rows) of 9 cols: SavingsReversed,productID,ProviderID,PatientGender,ModCnt,Editnumber2...
>>
>> And I have actually visually inspected the whole darn column, sheesh!
>>
>> So I ignore it temporarily.
>>
>> Now later on it resurfaces in a tutorial I am following for caret pkg.
>>
>> preProcess(r1b, method = c("center", "scale"),
>>            thresh = 0.95, pcaComp = NULL, na.remove = TRUE, k = 5,
>>            knnSummary = mean, outcome = NULL, fudge = 0.2, numUnique = 3,
>>            verbose = FALSE, freqCut = 95/5, uniqueCut = 10, cutoff = 0.9,
>>            rangeBounds = c(0, 1))
>> # Warning in preProcess.default(r1b, method = c("center", "scale"), thresh = 0.95, :
>> #   These variables have zero variances: B1l <-------------yes this is a remnant of the r1$B1 clean-up
>> # Created from 23141 samples and 22 variables
>> #
>> # Pre-processing:
>> #   - centered (22)
>> #   - ignored (0)
>> #   - scaled (22)
>>
>>
>> So my questions are, in consideration of regression modelling accuracy:
>>
>> Why is this happening?
>> How do I remove it?
>> Or is it irrelevant and leave it be?
>>
>> As always, thank you for you support.
>>
>> WHP
>>
>> ______________________________________________
>> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.

--
Sent from my phone. Please excuse my brevity.