On Nov 29, 2011, at 1:18 PM, Rich Shepard wrote:

> I have a data frame with 1 factor, one date, and 37 numeric values:
>
> str(waterchem)
> 'data.frame': 3525 obs. of 39 variables:
>  $ site : Factor w/ 64 levels "D-1","D-2","D-3",..: 1 1 1 1 1 ...
>  $ sampdate : Date, format: "2007-12-12" "2008-03-15" ...
>  $ CO3 : num 1 1 6.7 1 1 1 1 1 1 1 ...
>  $ HCO3 : num 231 228 118 246 157 208 338 285 260 240 ...
>  $ Ca : num 100 88.4 63.4 123 78.2 103 265 213 178 166 ...
>  $ DO : num 4.96 9.91 4.32 2.58 1.81 5.09 3.98 5.46 1.9 2.52 ...
>  ...
>  $ SC : Factor w/ 841 levels "1.090","10.000",..: 635 638 363
>
> All the numeric categories are read in as numbers except for some of those
> in column 'SC'. I have been looking in the source file for a couple of hours
> trying to learn why values such as 1.090 and 10.000 are seen as characters
> rather than numbers. I've not seen the reason.
>
> The source file is 860K and looks like this:
>
> site|sampdate|'Ag'|'Al'|'CO3'|'HCO3'|'Alk-Tot'|'As'|'Ba'|'Be'|'Bi'|'Ca'|'Cd'|'Cl'|'Co'|'Cr'|'Cu'|'DO'|'Fe'|'Hg'|'K'|'Mg'|'Mn'|'Mo'|'Na'|'NH4'|'NO3-NO2'|'Oil-grease'|'Pb'|'pH'|'Sb'|'SC'|'Se'|'SO4'|'Sr'|'TDS'|'Tl'|'V'|'Zn'
> 'D-1'|'2007-12-12'|0.000|0.106|1.000|231.000|231.000|0.011|0.000|0.002|0.000|100.000|0.000|1.430|0.000|0.006|0.024|4.960|4.110|NA|0.000|9.560|0.035|0.000|0.970|0.010|0.293|NA|0.025|7.800|0.001|630.000|0.001|65.800|0.000|320.000|0.001|0.000|11.400
> 'D-1'|'2008-03-15'|0.000|0.080|1.000|228.000|228.000|0.001|0.000|0.002|0.000|88.400|0.000|1.340|0.000|0.006|0.014|9.910|0.309|0.000|0.000|9.150|0.047|0.000|0.820|0.224|0.020|NA|0.025|7.940|0.001|633.000|0.001|75.400|0.000|300.000|0.001|0.000|12.400
>
> The R command used to create the data frame is:
>
> waterchem <- read.table('wqR.txt', header = TRUE, sep = '|')
>
> Pointers on how to determine why this one variable has some values as
> characters rather than as numerics are needed.
>
> Rich

Rich,

Somewhere in that column are non-numeric characters (other than 0 through 9 and a decimal point), resulting in the column being coerced to a factor.

Not fully tested, but using grepl() along the lines of:

Vec <- c(1.09, 1.23, "1,23", "A", 2.067)

> which(grepl("[^0-9\\.]", Vec))
[1] 3 4

Will give you the indices of the entries in the column that contain non-numeric characters.

> Vec[which(grepl("[^0-9\\.]", Vec))]
[1] "1,23" "A"

Will give you the entries themselves.

The read.table() family of functions use type.convert() internally to do the data type coercions:

> type.convert(Vec)
[1] 1.09  1.23  1,23  A     2.067
Levels: 1,23 1.09 1.23 2.067 A

So 'Vec' is coerced to a factor due to the non-numeric characters contained in the entries.

HTH,

Marc Schwartz

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
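
A complementary check, offered as a rough sketch rather than a tested recipe: the grepl() pattern above also flags strings that are legitimately numeric but contain other characters, such as a minus sign or an exponent ("-2.5", "1e3"). An alternative is to let R attempt the conversion and look at which entries fail. The vector 'sc' below is a made-up stand-in for the real column, not data from wqR.txt.

```r
# Toy stand-in for the problem column; 'sc' and its values are hypothetical
# and are not taken from wqR.txt.
sc <- factor(c("630.000", "633.000", "1.090", "1,23", "A", "-2.5", "1e3", NA))

# Go through character first: as.numeric() applied directly to a factor
# returns the internal integer level codes, not the printed values.
sc_chr <- as.character(sc)
sc_num <- suppressWarnings(as.numeric(sc_chr))

# Entries that were present in the input but could not be parsed as numbers.
# Genuine NAs are excluded by the first condition.
bad <- which(!is.na(sc_chr) & is.na(sc_num))

bad          # positions of the offending entries: 4 5
sc_chr[bad]  # the offending entries themselves: "1,23" "A"
```

With the real data, the same steps would be applied to waterchem$SC in place of 'sc'. Once the offending entries have been corrected in the source file, re-reading it should give a numeric column.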