Thank you. This was very informative. When I run this command (str(y)), I get something like:

 $ WOMEN.X MEN 3 :'data.frame': 0 obs. of 5 variables:
  ..$ DayOfYear  : int(0)
  ..$ Quantity   : int(0)
  ..$ Fraction   : num(0)
  ..$ Category   : Factor w/ 46 levels "(Unknown)","10\" Plates",..:
  ..$ SubCategory: Factor w/ 246 levels "(Unknown)","70's Disco",..:

What does 'Factor w/ 46 levels ...' or 'Factor w/ 246 levels ...' mean in this output?

Thanks again.

Kevin

---- jim holtman <[EMAIL PROTECTED]> wrote:
> Is this something like what you were asking for? The output of a
> 'split' will be a list of the dataframe subsets for the categories you
> have specified.
>
> > x <- data.frame(g1=sample(LETTERS[1:2],30,TRUE),
> +     g2=sample(letters[1:2], 30, TRUE),
> +     g3=1:30)
> > y <- split(x, list(x$g1, x$g2))
> > str(y)
> List of 4
>  $ A.a:'data.frame': 7 obs. of 3 variables:
>   ..$ g1: Factor w/ 2 levels "A","B": 1 1 1 1 1 1 1
>   ..$ g2: Factor w/ 2 levels "a","b": 1 1 1 1 1 1 1
>   ..$ g3: int [1:7] 3 4 6 8 9 13 24
>  $ B.a:'data.frame': 7 obs. of 3 variables:
>   ..$ g1: Factor w/ 2 levels "A","B": 2 2 2 2 2 2 2
>   ..$ g2: Factor w/ 2 levels "a","b": 1 1 1 1 1 1 1
>   ..$ g3: int [1:7] 10 11 16 17 18 20 25
>  $ A.b:'data.frame': 6 obs. of 3 variables:
>   ..$ g1: Factor w/ 2 levels "A","B": 1 1 1 1 1 1
>   ..$ g2: Factor w/ 2 levels "a","b": 2 2 2 2 2 2
>   ..$ g3: int [1:6] 2 12 23 26 27 29
>  $ B.b:'data.frame': 10 obs. of 3 variables:
>   ..$ g1: Factor w/ 2 levels "A","B": 2 2 2 2 2 2 2 2 2 2
>   ..$ g2: Factor w/ 2 levels "a","b": 2 2 2 2 2 2 2 2 2 2
>   ..$ g3: int [1:10] 1 5 7 14 15 19 21 22 28 30
> > y
> $A.a
>    g1 g2 g3
> 3   A  a  3
> 4   A  a  4
> 6   A  a  6
> 8   A  a  8
> 9   A  a  9
> 13  A  a 13
> 24  A  a 24
>
> $B.a
>    g1 g2 g3
> 10  B  a 10
> 11  B  a 11
> 16  B  a 16
> 17  B  a 17
> 18  B  a 18
> 20  B  a 20
> 25  B  a 25
>
> $A.b
>    g1 g2 g3
> 2   A  b  2
> 12  A  b 12
> 23  A  b 23
> 26  A  b 26
> 27  A  b 27
> 29  A  b 29
>
> $B.b
>    g1 g2 g3
> 1   B  b  1
> 5   B  b  5
> 7   B  b  7
> 14  B  b 14
> 15  B  b 15
> 19  B  b 19
> 21  B  b 21
> 22  B  b 22
> 28  B  b 28
> 30  B  b 30
>
> > y[[2]]
>    g1 g2 g3
> 10  B  a 10
> 11  B  a 11
> 16  B  a 16
> 17  B  a 17
> 18  B  a 18
> 20  B  a 20
> 25  B  a 25
>
>
> On Sat, Jul 12, 2008 at 8:51 PM, <[EMAIL PROTECTED]> wrote:
> > OK. Now I know that I am dealing with a data frame. One last question on
> > this topic. a <- read.csv() gives me a dataframe. If I have 'c <- split(x,
> > x$Category)', then what is returned by split in this case? c[1] seems to be
> > OK but c[2] is not right in my mind. If I run ci <- split(nrow(a),
> > a$Category), then ci[1] seems to be the rows associated with the first
> > category, ci[2] is the indices/rows associated with the second category,
> > etc. But this seems different than c[1], c[2], etc.
> >
> > Using the techniques below I can get the information on the categories. Now
> > as an extra level of complexity there are SubCategories within each
> > Category. Assume that the SubCategory names are not unique within the
> > dataset, so if I want the SubCategory data I need to retrieve the indices (or
> > data) for the Category and SubCategory pair. In other words, if I have a
> > Category that ranges from 'A' to 'Z', it is possible that I might have a
> > subcategory A a, A b (where a and b are the subcategory names). I also
> > might have B a, B b. I want all of the subcategories A a, NOT the
> > subcategories a (because that might include B a, which would be different).
> > I am guessing that this will take more than a simple 'split'.
> >
> > Thank you.
> >
> > Kevin
> >
> > ---- Duncan Murdoch <[EMAIL PROTECTED]> wrote:
> >> On 12/07/2008 3:59 PM, [EMAIL PROTECTED] wrote:
> >> > I am sorry but if read.csv returns a dataframe and a dataframe is like a
> >> > matrix and I have a set of input like below and a[1,] gives me the first
> >> > row, what is the second index? From what I read and your input I am
> >> > guessing that it is the column number. So a[1,1] would return the
> >> > DayOfYear column for the first row, right? What does a$DayOfYear return?
> >>
> >> a$DayOfYear would be the same as a[,1] or a[,"DayOfYear"], i.e. it would
> >> return the entire first column.
> >>
> >> Duncan Murdoch
> >>
> >> >
> >> > Thank you for your patience.
> >> >
> >> > Kevin
> >> >
> >> > ---- Duncan Murdoch <[EMAIL PROTECTED]> wrote:
> >> >> On 12/07/2008 12:31 PM, [EMAIL PROTECTED] wrote:
> >> >>> I am using a simple R statement to read in the file:
> >> >>>
> >> >>> a <- read.csv("Sample.dat", header=TRUE)
> >> >>>
> >> >>> There is a lot of data but the first few lines look like:
> >> >>>
> >> >>> DayOfYear,Quantity,Fraction,Category,SubCategory
> >> >>> 1,82,0.0000390392720794458,(Unknown),(Unknown)
> >> >>> 2,78,0.0000371349173438631,(Unknown),(Unknown)
> >> >>> . . .
> >> >>> 71,2,0.0000009521773677913,WOMEN,Piratesses
> >> >>> 72,4,0.0000019043547355827,WOMEN,Piratesses
> >> >>> 73,3,0.0000014282660516870,WOMEN,Piratesses
> >> >>> 74,14,0.0000066652415745395,WOMEN,Piratesses
> >> >>> 75,2,0.0000009521773677913,WOMEN,Piratesses
> >> >>>
> >> >>> If I read the data in as above, the command
> >> >>>
> >> >>> a[1]
> >> >>>
> >> >>> results in the output
> >> >>>
> >> >>> [ reached getOption("max.print") -- omitted 16193 rows ]]
> >> >>>
> >> >>> Shouldn't this be the first row?
> >> >> No, the first row would be a[1,]. read.csv() returns a dataframe, and
> >> >> those are indexed with two indices to treat them like a matrix, or with
> >> >> one index to treat them like a list of their columns.
> >> >>
> >> >> Duncan Murdoch
> >> >>
> >> >>> a$Category[1]
> >> >>>
> >> >>> results in the output
> >> >>>
> >> >>> [1] (Unknown)
> >> >>> 4464 Levels: Tags ... WOMEN
> >> >>>
> >> >>> But
> >> >>>
> >> >>> a$Category[365]
> >> >>>
> >> >>> gives me:
> >> >>>
> >> >>> [1] 7 Plates (Dessert),Western\n120,5,0.0000023804434194784,7 Plates
> >> >>> (Dessert)
> >> >>> 4464 Levels: Tags ... WOMEN
> >> >>>
> >> >>> There is something fundamental about either vectors or the read.csv
> >> >>> command that I am missing here.
> >> >>>
> >> >>> Thank you.
> >> >>>
> >> >>> Kevin
> >> >>>
> >> >>> ---- jim holtman <[EMAIL PROTECTED]> wrote:
> >> >>>> Please provide commented, minimal, self-contained, reproducible code,
> >> >>>> or at least a before/after of what your data would look like. Taking a
> >> >>>> guess at what you are asking, here is one way of doing it:
> >> >>>>
> >> >>>>
> >> >>>>> x <- data.frame(cat=sample(LETTERS[1:3],20,TRUE),a=1:20, b=runif(20))
> >> >>>>> x
> >> >>>>    cat  a          b
> >> >>>> 1    B  1 0.65472393
> >> >>>> 2    C  2 0.35319727
> >> >>>> 3    B  3 0.27026015
> >> >>>> 4    A  4 0.99268406
> >> >>>> 5    C  5 0.63349326
> >> >>>> 6    A  6 0.21320814
> >> >>>> 7    C  7 0.12937235
> >> >>>> 8    A  8 0.47811803
> >> >>>> 9    A  9 0.92407447
> >> >>>> 10   A 10 0.59876097
> >> >>>> 11   A 11 0.97617069
> >> >>>> 12   A 12 0.73179251
> >> >>>> 13   B 13 0.35672691
> >> >>>> 14   C 14 0.43147369
> >> >>>> 15   C 15 0.14821156
> >> >>>> 16   C 16 0.01307758
> >> >>>> 17   B 17 0.71556607
> >> >>>> 18   B 18 0.10318424
> >> >>>> 19   C 19 0.44628435
> >> >>>> 20   B 20 0.64010105
> >> >>>>> # create a list of the indices of the data grouped by 'cat'
> >> >>>>> split(seq(nrow(x)), x$cat)
> >> >>>> $A
> >> >>>> [1]  4  6  8  9 10 11 12
> >> >>>>
> >> >>>> $B
> >> >>>> [1]  1  3 13 17 18 20
> >> >>>>
> >> >>>> $C
> >> >>>> [1]  2  5  7 14 15 16 19
> >> >>>>
> >> >>>>> # or do you want the data
> >> >>>>> split(x, x$cat)
> >> >>>> $A
> >> >>>>    cat  a         b
> >> >>>> 4    A  4 0.9926841
> >> >>>> 6    A  6 0.2132081
> >> >>>> 8    A  8 0.4781180
> >> >>>> 9    A  9 0.9240745
> >> >>>> 10   A 10 0.5987610
> >> >>>> 11   A 11 0.9761707
> >> >>>> 12   A 12 0.7317925
> >> >>>>
> >> >>>> $B
> >> >>>>    cat  a         b
> >> >>>> 1    B  1 0.6547239
> >> >>>> 3    B  3 0.2702601
> >> >>>> 13   B 13 0.3567269
> >> >>>> 17   B 17 0.7155661
> >> >>>> 18   B 18 0.1031842
> >> >>>> 20   B 20 0.6401010
> >> >>>>
> >> >>>> $C
> >> >>>>    cat  a          b
> >> >>>> 2    C  2 0.35319727
> >> >>>> 5    C  5 0.63349326
> >> >>>> 7    C  7 0.12937235
> >> >>>> 14   C 14 0.43147369
> >> >>>> 15   C 15 0.14821156
> >> >>>> 16   C 16 0.01307758
> >> >>>> 19   C 19 0.44628435
> >> >>>>
> >> >>>>
> >> >>>> On Sat, Jul 12, 2008 at 3:32 AM, <[EMAIL PROTECTED]> wrote:
> >> >>>>> I have searched the archive and I could not find what I need so I will
> >> >>>>> try to ask the question here.
> >> >>>>>
> >> >>>>> I read a table in (read.table)
> >> >>>>>
> >> >>>>> a <- read.table(.....)
> >> >>>>>
> >> >>>>> The table has column names like DayOfYear, Quantity, and Category.
> >> >>>>>
> >> >>>>> The values in the row for Category are strings (characters).
> >> >>>>>
> >> >>>>> I want to get all of the rows grouped by Category. The number of
> >> >>>>> unique category names could be around 50. Say for argument's sake the
> >> >>>>> number of categories is exactly 50. Can I somehow get a vector of
> >> >>>>> length 50 containing the rows corresponding to the category (another
> >> >>>>> vector)? I realize I can access any row a[i]$Category (right?). But
> >> >>>>> I want a vector containing the rows corresponding to each distinct
> >> >>>>> Category name.
> >> >>>>>
> >> >>>>> Thank you.
> >> >>>>>
> >> >>>>> Kevin
> >> >>>>>
> >> >>>>> ______________________________________________
> >> >>>>> R-help@r-project.org mailing list
> >> >>>>> https://stat.ethz.ch/mailman/listinfo/r-help
> >> >>>>> PLEASE do read the posting guide
> >> >>>>> http://www.R-project.org/posting-guide.html
> >> >>>>> and provide commented, minimal, self-contained, reproducible code.
> >> >>>>>
> >> >>>>
> >> >>>> --
> >> >>>> Jim Holtman
> >> >>>> Cincinnati, OH
> >> >>>> +1 513 646 9390
> >> >>>>
> >> >>>> What is the problem you are trying to solve?
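
On the 'Factor w/ 46 levels' question raised at the top of the thread, here is a minimal sketch in R. The data frame is made up to stand in for Sample.dat, and the names y_all, y and wp are hypothetical, chosen only for illustration. It suggests that a factor keeps its complete table of distinct labels (the levels) no matter how few rows remain, that split() on two factors creates an element for every level combination, including ones with 0 obs., and that drop = TRUE and droplevels() are one way to trim both.

```r
# Toy data standing in for Sample.dat (values are invented for illustration).
a <- data.frame(
  DayOfYear   = 1:6,
  Quantity    = c(82L, 78L, 2L, 4L, 3L, 14L),
  Category    = factor(c("MEN", "MEN", "WOMEN", "WOMEN", "WOMEN", "MEN")),
  SubCategory = factor(c("Pirates", "Cowboys", "Piratesses",
                         "Piratesses", "Cowboys", "Pirates"))
)

# A factor stores integer codes plus a table of distinct labels (the levels).
# "Factor w/ N levels" in str() reports the size of that table.
nlevels(a$Category)
levels(a$SubCategory)

# Default split on a Category/SubCategory pair: one list element per
# combination of levels, even combinations that never occur (0 obs.).
y_all <- split(a, list(a$Category, a$SubCategory))
sapply(y_all, nrow)

# drop = TRUE discards the combinations that have no rows.
y <- split(a, list(a$Category, a$SubCategory), drop = TRUE)
names(y)

# Each subset still carries the full level table of the original column;
# droplevels() reduces it to the labels actually present in the subset.
wp <- droplevels(y[["WOMEN.Piratesses"]])
nlevels(y[["WOMEN.Piratesses"]]$SubCategory)
nlevels(wp$SubCategory)
```

Under these assumptions, str(y_all) would show the same kind of empty 'data.frame': 0 obs. entries as in the question, while str(y) lists only the Category/SubCategory pairs that have rows.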