On Sun, Jul 13, 2008 at 5:45 PM,  <[EMAIL PROTECTED]> wrote:
> Thank you I will try drop=TRUE.
>
> In the meantime, do you know how I can access the members (for lack of a
> better term) of the results of a split? In the sample you provided below you
> have:
>
> z <- split(x, list(x$cat, x$a), drop=TRUE)

You can do 'str(z)' to see the structure of 'z'.  In most cases, you
should be able to reference by the keys, if they exist:

> n <- 20
> set.seed(1)
> x <- data.frame(a=sample(LETTERS[1:2], n, TRUE),
+                 b=sample(letters[1:4], n, TRUE), val=runif(n))
> z <- split(x, list(x$a, x$b), drop=TRUE)
> str(z)
List of 8
 $ A.a:'data.frame':    2 obs. of  3 variables:
  ..$ a  : Factor w/ 2 levels "A","B": 1 1
  ..$ b  : Factor w/ 4 levels "a","b","c","d": 1 1
  ..$ val: num [1:2] 0.647 0.245
 $ B.a:'data.frame':    3 obs. of  3 variables:
  ..$ a  : Factor w/ 2 levels "A","B": 2 2 2
  ..$ b  : Factor w/ 4 levels "a","b","c","d": 1 1 1
  ..$ val: num [1:3] 0.5530 0.0233 0.5186
 $ A.b:'data.frame':    3 obs. of  3 variables:
  ..$ a  : Factor w/ 2 levels "A","B": 1 1 1
  ..$ b  : Factor w/ 4 levels "a","b","c","d": 2 2 2
  ..$ val: num [1:3] 0.530 0.693 0.478
 $ B.b:'data.frame':    4 obs. of  3 variables:
  ..$ a  : Factor w/ 2 levels "A","B": 2 2 2 2
  ..$ b  : Factor w/ 4 levels "a","b","c","d": 2 2 2 2
  ..$ val: num [1:4] 0.789 0.477 0.438 0.407
 $ A.c:'data.frame':    3 obs. of  3 variables:
  ..$ a  : Factor w/ 2 levels "A","B": 1 1 1
  ..$ b  : Factor w/ 4 levels "a","b","c","d": 3 3 3
  ..$ val: num [1:3] 0.8612 0.0995 0.6620
 $ B.c:'data.frame':    1 obs. of  3 variables:
  ..$ a  : Factor w/ 2 levels "A","B": 2
  ..$ b  : Factor w/ 4 levels "a","b","c","d": 3
  ..$ val: num 0.783
 $ A.d:'data.frame':    1 obs. of  3 variables:
  ..$ a  : Factor w/ 2 levels "A","B": 1
  ..$ b  : Factor w/ 4 levels "a","b","c","d": 4
  ..$ val: num 0.821
 $ B.d:'data.frame':    3 obs. of  3 variables:
  ..$ a  : Factor w/ 2 levels "A","B": 2 2 2
  ..$ b  : Factor w/ 4 levels "a","b","c","d": 4 4 4
  ..$ val: num [1:3] 0.7323 0.0707 0.3163
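
The keys that str() shows are simply the names of the list, so they can be
listed and checked before use. The following is a minimal sketch that
rebuilds the example above; the variable names 'keys' and 'key' are only
illustrative. On a newer R the sampled data (and therefore the set of keys)
may differ from the listing above, because the default algorithm behind
sample() changed in R 3.6.0.

```r
# Rebuild the example split from this message (same seed, same calls).
n <- 20
set.seed(1)
x <- data.frame(a = sample(LETTERS[1:2], n, TRUE),
                b = sample(letters[1:4], n, TRUE),
                val = runif(n))
z <- split(x, list(x$a, x$b), drop = TRUE)

# The keys are the names of the list returned by split().
keys <- names(z)
print(keys)

# Check that a key exists before using it: z[["no.such.key"]] returns NULL
# instead of raising an error, which can hide a typo in the key.
key <- "B.d"   # illustrative key; it may be absent with other random data
if (key %in% keys) {
  print(z[[key]])
} else {
  message("no group named ", key)
}
```

Testing with %in% is one reasonable choice here; is.null(z[[key]]) would
work as well.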

Here are some examples of accessing the data:

> z$B.d
   a b        val
9  B d 0.73231374
15 B d 0.07067905
17 B d 0.31627171
> # or just the value (it is a vector)
> z$B.d$val
[1] 0.73231374 0.07067905 0.31627171
> # or by name
> z[["B.d"]]$val
[1] 0.73231374 0.07067905 0.31627171
> # or by absolute number
> z[[8]]$val
[1] 0.73231374 0.07067905 0.31627171
> # take the mean
> mean(z$B.d$val)
[1] 0.3730882
> # get the length
> length(z$B.d$val)
[1] 3
>
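
On counting and iterating: base R has nrow(), not nrows(), and nrow() only
makes sense on the data frame itself, not on one of its columns. The sketch
below uses the same example data and shows the difference between z[1] and
z[[1]], then computes one summary per group with sapply(). The names
'first', 'group_n', 'group_mean' and 'group_rng' are only illustrative, and
the numbers will depend on the R version used to draw the sample.

```r
# Same example data as above.
n <- 20
set.seed(1)
x <- data.frame(a = sample(LETTERS[1:2], n, TRUE),
                b = sample(letters[1:4], n, TRUE),
                val = runif(n))
z <- split(x, list(x$a, x$b), drop = TRUE)

first <- z[[1]]              # [[ ]] extracts the data frame itself
print(nrow(first))           # number of rows in the first group
print(length(first$val))     # same count, taken from the numeric column
print(nrow(first$val))       # NULL: a plain vector has no dimensions
print(class(z[1]))           # single brackets return a list ...
print(length(z[1]))          # ... of length 1, whatever the group size

# Iterate over every group: one summary value (or pair) per key.
group_n    <- sapply(z, nrow)
group_mean <- sapply(z, function(d) mean(d$val))
group_rng  <- t(sapply(z, function(d) range(d$val)))
colnames(group_rng) <- c("min", "max")
print(cbind(n = group_n, mean = group_mean, group_rng))
```

So mean(z[[1]]$val), sum(z[[1]]$val) and range(z[[1]]$val) all work on a
single group, and sapply() (or lapply()) applies the same call to every
group without an explicit loop.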



>
> Now I can print out 'z[1], z[2] etc'. This is nice, but what if I want to
> access/iterate through all of the members of a particular column in z? You
> have given some methods like z[[1]]$b to access the specific columns in z. I
> notice for your example z[[1]]$b prints out two values. Can I assume that
> z[[1]]$b is a vector? So if I want to find the mean I can 'mean(z[[1]]$b)'
> and it will give me the mean value of the b columns in z? (similarly sum,
> and range, etc.). Does nrows(z[[1]]$b) return two in your example below? I
> would like to find out how many elements are in z[1]. Or would it be just as
> fast to do 'nrows(z[1])'?
>
> Thank you for this extended session on data frames, matrices, and vectors. I 
> feel much more comfortable with the concepts now.
>
> Kevin
> ---- jim holtman <[EMAIL PROTECTED]> wrote:
>> The reason for the empty levels was I did not put drop=TRUE on the
>> split to remove unused levels.  Here is the revised script:
>>
>> > set.seed(1)  # start with a known number
>> > x <- data.frame(cat=sample(LETTERS[1:3],20,TRUE),
>> +                 a=sample(letters[1:4], 20, TRUE), b=runif(20))
>> > x
>>    cat a          b
>> 1    A d 0.82094629
>> 2    B a 0.64706019
>> 3    B c 0.78293276
>> 4    C a 0.55303631
>> 5    A b 0.52971958
>> 6    C b 0.78935623
>> 7    C a 0.02333120
>> 8    B b 0.47723007
>> 9    B d 0.73231374
>> 10   A b 0.69273156
>> 11   A b 0.47761962
>> 12   A c 0.86120948
>> 13   C b 0.43809711
>> 14   B a 0.24479728
>> 15   C d 0.07067905
>> 16   B c 0.09946616
>> 17   C d 0.31627171
>> 18   C a 0.51863426
>> 19   B c 0.66200508
>> 20   C b 0.40683019
>> > # drop unused groups from the split
>> > (z <- split(x, list(x$cat, x$a), drop=TRUE))
>> $B.a
>>    cat a         b
>> 2    B a 0.6470602
>> 14   B a 0.2447973
>>
>> $C.a
>>    cat a          b
>> 4    C a 0.55303631
>> 7    C a 0.02333120
>> 18   C a 0.51863426
>>
>> $A.b
>>    cat a         b
>> 5    A b 0.5297196
>> 10   A b 0.6927316
>> 11   A b 0.4776196
>>
>> $B.b
>>   cat a         b
>> 8   B b 0.4772301
>>
>> $C.b
>>    cat a         b
>> 6    C b 0.7893562
>> 13   C b 0.4380971
>> 20   C b 0.4068302
>>
>> $A.c
>>    cat a         b
>> 12   A c 0.8612095
>>
>> $B.c
>>    cat a          b
>> 3    B c 0.78293276
>> 16   B c 0.09946616
>> 19   B c 0.66200508
>>
>> $A.d
>>   cat a         b
>> 1   A d 0.8209463
>>
>> $B.d
>>   cat a         b
>> 9   B d 0.7323137
>>
>> $C.d
>>    cat a          b
>> 15   C d 0.07067905
>> 17   C d 0.31627171
>>
>> > # access the value ('b' in this instance); two ways- should be the same
>> > z[[1]]$b
>> [1] 0.6470602 0.2447973
>> > z$B.a$b
>> [1] 0.6470602 0.2447973
>>
>>
>> On Sun, Jul 13, 2008 at 1:26 AM,  <[EMAIL PROTECTED]> wrote:
>> > This is almost it. Maybe it is as good as can be expected. The only
>> > problem that I see is that this seems to form a Category/SubCategory pair
>> > where none existed in the original data. For example, A might have two
>> > sub-categories a and b, and B might have two sub-categories c and d. As far
>> > as I can tell the method that you outlined forms a Category/SubCategory
>> > pair like B a or B b where none existed. This results in a lot of empty
>> > lists and it seems to take a long time to generate. But if that is as good
>> > as it gets then I can live with it.
>> >
>> > I know that I said one more question. But I have run into a problem.
>> > c <- split(x, x$Category) returns a vector of the rows in each of the
>> > categories. Now I would like to access the "Quantity" column within this
>> > split vector. I can see it listed. I just can't access it. I have tried
>> > c[1]$Quantity and c[1,2], both of which give me errors. Any ideas?
>> >
>> > Sorry this is so hard for me. I am more used to C type arrays and C type 
>> > arrays of structures. This seems to be somewhat different.
>> >
>> > Thank you.
>> >
>> > Kevin
>> > ---- jim holtman <[EMAIL PROTECTED]> wrote:
>> >> Is this something like what you were asking for?  The output of a
>> >> 'split' will be a list of the dataframe subsets for the categories you
>> >> have specified.
>> >>
>> >> > x <- data.frame(g1=sample(LETTERS[1:2],30,TRUE),
>> >> +     g2=sample(letters[1:2], 30, TRUE),
>> >> +     g3=1:30)
>> >> > y <- split(x, list(x$g1, x$g2))
>> >> > str(y)
>> >> List of 4
>> >>  $ A.a:'data.frame':    7 obs. of  3 variables:
>> >>   ..$ g1: Factor w/ 2 levels "A","B": 1 1 1 1 1 1 1
>> >>   ..$ g2: Factor w/ 2 levels "a","b": 1 1 1 1 1 1 1
>> >>   ..$ g3: int [1:7] 3 4 6 8 9 13 24
>> >>  $ B.a:'data.frame':    7 obs. of  3 variables:
>> >>   ..$ g1: Factor w/ 2 levels "A","B": 2 2 2 2 2 2 2
>> >>   ..$ g2: Factor w/ 2 levels "a","b": 1 1 1 1 1 1 1
>> >>   ..$ g3: int [1:7] 10 11 16 17 18 20 25
>> >>  $ A.b:'data.frame':    6 obs. of  3 variables:
>> >>   ..$ g1: Factor w/ 2 levels "A","B": 1 1 1 1 1 1
>> >>   ..$ g2: Factor w/ 2 levels "a","b": 2 2 2 2 2 2
>> >>   ..$ g3: int [1:6] 2 12 23 26 27 29
>> >>  $ B.b:'data.frame':    10 obs. of  3 variables:
>> >>   ..$ g1: Factor w/ 2 levels "A","B": 2 2 2 2 2 2 2 2 2 2
>> >>   ..$ g2: Factor w/ 2 levels "a","b": 2 2 2 2 2 2 2 2 2 2
>> >>   ..$ g3: int [1:10] 1 5 7 14 15 19 21 22 28 30
>> >> > y
>> >> $A.a
>> >>    g1 g2 g3
>> >> 3   A  a  3
>> >> 4   A  a  4
>> >> 6   A  a  6
>> >> 8   A  a  8
>> >> 9   A  a  9
>> >> 13  A  a 13
>> >> 24  A  a 24
>> >>
>> >> $B.a
>> >>    g1 g2 g3
>> >> 10  B  a 10
>> >> 11  B  a 11
>> >> 16  B  a 16
>> >> 17  B  a 17
>> >> 18  B  a 18
>> >> 20  B  a 20
>> >> 25  B  a 25
>> >>
>> >> $A.b
>> >>    g1 g2 g3
>> >> 2   A  b  2
>> >> 12  A  b 12
>> >> 23  A  b 23
>> >> 26  A  b 26
>> >> 27  A  b 27
>> >> 29  A  b 29
>> >>
>> >> $B.b
>> >>    g1 g2 g3
>> >> 1   B  b  1
>> >> 5   B  b  5
>> >> 7   B  b  7
>> >> 14  B  b 14
>> >> 15  B  b 15
>> >> 19  B  b 19
>> >> 21  B  b 21
>> >> 22  B  b 22
>> >> 28  B  b 28
>> >> 30  B  b 30
>> >>
>> >> > y[[2]]
>> >>    g1 g2 g3
>> >> 10  B  a 10
>> >> 11  B  a 11
>> >> 16  B  a 16
>> >> 17  B  a 17
>> >> 18  B  a 18
>> >> 20  B  a 20
>> >> 25  B  a 25
>> >>
>> >>
>> >> On Sat, Jul 12, 2008 at 8:51 PM,  <[EMAIL PROTECTED]> wrote:
>> >> > OK. Now I know that I am dealing with a data frame. One last question
>> >> > on this topic. a <- read.csv() gives me a dataframe. If I have 'c <-
>> >> > split(x, x$Category)', then what is returned by split in this case?
>> >> > c[1] seems to be OK but c[2] is not right in my mind. If I run ci <-
>> >> > split(nrow(a), a$Category), then ci[1] seems to be the rows
>> >> > associated with the first category, ci[2] is the indices/rows associated
>> >> > with the second category, etc. But this seems different than c[1],
>> >> > c[2], etc.
>> >> >
>> >> > Using the techniques below I can get the information on the categories. 
>> >> > Now as an extra level of complexity there are SubCategories within each 
>> >> > Category. Assume that the SubCategory names are not unique within the 
>> >> > dataset so if I want the SubCategory data I need to retrieve the indices
>> >> > (or data) for the Category and SubCategory pair. In other words if I 
>> >> > have a Category that ranges from 'A' to 'Z', it is possible that I 
>> >> > might have a subcategory A a, A b (where a and b are the sub category 
>> >> > names). I also might have B a, B b. I want all of the sub categories A 
>> >> > a. NOT the subcategories a (because that might include B a which would 
>> >> > be different). I am guessing that this will take more than a simple 
>> >> > 'split'.
>> >> >
>> >> > Thank you.
>> >> >
>> >> > Kevin
>> >> >
>> >> > ---- Duncan Murdoch <[EMAIL PROTECTED]> wrote:
>> >> >> On 12/07/2008 3:59 PM, [EMAIL PROTECTED] wrote:
>> >> >> > I am sorry but if read.csv returns a dataframe and a dataframe is 
>> >> >> > like a matrix and I have a set of input like below and a[1,] gives 
>> >> >> > me the first row, what is the second index? From what I read and 
>> >> >> > your input I am guessing that it is the column number. So a[1,1] 
>> >> >> > would return the DayOfYear column for the first row, right? What 
>> >> >> > does a$DayOfYear return?
>> >> >>
>> >> >> a$DayOfYear would be the same as a[,1] or a[,"DayOfYear"], i.e. it
>> >> >> would return the entire first column.
>> >> >>
>> >> >> Duncan Murdoch
>> >> >>
>> >> >> >
>> >> >> > Thank you for your patience.
>> >> >> >
>> >> >> > Kevin
>> >> >> >
>> >> >> > ---- Duncan Murdoch <[EMAIL PROTECTED]> wrote:
>> >> >> >> On 12/07/2008 12:31 PM, [EMAIL PROTECTED] wrote:
>> >> >> >>> I am using a simple R statement to read in the file:
>> >> >> >>>
>> >> >> >>> a <- read.csv("Sample.dat", header=TRUE)
>> >> >> >>>
>> >> >> >>> There is a lot of data but the first few lines look like:
>> >> >> >>>
>> >> >> >>> DayOfYear,Quantity,Fraction,Category,SubCategory
>> >> >> >>> 1,82,0.0000390392720794458,(Unknown),(Unknown)
>> >> >> >>> 2,78,0.0000371349173438631,(Unknown),(Unknown)
>> >> >> >>> . . .
>> >> >> >>> 71,2,0.0000009521773677913,WOMEN,Piratesses
>> >> >> >>> 72,4,0.0000019043547355827,WOMEN,Piratesses
>> >> >> >>> 73,3,0.0000014282660516870,WOMEN,Piratesses
>> >> >> >>> 74,14,0.0000066652415745395,WOMEN,Piratesses
>> >> >> >>> 75,2,0.0000009521773677913,WOMEN,Piratesses
>> >> >> >>>
>> >> >> >>> If I read the data in as above, the command
>> >> >> >>>
>> >> >> >>> a[1]
>> >> >> >>>
>> >> >> >>> results in the output
>> >> >> >>>
>> >> >> >>> [ reached getOption("max.print") -- omitted 16193 rows ]]
>> >> >> >>>
>> >> >> >>> Shouldn't this be the first row?
>> >> >> >> No, the first row would be a[1,].  read.csv() returns a dataframe,
>> >> >> >> and those are indexed with two indices to treat them like a matrix,
>> >> >> >> or with one index to treat them like a list of their columns.
>> >> >> >>
>> >> >> >> Duncan Murdoch
>> >> >> >>
>> >> >> >>> a$Category[1]
>> >> >> >>>
>> >> >> >>> results in the output
>> >> >> >>>
>> >> >> >>> [1] (Unknown)
>> >> >> >>> 4464 Levels:   Tags ... WOMEN
>> >> >> >>>
>> >> >> >>> But
>> >> >> >>>
>> >> >> >>> a$Category[365]
>> >> >> >>>
>> >> >> >>> gives me:
>> >> >> >>>
>> >> >> >>> [1] 7 Plates   (Dessert),Western\n120,5,0.0000023804434194784,7 
>> >> >> >>> Plates   (Dessert)
>> >> >> >>> 4464 Levels:   Tags ... WOMEN
>> >> >> >>>
>> >> >> >>> There is something fundamental about either vectors or the
>> >> >> >>> read.csv command that I am missing here.
>> >> >> >>>
>> >> >> >>> Thank you.
>> >> >> >>>
>> >> >> >>> Kevin
>> >> >> >>>
>> >> >> >>> ---- jim holtman <[EMAIL PROTECTED]> wrote:
>> >> >> >>>> Please provide commented, minimal, self-contained, reproducible
>> >> >> >>>> code, or at least a before/after of what your data would look like.
>> >> >> >>>> Taking a guess at what you are asking, here is one way of doing it:
>> >> >> >>>>
>> >> >> >>>>
>> >> >> >>>>> x <- data.frame(cat=sample(LETTERS[1:3],20,TRUE), a=1:20, b=runif(20))
>> >> >> >>>>> x
>> >> >> >>>>    cat  a          b
>> >> >> >>>> 1    B  1 0.65472393
>> >> >> >>>> 2    C  2 0.35319727
>> >> >> >>>> 3    B  3 0.27026015
>> >> >> >>>> 4    A  4 0.99268406
>> >> >> >>>> 5    C  5 0.63349326
>> >> >> >>>> 6    A  6 0.21320814
>> >> >> >>>> 7    C  7 0.12937235
>> >> >> >>>> 8    A  8 0.47811803
>> >> >> >>>> 9    A  9 0.92407447
>> >> >> >>>> 10   A 10 0.59876097
>> >> >> >>>> 11   A 11 0.97617069
>> >> >> >>>> 12   A 12 0.73179251
>> >> >> >>>> 13   B 13 0.35672691
>> >> >> >>>> 14   C 14 0.43147369
>> >> >> >>>> 15   C 15 0.14821156
>> >> >> >>>> 16   C 16 0.01307758
>> >> >> >>>> 17   B 17 0.71556607
>> >> >> >>>> 18   B 18 0.10318424
>> >> >> >>>> 19   C 19 0.44628435
>> >> >> >>>> 20   B 20 0.64010105
>> >> >> >>>>> # create a list of the indices of the data grouped by 'cat'
>> >> >> >>>>> split(seq(nrow(x)), x$cat)
>> >> >> >>>> $A
>> >> >> >>>> [1]  4  6  8  9 10 11 12
>> >> >> >>>>
>> >> >> >>>> $B
>> >> >> >>>> [1]  1  3 13 17 18 20
>> >> >> >>>>
>> >> >> >>>> $C
>> >> >> >>>> [1]  2  5  7 14 15 16 19
>> >> >> >>>>
>> >> >> >>>>> # or do you want the data
>> >> >> >>>>> split(x, x$cat)
>> >> >> >>>> $A
>> >> >> >>>>    cat  a         b
>> >> >> >>>> 4    A  4 0.9926841
>> >> >> >>>> 6    A  6 0.2132081
>> >> >> >>>> 8    A  8 0.4781180
>> >> >> >>>> 9    A  9 0.9240745
>> >> >> >>>> 10   A 10 0.5987610
>> >> >> >>>> 11   A 11 0.9761707
>> >> >> >>>> 12   A 12 0.7317925
>> >> >> >>>>
>> >> >> >>>> $B
>> >> >> >>>>    cat  a         b
>> >> >> >>>> 1    B  1 0.6547239
>> >> >> >>>> 3    B  3 0.2702601
>> >> >> >>>> 13   B 13 0.3567269
>> >> >> >>>> 17   B 17 0.7155661
>> >> >> >>>> 18   B 18 0.1031842
>> >> >> >>>> 20   B 20 0.6401010
>> >> >> >>>>
>> >> >> >>>> $C
>> >> >> >>>>    cat  a          b
>> >> >> >>>> 2    C  2 0.35319727
>> >> >> >>>> 5    C  5 0.63349326
>> >> >> >>>> 7    C  7 0.12937235
>> >> >> >>>> 14   C 14 0.43147369
>> >> >> >>>> 15   C 15 0.14821156
>> >> >> >>>> 16   C 16 0.01307758
>> >> >> >>>> 19   C 19 0.44628435
>> >> >> >>>>
>> >> >> >>>>
>> >> >> >>>> On Sat, Jul 12, 2008 at 3:32 AM,  <[EMAIL PROTECTED]> wrote:
>> >> >> >>>>> I have searched the archive and I could not find what I need, so I
>> >> >> >>>>> will try to ask the question here.
>> >> >> >>>>>
>> >> >> >>>>> I read a table in (read.table)
>> >> >> >>>>>
>> >> >> >>>>> a <- read.table(.....)
>> >> >> >>>>>
>> >> >> >>>>> The table has column names like DayOfYear, Quantity, and 
>> >> >> >>>>> Category.
>> >> >> >>>>>
>> >> >> >>>>> The values in the row for Category are strings (characters).
>> >> >> >>>>>
>> >> >> >>>>> I want to get all of the rows grouped by Category. The number of
>> >> >> >>>>> unique category names could be around 50. Say for argument's sake
>> >> >> >>>>> the number of categories is exactly 50. Can I somehow get a
>> >> >> >>>>> vector of length 50 containing the rows corresponding to the
>> >> >> >>>>> category (another vector)? I realize I can access any row
>> >> >> >>>>> a[i]$Category (right?). But I want a vector containing the rows
>> >> >> >>>>> corresponding to each distinct Category name.
>> >> >> >>>>>
>> >> >> >>>>> Thank you.
>> >> >> >>>>>
>> >> >> >>>>> Kevin
>> >> >> >>>>>
>> >> >> >>>>> ______________________________________________
>> >> >> >>>>> R-help@r-project.org mailing list
>> >> >> >>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>> >> >> >>>>> PLEASE do read the posting guide 
>> >> >> >>>>> http://www.R-project.org/posting-guide.html
>> >> >> >>>>> and provide commented, minimal, self-contained, reproducible 
>> >> >> >>>>> code.
>> >> >> >>>>>
>> >> >> >>>>
>> >> >> >>>> --
>> >> >> >>>> Jim Holtman
>> >> >> >>>> Cincinnati, OH
>> >> >> >>>> +1 513 646 9390
>> >> >> >>>>
>> >> >> >>>> What is the problem you are trying to solve?

-- 
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem you are trying to solve?
