Thank you, I will try drop=TRUE.
In the meantime, do you know how I can access the members (for lack of a better
term) of the result of a split? In the sample you provided below you have:
z <- split(x, list(x$cat, x$a), drop=TRUE)
Now I can print out z[1], z[2], etc. This is nice, but what if I want to
access or iterate through all of the members of a particular column in z? You have
given some methods like z[[1]]$b to access specific columns in z. I notice
that for your example z[[1]]$b prints out two values. Can I assume that z[[1]]$b is
a vector? So if I want to find the mean, I can use mean(z[[1]]$b) and it will give
me the mean of the b column in the first group of z (similarly sum, range, etc.)?
Does nrows(z[[1]]$b) return two in your example below? I would like to find out
how many elements are in z[1]. Or would it be just as fast to do nrows(z[1])?
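To make the question concrete, here is a small sketch of what I think should work. It uses a tiny hand-made data frame (the values are made up for illustration, not taken from the example below), and it assumes that nrow() is the function to use for counting rows, since as far as I know base R has no nrows():

```r
# Hand-made stand-in for the data frame in the quoted example (values are illustrative)
x <- data.frame(cat = c("A", "A", "B", "B", "B"),
                a   = c("a", "a", "a", "b", "b"),
                b   = c(1, 3, 2, 4, 6))

# Split into a list of data frames, one per cat/a pair that actually occurs
z <- split(x, list(x$cat, x$a), drop = TRUE)

# z[[1]] is a data frame, so z[[1]]$b is an ordinary numeric vector
first_b    <- z[[1]]$b
first_mean <- mean(first_b)

# Count elements: length() for a vector, nrow() for a data frame
n_vec <- length(first_b)
n_df  <- nrow(z[[1]])

# z[1] (single brackets) is still a list of length one, so nrow() gives NULL
n_list <- nrow(z[1])

# Iterate over every group: one mean of b and one row count per data frame
group_means <- sapply(z, function(d) mean(d$b))
group_sizes <- sapply(z, nrow)
```

If this is right, sapply(z, nrow) gives the number of rows in every group in one call, and z[["B.a"]]$b or z$B.a$b picks a group by name instead of by position.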
Thank you for this extended session on data frames, matrices, and vectors. I
feel much more comfortable with the concepts now.
Kevin
---- jim holtman <[EMAIL PROTECTED]> wrote:
> The reason for the empty levels was I did not put drop=TRUE on the
> split to remove unused levels. Here is the revised script:
>
> > set.seed(1) # start with a known number
> > x <- data.frame(cat=sample(LETTERS[1:3], 20, TRUE),
> +                 a=sample(letters[1:4], 20, TRUE),
> +                 b=runif(20))
> > x
> cat a b
> 1 A d 0.82094629
> 2 B a 0.64706019
> 3 B c 0.78293276
> 4 C a 0.55303631
> 5 A b 0.52971958
> 6 C b 0.78935623
> 7 C a 0.02333120
> 8 B b 0.47723007
> 9 B d 0.73231374
> 10 A b 0.69273156
> 11 A b 0.47761962
> 12 A c 0.86120948
> 13 C b 0.43809711
> 14 B a 0.24479728
> 15 C d 0.07067905
> 16 B c 0.09946616
> 17 C d 0.31627171
> 18 C a 0.51863426
> 19 B c 0.66200508
> 20 C b 0.40683019
> > # drop unused groups from the split
> > (z <- split(x, list(x$cat, x$a), drop=TRUE))
> $B.a
> cat a b
> 2 B a 0.6470602
> 14 B a 0.2447973
>
> $C.a
> cat a b
> 4 C a 0.55303631
> 7 C a 0.02333120
> 18 C a 0.51863426
>
> $A.b
> cat a b
> 5 A b 0.5297196
> 10 A b 0.6927316
> 11 A b 0.4776196
>
> $B.b
> cat a b
> 8 B b 0.4772301
>
> $C.b
> cat a b
> 6 C b 0.7893562
> 13 C b 0.4380971
> 20 C b 0.4068302
>
> $A.c
> cat a b
> 12 A c 0.8612095
>
> $B.c
> cat a b
> 3 B c 0.78293276
> 16 B c 0.09946616
> 19 B c 0.66200508
>
> $A.d
> cat a b
> 1 A d 0.8209463
>
> $B.d
> cat a b
> 9 B d 0.7323137
>
> $C.d
> cat a b
> 15 C d 0.07067905
> 17 C d 0.31627171
>
> > # access the value ('b' in this instance); two ways- should be the same
> > z[[1]]$b
> [1] 0.6470602 0.2447973
> > z$B.a$b
> [1] 0.6470602 0.2447973
>
> On Sun, Jul 13, 2008 at 1:26 AM, <[EMAIL PROTECTED]> wrote:
> > This is almost it. Maybe it is as good as can be expected. The only problem
> > that I see is that this seems to form a Category/SubCategory pair where
> > none existed in the original data. For example, A might have two
> > sub-categories a and b, and B might have two sub-categories c and d. As far as
> > I can tell the method that you outlined forms a Category/SubCategory pair
> > like B a or B b where none existed. This results in a lot of empty lists and
> > it seems to take a long time to generate. But if that is as good as it gets
> > then I can live with it.
> >
> > I know that I said one more question. But I have run into a problem. c <-
> > split(x, x$Category) returns a vector of the rows in each of the
> > categories. Now I would like to access the "Quantity" column within this
> > split vector. I can see it listed. I just can't access it. I have tried
> > c[1]$Quantity and c[1,2], both of which give me errors. Any ideas?
> >
> > Sorry this is so hard for me. I am more used to C type arrays and C type
> > arrays of structures. This seems to be somewhat different.
> >
> > Thank you.
> >
> > Kevin
> > ---- jim holtman <[EMAIL PROTECTED]> wrote:
> >> Is this something like what you were asking for? The output of a
> >> 'split' will be a list of the dataframe subsets for the categories you
> >> have specified.
> >>
> >> > x <- data.frame(g1=sample(LETTERS[1:2],30,TRUE),
> >> + g2=sample(letters[1:2], 30, TRUE),
> >> + g3=1:30)
> >> > y <- split(x, list(x$g1, x$g2))
> >> > str(y)
> >> List of 4
> >> $ A.a:'data.frame': 7 obs. of 3 variables:
> >> ..$ g1: Factor w/ 2 levels "A","B": 1 1 1 1 1 1 1
> >> ..$ g2: Factor w/ 2 levels "a","b": 1 1 1 1 1 1 1
> >> ..$ g3: int [1:7] 3 4 6 8 9 13 24
> >> $ B.a:'data.frame': 7 obs. of 3 variables:
> >> ..$ g1: Factor w/ 2 levels "A","B": 2 2 2 2 2 2 2
> >> ..$ g2: Factor w/ 2 levels "a","b": 1 1 1 1 1 1 1
> >> ..$ g3: int [1:7] 10 11 16 17 18 20 25
> >> $ A.b:'data.frame': 6 obs. of 3 variables:
> >> ..$ g1: Factor w/ 2 levels "A","B": 1 1 1 1 1 1
> >> ..$ g2: Factor w/ 2 levels "a","b": 2 2 2 2 2 2
> >> ..$ g3: int [1:6] 2 12 23 26 27 29
> >> $ B.b:'data.frame': 10 obs. of 3 variables:
> >> ..$ g1: Factor w/ 2 levels "A","B": 2 2 2 2 2 2 2 2 2 2
> >> ..$ g2: Factor w/ 2 levels "a","b": 2 2 2 2 2 2 2 2 2 2
> >> ..$ g3: int [1:10] 1 5 7 14 15 19 21 22 28 30
> >> > y
> >> $A.a
> >> g1 g2 g3
> >> 3 A a 3
> >> 4 A a 4
> >> 6 A a 6
> >> 8 A a 8
> >> 9 A a 9
> >> 13 A a 13
> >> 24 A a 24
> >>
> >> $B.a
> >> g1 g2 g3
> >> 10 B a 10
> >> 11 B a 11
> >> 16 B a 16
> >> 17 B a 17
> >> 18 B a 18
> >> 20 B a 20
> >> 25 B a 25
> >>
> >> $A.b
> >> g1 g2 g3
> >> 2 A b 2
> >> 12 A b 12
> >> 23 A b 23
> >> 26 A b 26
> >> 27 A b 27
> >> 29 A b 29
> >>
> >> $B.b
> >> g1 g2 g3
> >> 1 B b 1
> >> 5 B b 5
> >> 7 B b 7
> >> 14 B b 14
> >> 15 B b 15
> >> 19 B b 19
> >> 21 B b 21
> >> 22 B b 22
> >> 28 B b 28
> >> 30 B b 30
> >>
> >> > y[[2]]
> >> g1 g2 g3
> >> 10 B a 10
> >> 11 B a 11
> >> 16 B a 16
> >> 17 B a 17
> >> 18 B a 18
> >> 20 B a 20
> >> 25 B a 25
> >>
> >> On Sat, Jul 12, 2008 at 8:51 PM, <[EMAIL PROTECTED]> wrote:
> >> > OK. Now I know that I am dealing with a data frame. One last question on
> >> > this topic. a <- read.csv() gives me a dataframe. If I have 'c <-
> >> > split(x, x$Category)', then what is returned by split in this case? c[1]
> >> > seems to be OK but c[2] is not right in my mind. If I run ci <-
> >> > split(nrow(a), a$Category), then ci[1] seems to be the rows
> >> > associated with the first category, ci[2] is the indices/rows associated
> >> > with the second category, etc. But this seems different than c[1], c[2],
> >> > etc.
> >> >
> >> > Using the techniques below I can get the information on the categories.
> >> > Now as an extra level of complexity there are SubCategories within each
> >> > Category. Assume that the SubCategory names are not unique within the
> >> > dataset, so if I want the SubCategory data I need to retrieve the indices
> >> > (or data) for the Category and SubCategory pair. In other words if I
> >> > have a Category that ranges from 'A' to 'Z', it is possible that I might
> >> > have a subcategory A a, A b (where a and b are the sub category names).
> >> > I also might have B a, B b. I want all of the sub categories A a. NOT
> >> > the subcategories a (because that might include B a which would be
> >> > different). I am guessing that this will take more than a simple 'split'.
> >> >
> >> > Thank you.
> >> >
> >> > Kevin
> >> >
> >> > ---- Duncan Murdoch <[EMAIL PROTECTED]> wrote:
> >> >> On 12/07/2008 3:59 PM, [EMAIL PROTECTED] wrote:
> >> >> > I am sorry but if read.csv returns a dataframe and a dataframe is
> >> >> > like a matrix and I have a set of input like below and a[1,] gives me
> >> >> > the first row, what is the second index? From what I read and your
> >> >> > input I am guessing that it is the column number. So a[1,1] would
> >> >> > return the DayOfYear column for the first row, right? What does
> >> >> > a$DayOfYear return?
> >> >>
> >> >> a$DayOfYear would be the same as a[,1] or a[,"DayOfYear"], i.e. it would
> >> >> return the entire first column.
> >> >>
> >> >> Duncan Murdoch
> >> >>
> >> >> >
> >> >> > Thank you for your patience.
> >> >> >
> >> >> > Kevin
> >> >> >
> >> >> > ---- Duncan Murdoch <[EMAIL PROTECTED]> wrote:
> >> >> >> On 12/07/2008 12:31 PM, [EMAIL PROTECTED] wrote:
> >> >> >>> I am using a simple R statement to read in the file:
> >> >> >>>
> >> >> >>> a <- read.csv("Sample.dat", header=TRUE)
> >> >> >>>
> >> >> >>> There is a lot of data but the first few lines look like:
> >> >> >>>
> >> >> >>> DayOfYear,Quantity,Fraction,Category,SubCategory
> >> >> >>> 1,82,0.0000390392720794458,(Unknown),(Unknown)
> >> >> >>> 2,78,0.0000371349173438631,(Unknown),(Unknown)
> >> >> >>> . . .
> >> >> >>> 71,2,0.0000009521773677913,WOMEN,Piratesses
> >> >> >>> 72,4,0.0000019043547355827,WOMEN,Piratesses
> >> >> >>> 73,3,0.0000014282660516870,WOMEN,Piratesses
> >> >> >>> 74,14,0.0000066652415745395,WOMEN,Piratesses
> >> >> >>> 75,2,0.0000009521773677913,WOMEN,Piratesses
> >> >> >>>
> >> >> >>> If I read the data in as above, the command
> >> >> >>>
> >> >> >>> a[1]
> >> >> >>>
> >> >> >>> results in the output
> >> >> >>>
> >> >> >>> [ reached getOption("max.print") -- omitted 16193 rows ]]
> >> >> >>>
> >> >> >>> Shouldn't this be the first row?
> >> >> >> No, the first row would be a[1,]. read.csv() returns a dataframe, and
> >> >> >> those are indexed with two indices to treat them like a matrix, or with
> >> >> >> one index to treat them like a list of their columns.
> >> >> >>
> >> >> >> Duncan Murdoch
> >> >> >>
> >> >> >>> a$Category[1]
> >> >> >>>
> >> >> >>> results in the output
> >> >> >>>
> >> >> >>> [1] (Unknown)
> >> >> >>> 4464 Levels: Tags ... WOMEN
> >> >> >>>
> >> >> >>> But
> >> >> >>>
> >> >> >>> a$Category[365]
> >> >> >>>
> >> >> >>> gives me:
> >> >> >>>
> >> >> >>> [1] 7 Plates (Dessert),Western\n120,5,0.0000023804434194784,7
> >> >> >>> Plates (Dessert)
> >> >> >>> 4464 Levels: Tags ... WOMEN
> >> >> >>>
> >> >> >>> There is something fundamental about either vectors or the read.csv
> >> >> >>> command that I am missing here.
> >> >> >>>
> >> >> >>> Thank you.
> >> >> >>>
> >> >> >>> Kevin
> >> >> >>>
> >> >> >>> ---- jim holtman <[EMAIL PROTECTED]> wrote:
> >> >> >>>> Please provide commented, minimal, self-contained, reproducible code,
> >> >> >>>> or at least a before/after of what your data would look like. Taking a
> >> >> >>>> guess at what you are asking, here is one way of doing it:
> >> >> >>>>
> >> >> >>>>
> >> >> >>>>> x <- data.frame(cat=sample(LETTERS[1:3],20,TRUE),a=1:20,
> >> >> >>>>> b=runif(20))
> >> >> >>>>> x
> >> >> >>>> cat a b
> >> >> >>>> 1 B 1 0.65472393
> >> >> >>>> 2 C 2 0.35319727
> >> >> >>>> 3 B 3 0.27026015
> >> >> >>>> 4 A 4 0.99268406
> >> >> >>>> 5 C 5 0.63349326
> >> >> >>>> 6 A 6 0.21320814
> >> >> >>>> 7 C 7 0.12937235
> >> >> >>>> 8 A 8 0.47811803
> >> >> >>>> 9 A 9 0.92407447
> >> >> >>>> 10 A 10 0.59876097
> >> >> >>>> 11 A 11 0.97617069
> >> >> >>>> 12 A 12 0.73179251
> >> >> >>>> 13 B 13 0.35672691
> >> >> >>>> 14 C 14 0.43147369
> >> >> >>>> 15 C 15 0.14821156
> >> >> >>>> 16 C 16 0.01307758
> >> >> >>>> 17 B 17 0.71556607
> >> >> >>>> 18 B 18 0.10318424
> >> >> >>>> 19 C 19 0.44628435
> >> >> >>>> 20 B 20 0.64010105
> >> >> >>>>> # create a list of the indices of the data grouped by 'cat'
> >> >> >>>>> split(seq(nrow(x)), x$cat)
> >> >> >>>> $A
> >> >> >>>> [1] 4 6 8 9 10 11 12
> >> >> >>>>
> >> >> >>>> $B
> >> >> >>>> [1] 1 3 13 17 18 20
> >> >> >>>>
> >> >> >>>> $C
> >> >> >>>> [1] 2 5 7 14 15 16 19
> >> >> >>>>
> >> >> >>>>> # or do you want the data
> >> >> >>>>> split(x, x$cat)
> >> >> >>>> $A
> >> >> >>>> cat a b
> >> >> >>>> 4 A 4 0.9926841
> >> >> >>>> 6 A 6 0.2132081
> >> >> >>>> 8 A 8 0.4781180
> >> >> >>>> 9 A 9 0.9240745
> >> >> >>>> 10 A 10 0.5987610
> >> >> >>>> 11 A 11 0.9761707
> >> >> >>>> 12 A 12 0.7317925
> >> >> >>>>
> >> >> >>>> $B
> >> >> >>>> cat a b
> >> >> >>>> 1 B 1 0.6547239
> >> >> >>>> 3 B 3 0.2702601
> >> >> >>>> 13 B 13 0.3567269
> >> >> >>>> 17 B 17 0.7155661
> >> >> >>>> 18 B 18 0.1031842
> >> >> >>>> 20 B 20 0.6401010
> >> >> >>>>
> >> >> >>>> $C
> >> >> >>>> cat a b
> >> >> >>>> 2 C 2 0.35319727
> >> >> >>>> 5 C 5 0.63349326
> >> >> >>>> 7 C 7 0.12937235
> >> >> >>>> 14 C 14 0.43147369
> >> >> >>>> 15 C 15 0.14821156
> >> >> >>>> 16 C 16 0.01307758
> >> >> >>>> 19 C 19 0.44628435
> >> >> >>>>
> >> >> >>>>
> >> >> >>>> On Sat, Jul 12, 2008 at 3:32 AM, <[EMAIL PROTECTED]> wrote:
> >> >> >>>>> I have searched the archive and I could not find what I need so I
> >> >> >>>>> will try to ask the question here.
> >> >> >>>>>
> >> >> >>>>> I read a table in (read.table)
> >> >> >>>>>
> >> >> >>>>> a <- read.table(.....)
> >> >> >>>>>
> >> >> >>>>> The table has column names like DayOfYear, Quantity, and Category.
> >> >> >>>>>
> >> >> >>>>> The values in the row for Category are strings (characters).
> >> >> >>>>>
> >> >> >>>>> I want to get all of the rows grouped by Category. The number of
> >> >> >>>>> unique category names could be around 50. Say for argument's sake
> >> >> >>>>> the number of categories is exactly 50. Can I somehow get a
> >> >> >>>>> vector of length 50 containing the rows corresponding to the
> >> >> >>>>> category (another vector)? I realize I can access any row
> >> >> >>>>> a[i]$Category (right?). But I want a vector containing the rows
> >> >> >>>>> corresponding to each distinct Category name.
> >> >> >>>>>
> >> >> >>>>> Thank you.
> >> >> >>>>>
> >> >> >>>>> Kevin
> >> >> >>>>>
> >> >> >>>>> ______________________________________________
> >> >> >>>>> [email protected] mailing list
> >> >> >>>>> https://stat.ethz.ch/mailman/listinfo/r-help
> >> >> >>>>> PLEASE do read the posting guide
> >> >> >>>>> http://www.R-project.org/posting-guide.html
> >> >> >>>>> and provide commented, minimal, self-contained, reproducible code.
> >> >> >>>>>
> >> >> >>>>
> >> >> >>>> --
> >> >> >>>> Jim Holtman
> >> >> >>>> Cincinnati, OH
> >> >> >>>> +1 513 646 9390
> >> >> >>>>
> >> >> >>>> What is the problem you are trying to solve?