The empty levels appeared because I did not pass drop=TRUE to split(),
which removes unused level combinations. Here is the revised script:
> set.seed(1) # start with a known number
> x <- data.frame(cat=sample(LETTERS[1:3],20,TRUE),
+                 a=sample(letters[1:4], 20, TRUE),
+                 b=runif(20))
> x
cat a b
1 A d 0.82094629
2 B a 0.64706019
3 B c 0.78293276
4 C a 0.55303631
5 A b 0.52971958
6 C b 0.78935623
7 C a 0.02333120
8 B b 0.47723007
9 B d 0.73231374
10 A b 0.69273156
11 A b 0.47761962
12 A c 0.86120948
13 C b 0.43809711
14 B a 0.24479728
15 C d 0.07067905
16 B c 0.09946616
17 C d 0.31627171
18 C a 0.51863426
19 B c 0.66200508
20 C b 0.40683019
> # drop unused groups from the split
> (z <- split(x, list(x$cat, x$a), drop=TRUE))
$B.a
cat a b
2 B a 0.6470602
14 B a 0.2447973

$C.a
cat a b
4 C a 0.55303631
7 C a 0.02333120
18 C a 0.51863426

$A.b
cat a b
5 A b 0.5297196
10 A b 0.6927316
11 A b 0.4776196

$B.b
cat a b
8 B b 0.4772301

$C.b
cat a b
6 C b 0.7893562
13 C b 0.4380971
20 C b 0.4068302

$A.c
cat a b
12 A c 0.8612095

$B.c
cat a b
3 B c 0.78293276
16 B c 0.09946616
19 B c 0.66200508

$A.d
cat a b
1 A d 0.8209463

$B.d
cat a b
9 B d 0.7323137

$C.d
cat a b
15 C d 0.07067905
17 C d 0.31627171
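
To make the effect of drop=TRUE easier to see, here is a small sketch with hand-made data. The data frame `d`, its column names, and its values are invented for illustration and are not from the original data. Without drop=TRUE, split() returns one element for every combination of levels, including pairs that never occur; with it, only the observed pairs remain.

```r
# Hypothetical data: category A only has sub-categories a and b,
# category B only has sub-categories c and d.
d <- data.frame(cat = c("A", "A", "B", "B", "B"),
                sub = c("a", "b", "c", "d", "d"),
                qty = c(10, 20, 30, 40, 50))

# Default: every cat/sub combination gets a slot, even the empty ones
all_pairs <- split(d, list(d$cat, d$sub))
length(all_pairs)          # 2 categories x 4 sub-categories
sapply(all_pairs, nrow)    # several entries are 0

# drop=TRUE keeps only the combinations present in the data
used_pairs <- split(d, list(d$cat, d$sub), drop = TRUE)
names(used_pairs)
```

With many categories and sub-categories the number of empty combinations grows quickly, which is probably also why the version without drop=TRUE felt slow.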
> # access the value ('b' in this instance); two ways- should be the same
> z[[1]]$b
[1] 0.6470602 0.2447973
> z$B.a$b
[1] 0.6470602 0.2447973
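
On c[1]$Quantity and c[1,2] not working: below is a rough sketch of the difference between single and double brackets on the result of split(). The names `d` and `parts` and the values are made up for illustration. Single brackets return a shorter list, so $Quantity finds nothing there; double brackets (or $ with the group name) return the data frame itself.

```r
# Hypothetical data with the column names from the question
d <- data.frame(Category = c("A", "A", "B"),
                Quantity = c(82, 78, 2))
parts <- split(d, d$Category)

class(parts[1])            # still a list, holding one data frame
class(parts[[1]])          # the data frame itself

parts[1]$Quantity          # NULL: the one-element list has no 'Quantity'
parts[[1]]$Quantity        # the Quantity column for category A
parts$A$Quantity           # same thing, selecting the group by name
parts[["A"]][, "Quantity"] # matrix-style indexing on the data frame

# Apply a function to the column in every group
sapply(parts, function(p) sum(p$Quantity))
```

Compared with C, the list returned by split() behaves like an array of structures, and [[i]] is the step that picks out one structure; a two-index form such as c[1,2] only makes sense on the data frame inside, not on the list.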
On Sun, Jul 13, 2008 at 1:26 AM, <[EMAIL PROTECTED]> wrote:
> This is almost it. Maybe it is as good as can be expected. The only problem
> that I see is that this seems to form a Category/SubCategory pair where none
> existed in the original data. For example, A might have two sub-categories a
> and b, and B might have two sub-categories c and d. As far as I can tell the
> method that you outlined forms a Category/SubCategory pair like B a or B b
> where none existed. This results in a lot of empty lists and it seems to take
> a long time to generate. But if that is as good as it gets then I can live
> with it.
>
> I know that I said one more question. But I have run into a problem. c <-
> split(x, x$Category) returns a vector of the rows in each of the categories.
> Now I would like to access the "Quantity" column within this split vector. I
> can see it listed. I just can't access it. I have tried c[1]$Quantity and
> c[1,2], both of which give me errors. Any ideas?
>
> Sorry this is so hard for me. I am more used to C type arrays and C type
> arrays of structures. This seems to be somewhat different.
>
> Thank you.
>
> Kevin
> ---- jim holtman <[EMAIL PROTECTED]> wrote:
>> Is this something like what you were asking for? The output of a
>> 'split' will be a list of the dataframe subsets for the categories you
>> have specified.
>>
>> > x <- data.frame(g1=sample(LETTERS[1:2],30,TRUE),
>> + g2=sample(letters[1:2], 30, TRUE),
>> + g3=1:30)
>> > y <- split(x, list(x$g1, x$g2))
>> > str(y)
>> List of 4
>> $ A.a:'data.frame': 7 obs. of 3 variables:
>> ..$ g1: Factor w/ 2 levels "A","B": 1 1 1 1 1 1 1
>> ..$ g2: Factor w/ 2 levels "a","b": 1 1 1 1 1 1 1
>> ..$ g3: int [1:7] 3 4 6 8 9 13 24
>> $ B.a:'data.frame': 7 obs. of 3 variables:
>> ..$ g1: Factor w/ 2 levels "A","B": 2 2 2 2 2 2 2
>> ..$ g2: Factor w/ 2 levels "a","b": 1 1 1 1 1 1 1
>> ..$ g3: int [1:7] 10 11 16 17 18 20 25
>> $ A.b:'data.frame': 6 obs. of 3 variables:
>> ..$ g1: Factor w/ 2 levels "A","B": 1 1 1 1 1 1
>> ..$ g2: Factor w/ 2 levels "a","b": 2 2 2 2 2 2
>> ..$ g3: int [1:6] 2 12 23 26 27 29
>> $ B.b:'data.frame': 10 obs. of 3 variables:
>> ..$ g1: Factor w/ 2 levels "A","B": 2 2 2 2 2 2 2 2 2 2
>> ..$ g2: Factor w/ 2 levels "a","b": 2 2 2 2 2 2 2 2 2 2
>> ..$ g3: int [1:10] 1 5 7 14 15 19 21 22 28 30
>> > y
>> $A.a
>> g1 g2 g3
>> 3 A a 3
>> 4 A a 4
>> 6 A a 6
>> 8 A a 8
>> 9 A a 9
>> 13 A a 13
>> 24 A a 24
>>
>> $B.a
>> g1 g2 g3
>> 10 B a 10
>> 11 B a 11
>> 16 B a 16
>> 17 B a 17
>> 18 B a 18
>> 20 B a 20
>> 25 B a 25
>>
>> $A.b
>> g1 g2 g3
>> 2 A b 2
>> 12 A b 12
>> 23 A b 23
>> 26 A b 26
>> 27 A b 27
>> 29 A b 29
>>
>> $B.b
>> g1 g2 g3
>> 1 B b 1
>> 5 B b 5
>> 7 B b 7
>> 14 B b 14
>> 15 B b 15
>> 19 B b 19
>> 21 B b 21
>> 22 B b 22
>> 28 B b 28
>> 30 B b 30
>>
>> > y[[2]]
>> g1 g2 g3
>> 10 B a 10
>> 11 B a 11
>> 16 B a 16
>> 17 B a 17
>> 18 B a 18
>> 20 B a 20
>> 25 B a 25
>> >
>> >
>> >
>>
>>
>> On Sat, Jul 12, 2008 at 8:51 PM, <[EMAIL PROTECTED]> wrote:
>> > OK. Now I know that I am dealing with a data frame. One last question on
>> > this topic. a <- read.csv() gives me a dataframe. If I have 'c <- split(x,
>> > x$Category)', then what is returned by split in this case? c[1] seems to
>> > be OK but c[2] is not right in my mind. If I run ci <- split(seq(nrow(a)),
>> > a$Category), then ci[1] seems to be the rows associated with the first
>> > category, ci[2] is the indices/rows associated with the second category,
>> > etc. But this seems different than c[1], c[2], etc.
>> >
>> > Using the techniques below I can get the information on the categories.
>> > Now as an extra level of complexity there are SubCategories within each
>> > Category. Assume that the SubCategory names are not unique within the
>> > dataset so if I want the SubCategory data I need to retrieve the indices
>> > (or data) for the Category and SubCategory pair. In other words if I have
>> > a Category that ranges from 'A' to 'Z', it is possible that I might have a
>> > subcategory A a, A b (where a and b are the sub category names). I also
>> > might have B a, B b. I want all of the sub categories A a. NOT the
>> > subcategories a (because that might include B a which would be different).
>> > I am guessing that this will take more than a simple 'split'.
>> >
>> > Thank you.
>> >
>> > Kevin
>> >
>> > ---- Duncan Murdoch <[EMAIL PROTECTED]> wrote:
>> >> On 12/07/2008 3:59 PM, [EMAIL PROTECTED] wrote:
>> >> > I am sorry but if read.csv returns a dataframe and a dataframe is like
>> >> > a matrix and I have a set of input like below and a[1,] gives me the
>> >> > first row, what is the second index? From what I read and your input I
>> >> > am guessing that it is the column number. So a[1,1] would return the
>> >> > DayOfYear column for the first row, right? What does a$DayOfYear return?
>> >>
>> >> a$DayOfYear would be the same as a[,1] or a[,"DayOfYear"], i.e. it would
>> >> return the entire first column.
>> >>
>> >> Duncan Murdoch
>> >>
>> >> >
>> >> > Thank you for your patience.
>> >> >
>> >> > Kevin
>> >> >
>> >> > ---- Duncan Murdoch <[EMAIL PROTECTED]> wrote:
>> >> >> On 12/07/2008 12:31 PM, [EMAIL PROTECTED] wrote:
>> >> >>> I am using a simple R statement to read in the file:
>> >> >>>
>> >> >>> a <- read.csv("Sample.dat", header=TRUE)
>> >> >>>
>> >> >>> There is a lot of data but the first few lines look like:
>> >> >>>
>> >> >>> DayOfYear,Quantity,Fraction,Category,SubCategory
>> >> >>> 1,82,0.0000390392720794458,(Unknown),(Unknown)
>> >> >>> 2,78,0.0000371349173438631,(Unknown),(Unknown)
>> >> >>> . . .
>> >> >>> 71,2,0.0000009521773677913,WOMEN,Piratesses
>> >> >>> 72,4,0.0000019043547355827,WOMEN,Piratesses
>> >> >>> 73,3,0.0000014282660516870,WOMEN,Piratesses
>> >> >>> 74,14,0.0000066652415745395,WOMEN,Piratesses
>> >> >>> 75,2,0.0000009521773677913,WOMEN,Piratesses
>> >> >>>
>> >> >>> If I read the data in as above, the command
>> >> >>>
>> >> >>> a[1]
>> >> >>>
>> >> >>> results in the output
>> >> >>>
>> >> >>> [ reached getOption("max.print") -- omitted 16193 rows ]]
>> >> >>>
>> >> >>> Shouldn't this be the first row?
>> >> >> No, the first row would be a[1,]. read.csv() returns a dataframe, and
>> >> >> those are indexed with two indices to treat them like a matrix, or with
>> >> >> one index to treat them like a list of their columns.
>> >> >>
>> >> >> Duncan Murdoch
>> >> >>
>> >> >>> a$Category[1]
>> >> >>>
>> >> >>> results in the output
>> >> >>>
>> >> >>> [1] (Unknown)
>> >> >>> 4464 Levels: Tags ... WOMEN
>> >> >>>
>> >> >>> But
>> >> >>>
>> >> >>> a$Category[365]
>> >> >>>
>> >> >>> gives me:
>> >> >>>
>> >> >>> [1] 7 Plates (Dessert),Western\n120,5,0.0000023804434194784,7
>> >> >>> Plates (Dessert)
>> >> >>> 4464 Levels: Tags ... WOMEN
>> >> >>>
>> >> >>> There is something fundamental about either vectors or the read.csv
>> >> >>> command that I am missing here.
>> >> >>>
>> >> >>> Thank you.
>> >> >>>
>> >> >>> Kevin
>> >> >>>
>> >> >>> ---- jim holtman <[EMAIL PROTECTED]> wrote:
>> >> >>>> Please provide commented, minimal, self-contained, reproducible code,
>> >> >>>> or at least a before/after of what your data would look like. Taking
>> >> >>>> a guess at what you are asking, here is one way of doing it:
>> >> >>>>
>> >> >>>>
>> >> >>>>> x <- data.frame(cat=sample(LETTERS[1:3],20,TRUE),a=1:20,
>> >> >>>>> b=runif(20))
>> >> >>>>> x
>> >> >>>> cat a b
>> >> >>>> 1 B 1 0.65472393
>> >> >>>> 2 C 2 0.35319727
>> >> >>>> 3 B 3 0.27026015
>> >> >>>> 4 A 4 0.99268406
>> >> >>>> 5 C 5 0.63349326
>> >> >>>> 6 A 6 0.21320814
>> >> >>>> 7 C 7 0.12937235
>> >> >>>> 8 A 8 0.47811803
>> >> >>>> 9 A 9 0.92407447
>> >> >>>> 10 A 10 0.59876097
>> >> >>>> 11 A 11 0.97617069
>> >> >>>> 12 A 12 0.73179251
>> >> >>>> 13 B 13 0.35672691
>> >> >>>> 14 C 14 0.43147369
>> >> >>>> 15 C 15 0.14821156
>> >> >>>> 16 C 16 0.01307758
>> >> >>>> 17 B 17 0.71556607
>> >> >>>> 18 B 18 0.10318424
>> >> >>>> 19 C 19 0.44628435
>> >> >>>> 20 B 20 0.64010105
>> >> >>>>> # create a list of the indices of the data grouped by 'cat'
>> >> >>>>> split(seq(nrow(x)), x$cat)
>> >> >>>> $A
>> >> >>>> [1] 4 6 8 9 10 11 12
>> >> >>>>
>> >> >>>> $B
>> >> >>>> [1] 1 3 13 17 18 20
>> >> >>>>
>> >> >>>> $C
>> >> >>>> [1] 2 5 7 14 15 16 19
>> >> >>>>
>> >> >>>>> # or do you want the data
>> >> >>>>> split(x, x$cat)
>> >> >>>> $A
>> >> >>>> cat a b
>> >> >>>> 4 A 4 0.9926841
>> >> >>>> 6 A 6 0.2132081
>> >> >>>> 8 A 8 0.4781180
>> >> >>>> 9 A 9 0.9240745
>> >> >>>> 10 A 10 0.5987610
>> >> >>>> 11 A 11 0.9761707
>> >> >>>> 12 A 12 0.7317925
>> >> >>>>
>> >> >>>> $B
>> >> >>>> cat a b
>> >> >>>> 1 B 1 0.6547239
>> >> >>>> 3 B 3 0.2702601
>> >> >>>> 13 B 13 0.3567269
>> >> >>>> 17 B 17 0.7155661
>> >> >>>> 18 B 18 0.1031842
>> >> >>>> 20 B 20 0.6401010
>> >> >>>>
>> >> >>>> $C
>> >> >>>> cat a b
>> >> >>>> 2 C 2 0.35319727
>> >> >>>> 5 C 5 0.63349326
>> >> >>>> 7 C 7 0.12937235
>> >> >>>> 14 C 14 0.43147369
>> >> >>>> 15 C 15 0.14821156
>> >> >>>> 16 C 16 0.01307758
>> >> >>>> 19 C 19 0.44628435
>> >> >>>>
>> >> >>>>
>> >> >>>> On Sat, Jul 12, 2008 at 3:32 AM, <[EMAIL PROTECTED]> wrote:
>> >> >>>>> I have searched the archive and I could not find what I need so I
>> >> >>>>> will try to ask the question here.
>> >> >>>>>
>> >> >>>>> I read a table in (read.table)
>> >> >>>>>
>> >> >>>>> a <- read.table(.....)
>> >> >>>>>
>> >> >>>>> The table has column names like DayOfYear, Quantity, and Category.
>> >> >>>>>
>> >> >>>>> The values in the row for Category are strings (characters).
>> >> >>>>>
>> >> >>>>> I want to get all of the rows grouped by Category. The number of
>> >> >>>>> unique category names could be around 50. Say for argument's sake the
>> >> >>>>> number of categories is exactly 50. Can I somehow get a vector of
>> >> >>>>> length 50 containing the rows corresponding to the category
>> >> >>>>> (another vector)? I realize I can access any row a[i]$Category
>> >> >>>>> (right?). But I want a vector containing the rows corresponding to
>> >> >>>>> each distinct Category name.
>> >> >>>>>
>> >> >>>>> Thank you.
>> >> >>>>>
>> >> >>>>> Kevin
>> >> >>>>>
>> >> >>>>> ______________________________________________
>> >> >>>>> [email protected] mailing list
>> >> >>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>> >> >>>>> PLEASE do read the posting guide
>> >> >>>>> http://www.R-project.org/posting-guide.html
>> >> >>>>> and provide commented, minimal, self-contained, reproducible code.
>> >> >>>>>
>> >> >>>>
>> >> >>>> --
>> >> >>>> Jim Holtman
>> >> >>>> Cincinnati, OH
>> >> >>>> +1 513 646 9390
>> >> >>>>
>> >> >>>> What is the problem you are trying to solve?
>> >>
>> >
>> >
>>
>>
>>
>
>
--
Jim Holtman
Cincinnati, OH
+1 513 646 9390
What is the problem you are trying to solve?