So.. here we try again. ##generate dataset age.cat<-seq(0,100,10) year<-(1953:(1953+55)) data.vec<-sample(10000:10000,(age.cat*year)) data.matrix<-matrix(data.vec,c(length(age.cat),length(year)) rownames(data.matrix)<-age.cat colnames(data.matrix)<-year
##divide into 5 year periods age.div<-cut(year,seq(1950,2010,6),include.lowest=T) ##interval is beyond my datainterval so I doubt the include.lowest matters Now what I'd like to do is summarise the rows within the 5-year intervals. I did read about apply in its different variants and Dahlgaard, but I do not know understand how it could be applied in this setting. I tried making an array and summarise by that (used the vector and applied it into a length(age.cat)*max(vector(table(age.div)*length(age.div) array. It worked but required a bit of tweaking (inserting null columns) and I find myself in this situation quite often whereby I need to add multiple columns based on another vector so I'd be very interested in another more general approach. //M On Tue, Apr 6, 2010 at 9:41 PM, David Winsemius <dwinsem...@comcast.net> wrote: > > On Apr 6, 2010, at 3:30 PM, David Winsemius wrote: > >> >> On Apr 6, 2010, at 9:56 AM, moleps islon wrote: >> >>> OK... next question.. Which is still a data manipulation problem so I >>> believe the heading is still OK. >>> >>> ##So now I read my population data from excel. >> >> No, you read it from a text file and providing the first ten lines of that >> text file should have been really easy. Read the Posting Guide for advice >> about offering datasets either as structure() objects with dput or dump or >> as attached files with "*.txt" extension (not .csv). Just change the file >> name with your file browser. >> >>> pop<-read.csv("pop.csv") >>> >>> typeof(pop) ## yields a list >> >> Really? I would have guessed it to yield just "list". >> >>> where I have age-specific population rows >>> and a yearly column population, where the years are suffixed by X >> >> And had you used class(pop) you would have learned it was a dataframe and >> even more informative would have been str(pop). >>> >>> c<-(1953:2008) >> >> No, no, no. Do not use variable names that are important function names. >> The R interpreter can (usually) keep things straight but it is our brains >> that experience problems. Other function names to avoid: data, df, cut, >> mean, sd, list, vector, matrix >> >>> names(pop)<-c >>> c.div<-cut(c,break=seq(1950,2010,by=5) >> >> (You should have gotten an error here.) After fixing the error, did you >> you notice that there were only 3 of the first level??? >> >> Watch out for cut(). It uses the default convention of ( , ] , i.e. open >> interval at right > > er, > ^left^ > >> which is backwards to what some (most?) of us think natural. Because of >> that the lowest level gets dropped unless you take special precautions. >> That is undoubtedly why Harrell set up his Hmisc::cut2 to have the default >> be [ , ) >> >> Aggregating across columns? Certainly possible, but maybe not as natural a >> fit to functions like split as would occur with working across rows. I >> suppose you could use something like this untested (because _still_ no >> sample dataset provided) code: >> >> apply(pop, 1, # this works a row a time >> function(x) tapply(x, list(c.div), sum) ) ) # or use aggregate which >> uses tapply >> >> I'm not sure it will work, since I don't know if the column names would >> get carried over into "x" by apply(). You might need to create a separate >> index that used the numeric positions of the columns rather than their >> names. Perhaps use c.div <- seq(0,(2008-1953)) %/% 5 or some such inside >> tapply. >> >>> >>> Now I'd like to sum the agespecific population over the individual >>> levels of -c.div- and generate a new table for this with agespecific >>> rows and columns containing the 5-year bins instead of the original >>> yearly data. Do I have to program this from scratch or is it possible >>> to use an already existing function? >> >> I think you ought to read more introductory material (and the Posting >> Guide regarding how to offer example datasets). In this case there are many >> functions that do data aggregation and most of them should be illustrated in >> a good introductory text. >> >> -- >> David. >>> >>> >>> //M >>> >>> qta<- table(cut(age,breaks = seq(0, 100, by = 10),include.lowest = >>> TRUE),cut(year,breaks=seq(1950,2010,by=5),include.lowest=TRUE >>> >>> On Mon, Apr 5, 2010 at 10:11 PM, moleps <mole...@gmail.com> wrote: >>>> >>>> Thx Erik, >>>> I have no idea what went wrong with the other code snippet, but this one >>>> works.. Appreciate it. >>>> >>>> qta<- table(cut(age,breaks = seq(0, 100, by = 10),include.lowest = >>>> TRUE),cut(year,breaks=seq(1950,2010,by=5),include.lowest=TRUE)) >>>> >>>> M >>>> >>>> >>>> On 5. apr. 2010, at 21.45, Erik Iverson wrote: >>>> >>>>> I don't know what your data are like, since you haven't given a >>>>> reproducible example. I was imagining something like: >>>>> >>>>> ## generate fake data >>>>> age <- sample(20:90, 100, replace = TRUE) >>>>> year <- sample(1950:2000, 100, replace = TRUE) >>>>> >>>>> ##look at big table >>>>> table(age, year) >>>>> >>>>> ## categorize data >>>>> ## see include.lowest and right arguments to cut >>>>> age.factor <- cut(age, breaks = seq(20, 90, by = 10), >>>>> include.lowest = TRUE) >>>>> >>>>> year.factor <- cut(year, breaks = seq(1950, 2000, by = 10), >>>>> include.lowest = TRUE) >>>>> >>>>> table(age.factor, year.factor) >>>>> >>>>> moleps wrote: >>>>>> >>>>>> I already did try the regression modeling approach. However the >>>>>> epidemiologists (referee) turns out to be quite fond of comparing the >>>>>> incidence rates to different standard populations, hence the need for >>>>>> this >>>>>> labourius approach. And trying the "cutting" approach I ended up with : >>>>>>> >>>>>>> table (age5) >>>>>> >>>>>> age5 >>>>>> (0,5] (5,10] (10,15] (15,20] (20,25] (25,30] (30,35] (35,40] >>>>>> (40,45] (45,50] (50,55] (55,60] (60,65] (65,70] (70,75] (75,80] >>>>>> (80,85] (85,100] 35 34 33 47 51 109 >>>>>> 157 231 362 511 745 926 1002 866 >>>>>> 547 >>>>>> 247 82 18 >>>>>>> >>>>>>> table (yr5) >>>>>> >>>>>> yr5 >>>>>> (1950,1955] (1955,1960] (1960,1965] (1965,1970] (1970,1975] >>>>>> (1975,1980] (1980,1985] (1985,1990] (1990,1995] (1995,2000] (2000,2005] >>>>>> (2005,2009] 3 5 5 5 5 >>>>>> 5 5 5 5 5 5 >>>>>> 3 >>>>>>> >>>>>>> table (yr5,age5) >>>>>> >>>>>> Error in table(yr5, age5) : all arguments must have the same length >>>>>> Sincerely, >>>>>> M >>>>>> On 5. apr. 2010, at 20.59, Bert Gunter wrote: >>>>>>> >>>>>>> You have tempted, and being weak, I yield to temptation: >>>>>>> >>>>>>> "Any good ideas?" >>>>>>> >>>>>>> Yes. Don't do this. >>>>>>> >>>>>>> (what you probably really want to do is fit a model with age as a >>>>>>> factor, >>>>>>> which can be done statistically e.g. by logistic regression; or >>>>>>> graphically >>>>>>> using conditioning plots, e.g. via trellis graphics (the lattice >>>>>>> package). >>>>>>> This avoids the arbitrariness and discontinuities of binning by age >>>>>>> range.) >>>>>>> >>>>>>> Bert Gunter >>>>>>> Genentech Nonclinical Biostatistics >>>>>>> >>>>>>> -----Original Message----- >>>>>>> From: r-help-boun...@r-project.org >>>>>>> [mailto:r-help-boun...@r-project.org] On >>>>>>> Behalf Of moleps >>>>>>> Sent: Monday, April 05, 2010 11:46 AM >>>>>>> To: r-help@r-project.org >>>>>>> Subject: [R] Data manipulation problem >>>>>>> >>>>>>> Dear R´ers. >>>>>>> >>>>>>> I´ve got a dataset with age and year of diagnosis. In order to >>>>>>> age-standardize the incidence I need to transform the data into a >>>>>>> matrix >>>>>>> with age-groups (divided in 5 or 10 years) along one axis and year >>>>>>> divided >>>>>>> into 5 years along the other axis. Each cell should contain the >>>>>>> number of >>>>>>> cases for that age group and for that period. >>>>>>> I.e. >>>>>>> My data format now is >>>>>>> ID-age (to one decimal)-year(yearly data). >>>>>>> >>>>>>> What I´d like is >>>>>>> >>>>>>> age 1960-1965 1966-1970 etc... >>>>>>> 0-5 3 8 10 15 >>>>>>> 6-10 2 5 8 13 >>>>>>> etc.. >>>>>>> >>>>>>> >>>>>>> Any good ideas? >>>>>>> >>>>>>> Regards, >>>>>>> M >>> >> > > David Winsemius, MD > West Hartford, CT > > On Tue, Apr 6, 2010 at 9:41 PM, David Winsemius <dwinsem...@comcast.net> wrote: > > On Apr 6, 2010, at 3:30 PM, David Winsemius wrote: > >> >> On Apr 6, 2010, at 9:56 AM, moleps islon wrote: >> >>> OK... next question.. Which is still a data manipulation problem so I >>> believe the heading is still OK. >>> >>> ##So now I read my population data from excel. >> >> No, you read it from a text file and providing the first ten lines of that >> text file should have been really easy. Read the Posting Guide for advice >> about offering datasets either as structure() objects with dput or dump or >> as attached files with "*.txt" extension (not .csv). Just change the file >> name with your file browser. >> >>> pop<-read.csv("pop.csv") >>> >>> typeof(pop) ## yields a list >> >> Really? I would have guessed it to yield just "list". >> >>> where I have age-specific population rows >>> and a yearly column population, where the years are suffixed by X >> >> And had you used class(pop) you would have learned it was a dataframe and >> even more informative would have been str(pop). >>> >>> c<-(1953:2008) >> >> No, no, no. Do not use variable names that are important function names. >> The R interpreter can (usually) keep things straight but it is our brains >> that experience problems. Other function names to avoid: data, df, cut, >> mean, sd, list, vector, matrix >> >>> names(pop)<-c >>> c.div<-cut(c,break=seq(1950,2010,by=5) >> >> (You should have gotten an error here.) After fixing the error, did you >> you notice that there were only 3 of the first level??? >> >> Watch out for cut(). It uses the default convention of ( , ] , i.e. open >> interval at right > > er, > ^left^ > >> which is backwards to what some (most?) of us think natural. Because of >> that the lowest level gets dropped unless you take special precautions. >> That is undoubtedly why Harrell set up his Hmisc::cut2 to have the default >> be [ , ) >> >> Aggregating across columns? Certainly possible, but maybe not as natural a >> fit to functions like split as would occur with working across rows. I >> suppose you could use something like this untested (because _still_ no >> sample dataset provided) code: >> >> apply(pop, 1, # this works a row a time >> function(x) tapply(x, list(c.div), sum) ) ) # or use aggregate which >> uses tapply >> >> I'm not sure it will work, since I don't know if the column names would >> get carried over into "x" by apply(). You might need to create a separate >> index that used the numeric positions of the columns rather than their >> names. Perhaps use c.div <- seq(0,(2008-1953)) %/% 5 or some such inside >> tapply. >> >>> >>> Now I'd like to sum the agespecific population over the individual >>> levels of -c.div- and generate a new table for this with agespecific >>> rows and columns containing the 5-year bins instead of the original >>> yearly data. Do I have to program this from scratch or is it possible >>> to use an already existing function? >> >> I think you ought to read more introductory material (and the Posting >> Guide regarding how to offer example datasets). In this case there are many >> functions that do data aggregation and most of them should be illustrated in >> a good introductory text. >> >> -- >> David. >>> >>> >>> //M >>> >>> qta<- table(cut(age,breaks = seq(0, 100, by = 10),include.lowest = >>> TRUE),cut(year,breaks=seq(1950,2010,by=5),include.lowest=TRUE >>> >>> On Mon, Apr 5, 2010 at 10:11 PM, moleps <mole...@gmail.com> wrote: >>>> >>>> Thx Erik, >>>> I have no idea what went wrong with the other code snippet, but this one >>>> works.. Appreciate it. >>>> >>>> qta<- table(cut(age,breaks = seq(0, 100, by = 10),include.lowest = >>>> TRUE),cut(year,breaks=seq(1950,2010,by=5),include.lowest=TRUE)) >>>> >>>> M >>>> >>>> >>>> On 5. apr. 2010, at 21.45, Erik Iverson wrote: >>>> >>>>> I don't know what your data are like, since you haven't given a >>>>> reproducible example. I was imagining something like: >>>>> >>>>> ## generate fake data >>>>> age <- sample(20:90, 100, replace = TRUE) >>>>> year <- sample(1950:2000, 100, replace = TRUE) >>>>> >>>>> ##look at big table >>>>> table(age, year) >>>>> >>>>> ## categorize data >>>>> ## see include.lowest and right arguments to cut >>>>> age.factor <- cut(age, breaks = seq(20, 90, by = 10), >>>>> include.lowest = TRUE) >>>>> >>>>> year.factor <- cut(year, breaks = seq(1950, 2000, by = 10), >>>>> include.lowest = TRUE) >>>>> >>>>> table(age.factor, year.factor) >>>>> >>>>> moleps wrote: >>>>>> >>>>>> I already did try the regression modeling approach. However the >>>>>> epidemiologists (referee) turns out to be quite fond of comparing the >>>>>> incidence rates to different standard populations, hence the need for >>>>>> this >>>>>> labourius approach. And trying the "cutting" approach I ended up with : >>>>>>> >>>>>>> table (age5) >>>>>> >>>>>> age5 >>>>>> (0,5] (5,10] (10,15] (15,20] (20,25] (25,30] (30,35] (35,40] >>>>>> (40,45] (45,50] (50,55] (55,60] (60,65] (65,70] (70,75] (75,80] >>>>>> (80,85] (85,100] 35 34 33 47 51 109 >>>>>> 157 231 362 511 745 926 1002 866 >>>>>> 547 >>>>>> 247 82 18 >>>>>>> >>>>>>> table (yr5) >>>>>> >>>>>> yr5 >>>>>> (1950,1955] (1955,1960] (1960,1965] (1965,1970] (1970,1975] >>>>>> (1975,1980] (1980,1985] (1985,1990] (1990,1995] (1995,2000] (2000,2005] >>>>>> (2005,2009] 3 5 5 5 5 >>>>>> 5 5 5 5 5 5 >>>>>> 3 >>>>>>> >>>>>>> table (yr5,age5) >>>>>> >>>>>> Error in table(yr5, age5) : all arguments must have the same length >>>>>> Sincerely, >>>>>> M >>>>>> On 5. apr. 2010, at 20.59, Bert Gunter wrote: >>>>>>> >>>>>>> You have tempted, and being weak, I yield to temptation: >>>>>>> >>>>>>> "Any good ideas?" >>>>>>> >>>>>>> Yes. Don't do this. >>>>>>> >>>>>>> (what you probably really want to do is fit a model with age as a >>>>>>> factor, >>>>>>> which can be done statistically e.g. by logistic regression; or >>>>>>> graphically >>>>>>> using conditioning plots, e.g. via trellis graphics (the lattice >>>>>>> package). >>>>>>> This avoids the arbitrariness and discontinuities of binning by age >>>>>>> range.) >>>>>>> >>>>>>> Bert Gunter >>>>>>> Genentech Nonclinical Biostatistics >>>>>>> >>>>>>> -----Original Message----- >>>>>>> From: r-help-boun...@r-project.org >>>>>>> [mailto:r-help-boun...@r-project.org] On >>>>>>> Behalf Of moleps >>>>>>> Sent: Monday, April 05, 2010 11:46 AM >>>>>>> To: r-help@r-project.org >>>>>>> Subject: [R] Data manipulation problem >>>>>>> >>>>>>> Dear R´ers. >>>>>>> >>>>>>> I´ve got a dataset with age and year of diagnosis. In order to >>>>>>> age-standardize the incidence I need to transform the data into a >>>>>>> matrix >>>>>>> with age-groups (divided in 5 or 10 years) along one axis and year >>>>>>> divided >>>>>>> into 5 years along the other axis. Each cell should contain the >>>>>>> number of >>>>>>> cases for that age group and for that period. >>>>>>> I.e. >>>>>>> My data format now is >>>>>>> ID-age (to one decimal)-year(yearly data). >>>>>>> >>>>>>> What I´d like is >>>>>>> >>>>>>> age 1960-1965 1966-1970 etc... >>>>>>> 0-5 3 8 10 15 >>>>>>> 6-10 2 5 8 13 >>>>>>> etc.. >>>>>>> >>>>>>> >>>>>>> Any good ideas? >>>>>>> >>>>>>> Regards, >>>>>>> M >>> >> > > David Winsemius, MD > West Hartford, CT > > ______________________________________________ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.