On Feb 9, 2010, at 11:24 AM, Alex Levitchi wrote:

Hello
I recently began working with R, so I am not very experienced.
However, I cannot find a clear way to process my data frame, which is a fairly large one.
It looks similar to this:

name=c("A","B","C","B","C","C","C","B","C")
nicknames=c("A1","B1","C1","B2","C2","C3","C4","B3","C5")
value=c(4,5,9,2,7,6,3,6,7)
table=data.frame(cbind(name,nicknames,value))
table
name nicknames value
1 A A1 4
2 B B1 5
3 C C1 9
4 B B2 2
5 C C2 7
6 C C3 6
7 C C4 3
8 B B3 6
9 C C5 7

I need to rearrange it as follows:
- the first column should contain only the unique names. I have done this already, and it looks like
1 A
2 B
3 C

- the second column should contain all the 'nicknames' that correspond to each of A, B or C
name nickname value
1 A A1
2 B B1,B2,B3
3 C C1,C2,C3,C4,C5

Data frames are not designed to hold irregular-length items. Lists are the data structure best suited for this type of data. tapply() is one function useful for collecting elements of one structure based on the contents of another ("name"):

(I renamed your table object "table1" to avoid confusion with the table function.)

> tapply(table1$nicknames, table1$name, list)
$A
[1] A1
Levels: A1 B1 B2 B3 C1 C2 C3 C4 C5

$B
[1] B1 B2 B3
Levels: A1 B1 B2 B3 C1 C2 C3 C4 C5

$C
[1] C1 C2 C3 C4 C5
Levels: A1 B1 B2 B3 C1 C2 C3 C4 C5

The values come back as factors (data.frame() converted the character columns when the table was built), which some would see as a good thing, but perhaps was not desired. Since you now have a list, you can apply as.character() to each element to recover plain character vectors:

> lapply(tapply(table1$nicknames, table1$name, list), as.character)
$A
[1] "A1"

$B
[1] "B1" "B2" "B3"

$C
[1] "C1" "C2" "C3" "C4" "C5"
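
split() is another way to reach the same list of character vectors. This is only a sketch: it assumes table1 is rebuilt as below, and stringsAsFactors = FALSE is an addition not in the original post, so that the nicknames stay character and no as.character() step is needed. The name nick_list is made up for illustration.

```r
# Rebuild the example data; table1 is the renamed copy of the poster's table.
table1 <- data.frame(
  name      = c("A", "B", "C", "B", "C", "C", "C", "B", "C"),
  nicknames = c("A1", "B1", "C1", "B2", "C2", "C3", "C4", "B3", "C5"),
  value     = c(4, 5, 9, 2, 7, 6, 3, 6, 7),
  stringsAsFactors = FALSE
)

# split() groups one vector by the values of another and returns a named list.
nick_list <- split(table1$nicknames, table1$name)

nick_list$B
# [1] "B1" "B2" "B3"
```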

Then I saw the rest of your request, so forget the above and see if this two-liner looks a bit simpler.

> tcollapse <- tapply(table1$nicknames, table1$name, paste, collapse=", ")
#gets you the strings separated by commas and spaces.

> cbind(names(tcollapse), tcollapse, lapply( tapply(table1$nicknames, table1$name, list), length) )
      tcollapse
A "A" "A1"                 1
B "B" "B1, B2, B3"         3
C "C" "C1, C2, C3, C4, C5" 5

You can obviously name them whatever you like.
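
The same tapply() pattern should extend to the mean column as well. The following is a sketch rather than a tested recipe for your real data: it assumes table1 is built as below (stringsAsFactors = FALSE is an assumption added here), and the names value_num, nick, n, avg and result are made up for illustration.

```r
table1 <- data.frame(
  name      = c("A", "B", "C", "B", "C", "C", "C", "B", "C"),
  nicknames = c("A1", "B1", "C1", "B2", "C2", "C3", "C4", "B3", "C5"),
  value     = c(4, 5, 9, 2, 7, 6, 3, 6, 7),
  stringsAsFactors = FALSE
)

# Convert via as.character() so this also works if value was stored as a factor.
value_num <- as.numeric(as.character(table1$value))

# One tapply() call per output column; all three use the same grouping,
# so the results come back in the same (sorted) order of names.
nick <- tapply(table1$nicknames, table1$name, paste, collapse = ", ")
n    <- tapply(table1$nicknames, table1$name, length)
avg  <- tapply(value_num, table1$name, mean)

# as.vector() drops the array attributes that tapply() attaches.
result <- data.frame(
  NAME     = names(nick),
  NICKNAME = as.vector(nick),
  N        = as.vector(n),
  VALUE    = as.vector(avg),
  stringsAsFactors = FALSE
)
result
```

There is no explicit loop over rows here, so it may scale better than the for loop, though that is worth checking on your own data.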

--
David

- the third one should contain the mean of the values that correspond to the same A, B or C
1 A A1 mean(4)
2 B B1,B2,B3 mean(5,2,6)
3 C C1,C2,C3,C4,C5 mean(9,7,6,3,7)

I did this using a 'for' loop.
To be clear, I created three data frames, one for each of the columns, and combine them at the end.

ulist=which(!duplicated(table$name)) # I extract the list of positions in which I don't have duplications
name1=data.frame(table$name[ulist]) # I extract the list of unique names
nicknames1=data.frame(row.names(1:length(ulist))) # I create a dataframe of dimension equal to unique list length
value1=data.frame(row.names(1:length(ulist))) # I create a dataframe of dimension equal to unique list length

for(i in 1:length(ulist)) {
  position=which(as.character(name1[i,1])==table$name)
  nicknames1[i,1]=toString(table$nicknames[position])
  # as.character() first: as.numeric() on a factor returns level codes, not the values
  value1[i,1]=mean(as.numeric(as.character(table$value[position])))
}
fin=cbind(name1,nicknames1,value1)
colnames(fin)=c("NAME","NICKNAME","VALUE")
fin
NAME NICKNAME VALUE
1 A A1 4.000000
2 B B1, B2, B3 4.333333
3 C C1, C2, C3, C4, C5 6.400000
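
A note on the as.numeric() step, assuming value ended up as a factor (which data.frame(cbind(...)) produced by default in R versions before 4.0): as.numeric() applied directly to a factor returns the internal level codes, not the numbers that are printed. A small sketch with the same nine values, where the factor is created explicitly to mimic that situation:

```r
# Force a factor to mimic a value column that was converted from strings.
value <- factor(c(4, 5, 9, 2, 7, 6, 3, 6, 7))

# Direct conversion gives the position of each value among the sorted levels.
as.numeric(value)
# [1] 3 4 7 1 6 5 2 5 6

# Going through as.character() recovers the numbers as printed.
as.numeric(as.character(value))
# [1] 4 5 9 2 7 6 3 6 7
```

A mean taken over the level codes for A, B and C would be 3, 3.333333 and 5.2, while the mean of the actual values is 4, 4.333333 and 6.4.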

It works. But in general I work with large data frames (tens of thousands of rows or more), and my loop runs too slowly (for example, a data frame of 20000 rows and 3 columns is processed in about 10 minutes). I intend to integrate it into a function, so the time will obviously be even longer.

If someone can suggest how to modify what I have done, or a better way to do it, please send me a message.
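
Since the goal is to put this inside a function, here is one possible sketch that avoids the row-by-row loop. It splits the row indices by name once and reuses those groups for both columns. The function name collapse_by_name and the synthetic data frame big are hypothetical, made up only to mimic the 20000-row case; the column names name, nicknames and value are assumed to match the example.

```r
# Hypothetical helper: one row per name, nicknames collapsed, values averaged.
collapse_by_name <- function(df) {
  # as.character() guards against factor columns (level codes vs. labels).
  nick <- as.character(df$nicknames)
  val  <- as.numeric(as.character(df$value))

  # Split the row indices once, then reuse the groups for both columns.
  idx <- split(seq_len(nrow(df)), as.character(df$name))

  data.frame(
    NAME     = names(idx),
    NICKNAME = unname(vapply(idx, function(i) toString(nick[i]), character(1))),
    VALUE    = unname(vapply(idx, function(i) mean(val[i]), numeric(1))),
    stringsAsFactors = FALSE
  )
}

# Synthetic data of roughly the size mentioned (20000 rows), for illustration only.
set.seed(1)
big <- data.frame(
  name      = sample(sprintf("N%04d", 1:5000), 20000, replace = TRUE),
  nicknames = sprintf("K%05d", 1:20000),
  value     = sample(1:9, 20000, replace = TRUE),
  stringsAsFactors = FALSE
)

fin <- collapse_by_name(big)
head(fin)
```

Wrapping the call in system.time() gives a direct comparison with the loop on your own data.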

Kind regards to everyone who develops and maintains R for dummies like me
Alex Levitchi




______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

David Winsemius, MD
Heritage Laboratories
West Hartford, CT
