On Feb 9, 2010, at 11:24 AM, Alex Levitchi wrote:
Hello
I recently began working with R, so I am not very experienced.
But I cannot find a clear way to process my data frame, which
is a fairly big one.
It looks similar to this:
name=c("A","B","C","B","C","C","C","B","C")
nicknames=c("A1","B1","C1","B2","C2","C3","C4","B3","C5")
value=c(4,5,9,2,7,6,3,6,7)
table=data.frame(cbind(name,nicknames,value))
table
  name nicknames value
1    A        A1     4
2    B        B1     5
3    C        C1     9
4    B        B2     2
5    C        C2     7
6    C        C3     6
7    C        C4     3
8    B        B3     6
9    C        C5     7
So I have to rearrange it in the following way:
- the first column should contain just the unique names; I did
this, it is OK, and it looks like
1 A
2 B
3 C
- the second column should contain different 'nicknames' which
correspond to the single A, B or C
name nickname value
1 A A1
2 B B1,B2,B3
3 C C1,C2,C3,C4,C5
Data frames are not designed to hold items of irregular length. Lists are
the data structure best suited to this type of data. tapply() is one
function useful for collecting elements of one structure based on the
contents of another ("name"):
(I renamed your table object "table1" to avoid confusion with the
table function.)
> tapply(table1$nicknames, table1$name, list)
$A
[1] A1
Levels: A1 B1 B2 B3 C1 C2 C3 C4 C5
$B
[1] B1 B2 B3
Levels: A1 B1 B2 B3 C1 C2 C3 C4 C5
$C
[1] C1 C2 C3 C4 C5
Levels: A1 B1 B2 B3 C1 C2 C3 C4 C5
The process of tabulating has created factor variables, which some
would see as a good thing, but which perhaps was not desired. Since you now
have a list, you can apply the as.character function to each element to
recover only the character vectors:
> lapply( tapply(table1$nicknames, table1$name, list), as.character)
$A
[1] "A1"
$B
[1] "B1" "B2" "B3"
$C
[1] "C1" "C2" "C3" "C4" "C5"
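A possible shortcut to the same list, sketched here on a rebuilt copy of the example data: convert the nicknames to character first and let split() do the grouping, so no factor levels are carried along. The object name by_name is only an illustrative choice.

```r
# Rebuild the two relevant columns of the example data.
table1 <- data.frame(
  name      = c("A","B","C","B","C","C","C","B","C"),
  nicknames = c("A1","B1","C1","B2","C2","C3","C4","B3","C5")
)

# as.character() is harmless if the column is already character and
# drops the levels if it is a factor; split() then groups by name.
by_name <- split(as.character(table1$nicknames), table1$name)
by_name
```

The result is a named list with one character vector per name, which can be indexed as by_name$B or by_name[["C"]].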
Then I saw the rest of your request, so forget the above and see if
this two-liner looks a bit simpler.
> tcollapse <- tapply(table1$nicknames, table1$name, paste,
+                     collapse=", ")
# gets you the strings separated by commas and spaces.
> cbind(names(tcollapse), tcollapse, lapply( tapply(table1$nicknames,
+       table1$name, list), length) )
tcollapse
A "A" "A1" 1
B "B" "B1, B2, B3" 3
C "C" "C1, C2, C3, C4, C5" 5
You can obviously name them whatever you like.
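If the third column should hold the mean of value rather than a count, the same tapply() pattern may cover it. What follows is only a sketch: the names nick, avg and fin and the output column names are illustrative choices, and it assumes the data frame is built with data.frame() directly (not through cbind()), so that value stays numeric.

```r
# Rebuild the example data; without cbind() the 'value' column stays numeric.
table1 <- data.frame(
  name      = c("A","B","C","B","C","C","C","B","C"),
  nicknames = c("A1","B1","C1","B2","C2","C3","C4","B3","C5"),
  value     = c(4,5,9,2,7,6,3,6,7),
  stringsAsFactors = FALSE
)

# One grouped pass per column: collapse the nicknames, average the values.
nick <- tapply(table1$nicknames, table1$name, paste, collapse = ", ")
avg  <- tapply(table1$value, table1$name, mean)

# Both results come back in the same sorted group order,
# so they can be combined position by position.
fin <- data.frame(NAME     = names(nick),
                  NICKNAME = as.vector(nick),
                  VALUE    = as.vector(avg),
                  stringsAsFactors = FALSE)
fin
```

Because the grouping is done once per column instead of calling which() for every unique name, this should scale far better than a row-by-row loop, though I have not timed it on a large data frame.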
--
David
- the third column should contain the mean of the values which
correspond to the same A, B or C
1 A A1 mean(4)
2 B B1,B2,B3 mean(5,2,6)
3 C C1,C2,C3,C4,C5 mean(9,7,6,3,7)
I did this using a 'for' loop.
To be clear, I created three data frames, one for each of the
columns, and finally combine them:
# positions of the first occurrence of each name (no duplicates)
ulist=which(!duplicated(table$name))
# the unique names
name1=data.frame(table$name[ulist])
# placeholder data frame for the collapsed nicknames, filled in the loop
nicknames1=data.frame(row.names(1:length(ulist)))
# placeholder data frame for the mean values, filled in the loop
value1=data.frame(row.names(1:length(ulist)))
for(i in 1:length(ulist)) {
  position=which(as.character(name1[i,1])==table$name)
  nicknames1[i,1]=toString(table$nicknames[position])
  # go through as.character() first: cbind() stored value as text (a factor
  # in older R), and as.numeric() on a factor returns level codes, not values
  value1[i,1]=mean(as.numeric(as.character(table$value[position])))
}
fin=cbind(name1,nicknames1,value1)
colnames(fin)=c("NAME","NICKNAME","VALUE")
fin
  NAME           NICKNAME    VALUE
1    A                 A1 4.000000
2    B         B1, B2, B3 4.333333
3    C C1, C2, C3, C4, C5 6.400000
It works, but in general I work with large data frames
(tens of thousands of rows or more),
so my loop runs too slowly (e.g., a data frame of 20000 rows and 3
columns is processed in about 10 minutes).
I intend to integrate it into a function, so the time
will obviously be even longer.
If someone can suggest a way to modify what I have done,
or a better way to do it, please send me a message.
Kind regards to all who develop and maintain the R sources for such
dummies as me
Alex Levitchi
______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
David Winsemius, MD
Heritage Laboratories
West Hartford, CT