[R] Transforming/appending data (words in IMDB)

aeneas24 Tue, 28 Sep 2010 20:27:52 -0700

Hi everyone,

 
I am doing an analysis of reviews in IMDB and am running into trouble getting 
my data into the right shape for analysis. 
 
Key question: I want to know for each word in the IMDB, whether it is over- or 
under-represented in a particular category (Rating x Genre). I was figuring on 
estimating this with a g-test, fwiw. But the basic question I'm asking here is 
about data transformation/appending. To go from these columns:


Film    | Genre1    | Genre2 | Genre3 | Reviewer | Rating  | Word | Word_ct

to these:

Word | Genre | Rating | Word_ct | Word_ct_in_genre | Word_ct_in_Rating | 
Expected_word_ct | G-test-score

The actual amount of data is enormous (I have 10 files of ~1.5 GB each) and I 
suspect I'm going to have to learn how to use the bigmemory package or 
something like it. But for now, I'd be happy if I could figure out how to do 
some basic calculations on a toy-sized subset of the data.
 
Broader goal: I am trying to do is see how words pattern in different reviews. 
For example, if someone uses "gross" in a describing a romance, they probably 
gave it a low rating. But someone else describing a horror movie with "gross" 
may think of it as a positive thing.  

Below I've put what I've tried so far, with annotations about my thinking. 
-------------------------------------

# Here is what a little portion of the data looks like:
mini<-read.csv("miniimdb.csv", header=F)
head(mini)
  Film    Genre1    Genre2 Genre3 Reviewer Rating  Word Word_ct
1   Up Animation Adventure Comedy    Kelly      9   the    22
2   Up Animation Adventure Comedy    Kelly      9   dog     5
3   Up Animation Adventure Comedy    Kelly      9   can    10
4   Up Animation Adventure Comedy    Kelly      9   wow     2
5 Fame     Drama     Music   <NA>     Jenn      8   the    15
6 Fame     Drama     Music   <NA>     Jenn      8 dance    12

# Each row indicates one word as used by one reviewer for one movie. In the 
first line, we see that Kelly used "the" 22 times in her review of "Up". The 
line gives us information about her review: she gave "Up" 9 stars (out of 10), 
22/32 people found her review helpful. "Up" is listed as multiple genres, which 
will become more inconvenient later. It would be easier if each row of "Up" 
occurred as three rows--identical except for "Genre". So that's one 
transformation I'm not sure how to do. 

# I also need to sum up, say, how many times any word appears in a given genre 
(and later how often it appears in a given rating category so I can do the 
conditional probabilities and get my observed vs. expected values). Here's one 
way I thought about doing it:
  
data.by.genre1<-split(mini,mini$Genre1)
data.by.genre1$Animation
   Film    Genre1    Genre2 Genre3 Reviewer Rating Word Word_ct
1    Up Animation Adventure Comedy    Kelly      9  the    22
2    Up Animation Adventure Comedy    Kelly      9  dog     5
3    Up Animation Adventure Comedy    Kelly      9  can    10
4    Up Animation Adventure Comedy    Kelly      9  wow     2
11   Up Animation Adventure Comedy    Eddie      4  the    12
12   Up Animation Adventure Comedy    Eddie      4  dog     4
13   Up Animation Adventure Comedy    Eddie      4  can     2
14   Up Animation Adventure Comedy    Eddie      4  boy     2

data.ani=data.by.genre1$Animation
ani.wc<-sum(data.ani$V10)
ani.wc
[1] 59
miniplus=cbind(mini,ani.wc)
head(miniplus)
  Film    Genre1    Genre2 Genre3 Reviewer Rating  Word Word_ct ani.wc
1   Up Animation Adventure Comedy    Kelly      9   the    22     59
2   Up Animation Adventure Comedy    Kelly      9   dog     5     59
3   Up Animation Adventure Comedy    Kelly      9   can    10     59
4   Up Animation Adventure Comedy    Kelly      9   wow     2     59
5 Fame     Drama     Music   <NA>     Jenn      8   the    15     59
6 Fame     Drama     Music   <NA>     Jenn      8 dance    12     59

# That's one approach, but it seems REALLY inefficient since I would have to 
have add a new column specific to each genre, whether it was appropriate for 
the film or not. Also, sometimes a genre is listed as the first genre, 
sometimes as the second or third. I actually want to report the total of all 
"the"'s in Animation, whether Animation is listed as the first, second, or 
third genre. I'm not sure how to do that. 

# I've also tried to use the reshape library. That gets me a little closer to 
the kind of data I want, but I'm not really sure how to append it smartly. 

library(reshape)

# To use "melt", I seem to need to shave off the columns I'm not interested in 
(film title and reviewer).

mininew<-cbind(mini[2:4],mini[6:8])
head(melt(mininew,id=c("Word", "Genre1", "Genre2", "Genre3", "Rating"), 
measured=c("Word_Ct")))
    Word    Genre1    Genre2   Genre3 Rating variable value
1    the Animation Adventure   Comedy      9    Word_ct 22
2    dog Animation Adventure   Comedy      9    Word_ct 5
3    can Animation Adventure   Comedy      9    Word_ct 10
4    wow Animation Adventure   Comedy      9    Word_ct 2
5    the     Drama     Music     <NA>      8    Word_ct 15
6  dance     Drama     Music     <NA>      8    Word_ct 12

# This almost does what I want it to, except that it is only looking at the 
first genre tag--I'd like to combine all "the" for Animation, whether 
"Animation" is listed as the first, second, or third genre tag. 

# Using cast gets me even closer.  
cast(mininew, Genre1 ~ Word, sum)
Using Word_ct as value column.  Use the value argument to cast to override this 
choice
     Genre1 bank boy can dance dog great heist shit the wow
1 Animation    0   2  12     0   9     0     0    0  34   2
2     Crime    3   0   2     0   0     0     3    0  12   0
3     Drama    0   0   0    16   0     4     0    4  25   0

# Now that I know that there are 34 "the"'s in Animation, how do I relate that 
back to my main data set? In order to figure out whether words are over- or 
under-represented, I need to be able to figure out what we would have expected 
at chance and then compare the observed to the expected. Again, I was planning 
on using a g-test for this. And I was planning on reporting it per line, with 
each line being a word in a particular genre in a particular rating category 
(1-10 stars). 

# Again, my ultimate goal is to report observed and expected word counts for 
each word in each Genre in each Rating (1-10):
Word | Genre | Rating | Word_ct | Word_ct_in_genre | Word_ct_in_Rating | 
Expected_word_ct | G-test-score

Any suggestions would be greatly appreciated. 

Thanks,

Tyler





        [[alternative HTML version deleted]]

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

[R] Transforming/appending data (words in IMDB)

Reply via email to