[R] working with texts

Helter Two Thu, 02 Jul 2009 13:03:13 -0700

WinXP, R-2.9.1

LS.,


I have been trying to solve an issue that is (for me) tricky. No matter
what I have tried, I can't find a way to do it.
This is the issue:

I have a text file (ANSI text), "titles.txt", with lines of text; here
is an example of such a file:

>>>>>
a brief history of polio vaccines
anti-vaccination movements and their interpretations
early warning in the light of theories of technological change
international mobility among nordic doctoral students
land of hope and glory: exploring cochlear implantation in the netherlands
making science - between nature and society
medical innovations in historical-perspective
photographing medicine - images and power in britain and america since 1840
shifts in global immunisation goals (1984-2004): unfinished agendas and mixed results
striking the mother lode in science - the importance of age, place, and time
technology assessment and the sociopolitics of health technologies
the policy of science and technology - evolution of research policy - france, the united-kingdom, the federal-republic-of-germany, japan, the united-states - french
vaccine independence, local competences and globalisation: lessons from the history of pertussis vaccines
external assessment and conditional financing of research in dutch universities
histories of cochlear implantation
lock in, the state and vaccine development: lessons from the history of the polio vaccines
peerless science - peer-review and united-states science policy
technology, science, and obstetric practice - the origins and transformation of cephalopelvimetry
the rhetoric and counter-rhetoric of a ''bionic'' technology
vaccine innovation and adoption: polio vaccines in the uk, the netherlands and west germany, 1955-1965
<<<<<

Some of the lines in such a file are very long (not in this example).
The file contains titles and abstracts of scientific articles.

In addition to this file, I also have a file "words.txt" that includes a
set of words I want to analyze. Part of this file:
>>>>>
technology
technological
innovations
science
policy
society
history
<<<<<
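For reference, a minimal sketch of how two such files might be read into
R, one element per line. The helper name read_lines_clean and the
temporary files are made up for illustration; they stand in for the real
"titles.txt" and "words.txt".

```r
# Hypothetical helper (not part of any package): read a text file as a
# character vector with one element per line, stripping leading/trailing
# whitespace and dropping empty lines.
read_lines_clean <- function(path) {
  x <- readLines(path, warn = FALSE)
  x <- gsub("^[[:space:]]+|[[:space:]]+$", "", x)
  x[nzchar(x)]
}

# Self-contained demo: temporary files stand in for the real ones.
titles_file <- tempfile(fileext = ".txt")
words_file  <- tempfile(fileext = ".txt")
writeLines(c("a brief history of polio vaccines",
             "making science - between nature and society"),
           titles_file)
writeLines(c("science", "history", ""), words_file)

titles <- read_lines_clean(titles_file)
words  <- read_lines_clean(words_file)

length(titles)  # number of title lines read
length(words)   # number of words read (the empty line is dropped)
```

With the real data, the two tempfile() paths would simply be replaced by
"titles.txt" and "words.txt".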

What I want is to create a matrix in which cell [i,j] contains the
number of times word i (i.e. the ith word from "words.txt") appears in
line j of "titles.txt".

So, for the data above this would yield (barring any typos on my side):
0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 1 1 0
0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 1 0 0 0 1 0 1 0 0 0 0 2 1 0 0
0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 1 0 0 0
0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0
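
A minimal base-R sketch of one way such a matrix might be built. It
rests on some assumptions: words are matched as whole tokens (so
"technologies" does not count as "technology"), titles are split on
anything that is not a letter or digit (so "united-states" becomes two
tokens), and every entry in the word list is a single word. The three
titles and four words below are a small subset of the data above, typed
in directly so the example is self-contained.

```r
titles <- c("making science - between nature and society",
            "the policy of science and technology - evolution of research policy",
            "technology assessment and the sociopolitics of health technologies")
words <- c("technology", "science", "policy", "society")

# Split each title into lower-case tokens on runs of non-alphanumerics.
tokens <- strsplit(tolower(titles), "[^[:alnum:]]+")

# For each title, count how many tokens equal each word.
# match() maps tokens to positions in 'words' (NA if absent);
# tabulate() counts those positions and ignores the NAs.
counts <- vapply(tokens,
                 function(tok) tabulate(match(tok, words),
                                        nbins = length(words)),
                 integer(length(words)))

# Rows are words, columns are title lines: count_matrix[i, j] is the
# number of times word i occurs in title j.
count_matrix <- matrix(counts, nrow = length(words),
                       dimnames = list(words, seq_along(titles)))
count_matrix
```

If substring matches or multi-word phrases were wanted instead, the
tokenising step would have to be replaced, for example by a regular
expression search per word; that is a different design and is not shown
here.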

This is the precursor to co-word analysis and some basic statistics on
these titles and abstracts.
I have always had a hard time working with text in R and still have no
idea how to achieve the results above. I am probably overlooking
something pretty straightforward. But right now, I am completely in the
dark. 

Any help is very much appreciated,

Peter Verbeet



______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
