Re: [R] working with texts

Helter Two Mon, 06 Jul 2009 07:43:12 -0700

WinXP, R-2.9.1


Thanks to Don for the suggestion and especially to Phil for the perfect
and elegant solution! It shows that R can work small miracles in the
hands of those who really know how to use it.
Since my original request turned out to be easy to solve, let me move
one step back in the process. The original file to be analyzed is
constructed from ISI Web of Science output. The output looks as follows:

>>>>>
FN ISI Export Format
VR 1.0
PT J
TI Unmixing for Race Making in Brazil
AU Bailey, SR
SO AMERICAN JOURNAL OF SOCIOLOGY
VL 114
IS 3
BP 577
EP 614
PY 2008
TC 0
AB This article analyzes race-targeted policy in Brazil as both a
political stake and a powerful instrument in an unfolding
classificatory struggle over the definition of racial boundaries. The
Brazilian state traditionally embraced mixed-race classification, but
is adopting racial quotas employing a black/ white scheme. To explore
potential consequences of that turn for beneficiary identification and
boundary formation, the author analyzes attitudinal survey data on
race-targeted policy and racial classification in multiple formats,
including classification in comparison to photographs. The results show
that almost half of the mixed-race sample, when constrained to
dichotomous classification, opts for whiteness, a majority rejects
mixed-race individuals for quotas, and the mention of quotas for blacks
in a split-ballot experiment nearly doubles the percentage choosing
that racial category. Theories of how states make race emphasize the
use of official categories to legislate exclusion. In contrast,
analysis of the Brazilian case illuminates how states may also make
race through policies of official inclusion.

UT WOS:000262893500001
SN 0002-9602
ER

PT J
TI A Preference-Opportunity-Choice Framework with Applications to
Intergroup Friendship
AU Zeng, Z
Xie, Y
SO AMERICAN JOURNAL OF SOCIOLOGY
VL 114
IS 3
BP 615
EP 648
PY 2008
TC 0
AB A long-standing objective of friendship research is to identify the
effects of personal preference and structural opportunity on intergroup
friendship choice. Although past studies have used various methods to
separate preference from opportunity, researchers have not yet
systematically compared the properties and implications of these
methods. This study puts forward a general framework for discrete
choice, where choice probability is specified as proportional to the
product of preference and opportunity. To implement this framework, the
authors propose a modification to the conditional logit model for
estimating preference parameters free from the influence of opportunity
structure and then compare this approach to several alternative methods
for separating preference and opportunity used in the friendship choice
literature. As an empirical example, the authors test hypotheses of
homophily and status asymmetry in friendship choice using data from the
National Longitudinal Study of Adolescent Health. The example also
demonstrates the approach of conducting a sensitivity analysis to
examine how parameter estimates vary by specification of the
opportunity structure.

UT WOS:000262893500002
SN 0002-9602
ER

PT J
TI The Ethnic Roots of Class Universalism: Rethinking the "Russian"
Revolutionary Elite
AU Riga, L
SO AMERICAN JOURNAL OF SOCIOLOGY
VL 114
IS 3
BP 649
EP 705
PY 2008
TC 0
AB This article retrieves the ethnic roots that underlie a universalist
class ideology. Focusing empirically on the emergence of Bolshevism, it
provides biographical analysis of the Russian Revolution's elite,
finding that two-thirds were ethnic minorities from across the Russian
Empire. After exploring class and ethnicity as intersectional
experiences of varying significance to the Bolsheviks' revolutionary
politics, this article suggests that socialism's class universalism
found affinity with those seeking secularism in response to religious
tensions, a universalist politics where ethnic violence and
sectarianism were exclusionary, and an ethnically neutral and tolerant
"imperial" imaginary where Russification and geopolitics were
particularly threatening or imperial cultural frameworks predominated.
The claim is made that socialism's class universalism was as much a
product of ethnic particularism as it was constituted by it.

UT WOS:000262893500003
SN 0002-9602
ER

EF
<<<<<

I only want to keep the title (after TI) and the abstract (after AB).
Everything else needs to go. So the final text file should look as
follows:

>>>>>>>>>>>>>>
Unmixing for Race Making in Brazil
This article analyzes race-targeted policy in Brazil as both a
political stake and a powerful instrument in an unfolding
classificatory struggle over the definition of racial boundaries. The
Brazilian state traditionally embraced mixed-race classification, but
is adopting racial quotas employing a black/ white scheme. To explore
potential consequences of that turn for beneficiary identification and
boundary formation, the author analyzes attitudinal survey data on
race-targeted policy and racial classification in multiple formats,
including classification in comparison to photographs. The results show
that almost half of the mixed-race sample, when constrained to
dichotomous classification, opts for whiteness, a majority rejects
mixed-race individuals for quotas, and the mention of quotas for blacks
in a split-ballot experiment nearly doubles the percentage choosing
that racial category. Theories of how states make race emphasize the
use of official categories to legislate exclusion. In contrast,
analysis of the Brazilian case illuminates how states may also make
race through policies of official inclusion.
A Preference-Opportunity-Choice Framework with Applications to
Intergroup Friendship
A long-standing objective of friendship research is to identify the
effects of personal preference and structural opportunity on intergroup
friendship choice. Although past studies have used various methods to
separate preference from opportunity, researchers have not yet
systematically compared the properties and implications of these
methods. This study puts forward a general framework for discrete
choice, where choice probability is specified as proportional to the
product of preference and opportunity. To implement this framework, the
authors propose a modification to the conditional logit model for
estimating preference parameters free from the influence of opportunity
structure and then compare this approach to several alternative methods
for separating preference and opportunity used in the friendship choice
literature. As an empirical example, the authors test hypotheses of
homophily and status asymmetry in friendship choice using data from the
National Longitudinal Study of Adolescent Health. The example also
demonstrates the approach of conducting a sensitivity analysis to
examine how parameter estimates vary by specification of the
opportunity structure.
The Ethnic Roots of Class Universalism: Rethinking the "Russian"
Revolutionary Elite
This article retrieves the ethnic roots that underlie a universalist
class ideology. Focusing empirically on the emergence of Bolshevism, it
provides biographical analysis of the Russian Revolution's elite,
finding that two-thirds were ethnic minorities from across the Russian
Empire. After exploring class and ethnicity as intersectional
experiences of varying significance to the Bolsheviks' revolutionary
politics, this article suggests that socialism's class universalism
found affinity with those seeking secularism in response to religious
tensions, a universalist politics where ethnic violence and
sectarianism were exclusionary, and an ethnically neutral and tolerant
"imperial" imaginary where Russification and geopolitics were
particularly threatening or imperial cultural frameworks predominated.
The claim is made that socialism's class universalism was as much a
product of ethnic particularism as it was constituted by it.
<<<<<<<<<<<<<<

Preferably, each title would be on 1 line, each abstract on 1 line, the
next title on 1 line, etc.
(The "1 line" form is preferable because I would then like to calculate
co-word occurrence within titles or abstracts, i.e. across each title or
abstract as a whole. Since the original ISI Web of Science file breaks
the titles and abstracts up into multiple lines, I need to reconfigure
them as 1 line each, so I can determine co-words within that line and
thus within each title or abstract.)
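
One way the extraction described above might be sketched in base R is
shown below. This is only an illustration, not Phil's solution: the
helper name extract_ti_ab() is made up, and it assumes that every field
starts with a two-character uppercase tag (TI, AB, ER, ...) and that any
other non-blank line continues the field before it. A wrapped line that
happens to begin with two capitals and a space would be misread as a
new field.

```r
# Sketch: keep only the TI (title) and AB (abstract) fields of an ISI export,
# each joined onto a single line. extract_ti_ab() is a hypothetical helper name.
extract_ti_ab <- function(lines) {
  lines <- lines[nzchar(trimws(lines))]           # drop blank lines
  # Assumption: a field starts with a two-character tag such as "TI" or "ER";
  # every other line is a continuation of the field before it.
  is_tag <- grepl("^[A-Z][A-Z0-9]( |$)", lines)
  field_id <- cumsum(is_tag)                      # same id for a tag line and its continuations
  keep <- field_id > 0                            # skip anything before the first tag
  fields <- tapply(trimws(lines[keep]), field_id[keep], paste, collapse = " ")
  fields <- unname(as.character(fields))
  tag <- substr(fields, 1, 2)
  # Keep titles and abstracts only, then strip the tag and the space after it
  sub("^.. ?", "", fields[tag %in% c("TI", "AB")])
}

# Excerpt of the second record above (abstract cut short for brevity)
isi <- c(
  "PT J",
  "TI A Preference-Opportunity-Choice Framework with Applications to",
  "Intergroup Friendship",
  "AU Zeng, Z",
  "Xie, Y",
  "AB A long-standing objective of friendship research is to identify the",
  "effects of personal preference and structural opportunity on intergroup",
  "friendship choice.",
  "",
  "UT WOS:000262893500002",
  "ER"
)
out <- extract_ti_ab(isi)
print(out)

# With real files (file names here are placeholders):
# writeLines(extract_ti_ab(readLines("isi_export.txt")), "titles.txt")
```

The result alternates title, abstract, title, abstract, one element per
line, which is the layout requested above.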

Considering the elegance of Phil's solution, I am hoping there is a
straightforward way to do this in R as well. It would save me from
having to go through many, many lines manually.
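
For the co-word step that follows, a rough sketch (again base R, and not
necessarily the approach Phil posted) could count each word per line and
then derive co-occurrences from that matrix. The names count_words(),
present and cooc are invented for this illustration, and the
tokenization rule (split on anything that is not a letter, digit,
hyphen or apostrophe) is an assumption; under it "united-states" stays
one token and "technologies" does not count as "technology".

```r
# Sketch: cell [i, j] = number of times word i occurs in line j.
# count_words() is a hypothetical helper; the tokenization is an assumption.
count_words <- function(words, titles) {
  # Lowercase, then split on runs of characters that cannot be part of a word
  tokens <- strsplit(tolower(titles), "[^a-z0-9'-]+")
  counts <- vapply(tokens,
                   function(tk) vapply(words, function(w) sum(tk == w), integer(1)),
                   integer(length(words)))
  matrix(counts, nrow = length(words), dimnames = list(words, NULL))
}

words <- c("technology", "technological", "science", "policy", "history")
titles <- c(
  "a brief history of polio vaccines",
  "early warning in the light of theories of technological change",
  paste("the policy of science and technology - evolution of research policy -",
        "france, the united-kingdom, the federal-republic-of-germany, japan,",
        "the united-states - french")
)

m <- count_words(words, titles)
print(m)

# Co-word matrix: number of lines in which word i and word j both appear
present <- (m > 0) * 1
cooc <- tcrossprod(present)
print(cooc)
```

The diagonal of cooc holds the number of lines containing each word, so
it can be ignored or zeroed for a pure co-occurrence table.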

thanks, 
Peter Verbeet



<-----Original Message-----> 
>From: Don MacQueen [m...@llnl.gov]
>Sent: 7/2/2009 10:53:07 PM
>To: helter...@care2.com;r-help@r-project.org
>Subject: Re: [R] working with texts
>
> From the CRAN task view on natural language processing:
>
>(package) tm provides a comprehensive text mining framework for R. 
>The Journal of Statistical Software article Text Mining 
>Infrastructure in R gives a detailed overview and presents 
>techniques for count-based analysis methods, text clustering, text 
>classification and string kernels.
>
>Worth looking into?
>
>-Don
>
>At 12:59 PM -0700 7/2/09, Helter Two wrote:
>>WinXP, R-2.9.1
>>
>>LS.,
>>
>>I have been trying to solve a (for me) tricky issue. No matter what I've
>>tried, I just can't find a way to do this.
>>This is the issue:
>>
>>I have a text file (ansi text) "titles.txt" with lines of text; here is
>>an example of such a file:
>>
>>>>>>>
>>a brief history of polio vaccines
>>anti-vaccination movements and their interpretations
>>early warning in the light of theories of technological change
>>international mobility among nordic doctoral students
>>land of hope and glory: exploring cochlear implantation in the netherlands
>>making science - between nature and society
>>medical innovations in historical-perspective
>>photographing medicine - images and power in britain and america since 1840
>>shifts in global immunisation goals (1984-2004): unfinished agendas and mixed results
>>striking the mother lode in science - the importance of age, place, and time
>>technology assessment and the sociopolitics of health technologies
>>the policy of science and technology - evolution of research policy - france, the united-kingdom, the federal-republic-of-germany, japan, the united-states - french
>>vaccine independence, local competences and globalisation: lessons from the history of pertussis vaccines
>>external assessment and conditional financing of research in dutch universities
>>histories of cochlear implantation
>>lock in, the state and vaccine development: lessons from the history of the polio vaccines
>>peerless science - peer-review and united-states science policy
>>technology, science, and obstetric practice - the origins and transformation of cephalopelvimetry
>>the rhetoric and counter-rhetoric of a ''bionic'' technology
>>vaccine innovation and adoption: polio vaccines in the uk, the netherlands and west germany, 1955-1965
>><<<<<
>>
>>Some of the lines in such a file are very long (not in this example).
>>The file contains titles and abstracts of scientific articles.
>>
>>In addition to this file, I also have a file "words.txt" that includes a
>>set of words I want to analyze. Part of this file:
>>>>>>>
>>technology
>>technological
>>innovations
>>science
>>policy
>>society
>>history
>><<<<<
>>
>>What I want is to create a matrix in which cell [i,j] contains the
>>number of times word i (i.e. the ith word from "words.txt") appears in
>>line j of "titles.txt".
>>
>>So, for the data above this would yield (barring any typos on my side):
>>0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 1 1 0
>>0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
>>0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
>>0 0 0 0 0 1 0 0 0 1 0 1 0 0 0 0 1 1 0 0
>>0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 1 0 0 0
>>0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
>>1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0
>>
>>This is the precursor to co-word analysis and some basic statistics on
>>these titles and abstracts.
>>I have always had a hard time working with text in R and still have no
>>idea how to achieve the results above. I am probably overlooking
>>something pretty straightforward. But right now, I am completely in the
>>dark.
>>
>>Any help is very much appreciated,
>>
>>Peter Verbeet
>>
>>
>> [[alternative HTML version deleted]]
>>
>>______________________________________________
>>R-help@r-project.org mailing list
>>https://stat.ethz.ch/mailman/listinfo/r-help
>>PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>and provide commented, minimal, self-contained, reproducible code.
>
>
>-- 
>--------------------------------------
>Don MacQueen
>Environmental Protection Department
>Lawrence Livermore National Laboratory
>Livermore, CA, USA
>925-423-1062
>--------------------------------------
>.
> 
