WinXP, R-2.9.1
Thanks to Don for the suggestion, and especially to Phil for the perfect and elegant solution! It shows that R is a small miracle in the hands of those who really know how to work it.

Since my original request turned out to be easy to solve, let me move one step back in the process. The original file to be analyzed is constructed from ISI Web of Science output. The output looks as follows:

>>>>>
FN ISI Export Format
VR 1.0
PT J
TI Unmixing for Race Making in Brazil
AU Bailey, SR
SO AMERICAN JOURNAL OF SOCIOLOGY
VL 114
IS 3
BP 577
EP 614
PY 2008
TC 0
AB This article analyzes race-targeted policy in Brazil as both a political stake and a powerful instrument in an unfolding classificatory struggle over the definition of racial boundaries. The Brazilian state traditionally embraced mixed-race classification, but is adopting racial quotas employing a black/white scheme. To explore potential consequences of that turn for beneficiary identification and boundary formation, the author analyzes attitudinal survey data on race-targeted policy and racial classification in multiple formats, including classification in comparison to photographs. The results show that almost half of the mixed-race sample, when constrained to dichotomous classification, opts for whiteness, a majority rejects mixed-race individuals for quotas, and the mention of quotas for blacks in a split-ballot experiment nearly doubles the percentage choosing that racial category. Theories of how states make race emphasize the use of official categories to legislate exclusion. In contrast, analysis of the Brazilian case illuminates how states may also make race through policies of official inclusion.
UT WOS:000262893500001
SN 0002-9602
ER

PT J
TI A Preference-Opportunity-Choice Framework with Applications to Intergroup Friendship
AU Zeng, Z
   Xie, Y
SO AMERICAN JOURNAL OF SOCIOLOGY
VL 114
IS 3
BP 615
EP 648
PY 2008
TC 0
AB A long-standing objective of friendship research is to identify the effects of personal preference and structural opportunity on intergroup friendship choice. Although past studies have used various methods to separate preference from opportunity, researchers have not yet systematically compared the properties and implications of these methods. This study puts forward a general framework for discrete choice, where choice probability is specified as proportional to the product of preference and opportunity. To implement this framework, the authors propose a modification to the conditional logit model for estimating preference parameters free from the influence of opportunity structure and then compare this approach to several alternative methods for separating preference and opportunity used in the friendship choice literature. As an empirical example, the authors test hypotheses of homophily and status asymmetry in friendship choice using data from the National Longitudinal Study of Adolescent Health. The example also demonstrates the approach of conducting a sensitivity analysis to examine how parameter estimates vary by specification of the opportunity structure.
UT WOS:000262893500002
SN 0002-9602
ER

PT J
TI The Ethnic Roots of Class Universalism: Rethinking the "Russian" Revolutionary Elite
AU Riga, L
SO AMERICAN JOURNAL OF SOCIOLOGY
VL 114
IS 3
BP 649
EP 705
PY 2008
TC 0
AB This article retrieves the ethnic roots that underlie a universalist class ideology. Focusing empirically on the emergence of Bolshevism, it provides biographical analysis of the Russian Revolution's elite, finding that two-thirds were ethnic minorities from across the Russian Empire.
After exploring class and ethnicity as intersectional experiences of varying significance to the Bolsheviks' revolutionary politics, this article suggests that socialism's class universalism found affinity with those seeking secularism in response to religious tensions, a universalist politics where ethnic violence and sectarianism were exclusionary, and an ethnically neutral and tolerant "imperial" imaginary where Russification and geopolitics were particularly threatening or imperial cultural frameworks predominated. The claim is made that socialism's class universalism was as much a product of ethnic particularism as it was constituted by it.
UT WOS:000262893500003
SN 0002-9602
ER

EF
>>>>>>>>>>

I only want to keep the title (after TI) and the abstract (after AB). All the rest needs to go. So the final text file needs to be as follows:

>>>>>>>>>>>>>>
Unmixing for Race Making in Brazil
This article analyzes race-targeted policy in Brazil as both a political stake and a powerful instrument in an unfolding classificatory struggle over the definition of racial boundaries. The Brazilian state traditionally embraced mixed-race classification, but is adopting racial quotas employing a black/white scheme. To explore potential consequences of that turn for beneficiary identification and boundary formation, the author analyzes attitudinal survey data on race-targeted policy and racial classification in multiple formats, including classification in comparison to photographs. The results show that almost half of the mixed-race sample, when constrained to dichotomous classification, opts for whiteness, a majority rejects mixed-race individuals for quotas, and the mention of quotas for blacks in a split-ballot experiment nearly doubles the percentage choosing that racial category. Theories of how states make race emphasize the use of official categories to legislate exclusion.
In contrast, analysis of the Brazilian case illuminates how states may also make race through policies of official inclusion.
A Preference-Opportunity-Choice Framework with Applications to Intergroup Friendship
A long-standing objective of friendship research is to identify the effects of personal preference and structural opportunity on intergroup friendship choice. Although past studies have used various methods to separate preference from opportunity, researchers have not yet systematically compared the properties and implications of these methods. This study puts forward a general framework for discrete choice, where choice probability is specified as proportional to the product of preference and opportunity. To implement this framework, the authors propose a modification to the conditional logit model for estimating preference parameters free from the influence of opportunity structure and then compare this approach to several alternative methods for separating preference and opportunity used in the friendship choice literature. As an empirical example, the authors test hypotheses of homophily and status asymmetry in friendship choice using data from the National Longitudinal Study of Adolescent Health. The example also demonstrates the approach of conducting a sensitivity analysis to examine how parameter estimates vary by specification of the opportunity structure.
The Ethnic Roots of Class Universalism: Rethinking the "Russian" Revolutionary Elite
This article retrieves the ethnic roots that underlie a universalist class ideology. Focusing empirically on the emergence of Bolshevism, it provides biographical analysis of the Russian Revolution's elite, finding that two-thirds were ethnic minorities from across the Russian Empire.
After exploring class and ethnicity as intersectional experiences of varying significance to the Bolsheviks' revolutionary politics, this article suggests that socialism's class universalism found affinity with those seeking secularism in response to religious tensions, a universalist politics where ethnic violence and sectarianism were exclusionary, and an ethnically neutral and tolerant "imperial" imaginary where Russification and geopolitics were particularly threatening or imperial cultural frameworks predominated. The claim is made that socialism's class universalism was as much a product of ethnic particularism as it was constituted by it.
<<<<<<<<<<<<<<

Preferably even with the title on 1 line, the abstract on 1 line, the next title on 1 line, etc. The "1 line" is preferable because I would then like to calculate co-word occurrence within titles or abstracts (so, across the abstract as a whole). Since the original ISI Web of Science file breaks the titles and abstracts up into multiple lines, I need to reconfigure them as 1 line each, so that I can determine co-words within that line and thus within each title or abstract.

Considering the elegance of Phil's solution, I am hoping there is a straightforward way to do this in R as well. It would save me from having to go through many, many lines manually.

Thanks,
Peter Verbeet

<-----Original Message----->
>From: Don MacQueen [m...@llnl.gov]
>Sent: 7/2/2009 10:53:07 PM
>To: helter...@care2.com;r-help@r-project.org
>Subject: Re: [R] working with texts
>
> From the CRAN task view on natural language processing:
>
>(package) tm provides a comprehensive text mining framework for R.
>The Journal of Statistical Software article Text Mining
>Infrastructure in R gives a detailed overview and presents
>techniques for count-based analysis methods, text clustering, text
>classification and string kernels.
>
>Worth looking into?
>
>-Don
>
>At 12:59 PM -0700 7/2/09, Helter Two wrote:
>>WinXP, R-2.9.1
>>
>>LS.,
>>
>>I have been trying to solve a (for me) tricky issue. No matter what I've
>>tried, I just can't find a way to do this.
>>This is the issue:
>>
>>I have a text file (ansi text) "titles.txt" with lines of text; here is
>>an example of such a file:
>>
>>>>>>>
>>a brief history of polio vaccines
>>anti-vaccination movements and their interpretations
>>early warning in the light of theories of technological change
>>international mobility among nordic doctoral students
>>land of hope and glory: exploring cochlear implantation in the netherlands
>>making science - between nature and society
>>medical innovations in historical-perspective
>>photographing medicine - images and power in britain and america since 1840
>>shifts in global immunisation goals (1984-2004): unfinished agendas and mixed results
>>striking the mother lode in science - the importance of age, place, and time
>>technology assessment and the sociopolitics of health technologies
>>the policy of science and technology - evolution of research policy - france, the united-kingdom, the federal-republic-of-germany, japan, the united-states - french
>>vaccine independence, local competences and globalisation: lessons from the history of pertussis vaccines
>>external assessment and conditional financing of research in dutch universities
>>histories of cochlear implantation
>>lock in, the state and vaccine development: lessons from the history of the polio vaccines
>>peerless science - peer-review and united-states science policy
>>technology, science, and obstetric practice - the origins and transformation of cephalopelvimetry
>>the rhetoric and counter-rhetoric of a ''bionic'' technology
>>vaccine innovation and adoption: polio vaccines in the uk, the netherlands and west germany, 1955-1965
>><<<<<
>>
>>Some of the lines in such a file are very long (not in this example).
>>The file contains titles and abstracts of scientific articles.
>>
>>In addition to this file, I also have a file "words.txt" that includes a
>>set of words I want to analyze. Part of this file:
>>
>>>>>
>>technology
>>technological
>>innovations
>>science
>>policy
>>society
>>history
>><<<<<
>>
>>What I want is to create a matrix in which cell [i,j] contains the
>>number of times word i (i.e. the ith word from "words.txt") appears in
>>line j of "titles.txt".
>>
>>So, for the data above this would yield (barring any typos on my side):
>>0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 1 1 0
>>0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
>>0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
>>0 0 0 0 0 1 0 0 0 1 0 1 0 0 0 0 1 1 0 0
>>0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 1 0 0 0
>>0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
>>1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0
>>
>>This is the precursor to co-word analysis and some basic statistics on
>>these titles and abstracts.
>>I have always had a hard time working with text in R and still have no
>>idea how to achieve the results above. I am probably overlooking
>>something pretty straightforward. But right now, I am completely in the
>>dark.
>>
>>Any help is very much appreciated,
>>
>>Peter Verbeet
>>
>>
>> [[alternative HTML version deleted]]
>>
>>______________________________________________
>>R-help@r-project.org mailing list
>>https://stat.ethz.ch/mailman/listinfo/r-help
>>PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>and provide commented, minimal, self-contained, reproducible code.
>
>
>--
>--------------------------------------
>Don MacQueen
>Environmental Protection Department
>Lawrence Livermore National Laboratory
>Livermore, CA, USA
>925-423-1062
>--------------------------------------
>.
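
For reference, the word-by-line count matrix described in the quoted message could be built along the following lines. This is only a rough sketch and not Phil's solution (which is not reproduced in this message). The function name `count_words` is made up for illustration, and the sketch assumes lower-case ASCII text in which anything that is not a letter separates words, so "technologies" does not count as "technology".

```r
# Hypothetical helper: rows are words, columns are lines of text.
# Assumption: text is plain ASCII; every non-letter character is a word break.
count_words <- function(words, titles) {
  tokens <- strsplit(tolower(titles), "[^a-z]+")
  counts <- vapply(
    tokens,
    # match() gives the position of each token in `words` (NA if absent);
    # tabulate() counts those positions and ignores the NAs.
    function(tok) tabulate(match(tok, words), nbins = length(words)),
    integer(length(words))
  )
  matrix(counts, nrow = length(words), dimnames = list(words, NULL))
}

# Three of the titles from the example file above.
titles <- c(
  "technology assessment and the sociopolitics of health technologies",
  "peerless science - peer-review and united-states science policy",
  "making science - between nature and society"
)
words <- c("technology", "science", "policy")

m <- count_words(words, titles)
print(m)

# Co-word counts: number of lines in which two words both occur at least once.
b <- (m > 0) * 1L
cooc <- b %*% t(b)
print(cooc)
```

With real data, `titles` and `words` would presumably come from `readLines("titles.txt")` and `readLines("words.txt")`.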
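
To make the title/abstract extraction requested above concrete, here is one way it might be approached in R. It is a sketch under assumptions, not a tested solution for real exports. The function name `extract_ti_ab` and the file names in the comments are invented for illustration. The sketch assumes that every field starts with a two-character upper-case tag followed by a space, that any other non-empty line continues the previous field, and that `trimws()` is available (it is not in R 2.9.1, so a recent R is assumed).

```r
# Hypothetical helper: return titles (TI) and abstracts (AB), one string each,
# in the order in which they appear in the export.
extract_ti_ab <- function(lines) {
  lines <- lines[nzchar(trimws(lines))]            # drop empty lines
  is_tag <- grepl("^[A-Z][A-Z0-9]( |$)", lines)    # lines that open a new field
  field_id <- cumsum(is_tag)                       # continuation lines share an id
  keep <- field_id > 0                             # ignore text before the first tag
  # Glue all lines of a field together, then collapse runs of whitespace.
  fields <- tapply(trimws(lines[keep]), field_id[keep], paste, collapse = " ")
  fields <- gsub("[[:space:]]+", " ", as.vector(fields))
  tags <- substr(fields, 1, 2)
  text <- trimws(substring(fields, 3))
  text[tags %in% c("TI", "AB")]
}

# Shortened sample record, with the title and abstract broken over two lines.
sample_lines <- c(
  "FN ISI Export Format",
  "VR 1.0",
  "PT J",
  "TI Unmixing for Race Making",
  "   in Brazil",
  "AU Bailey, SR",
  "AB This article analyzes race-targeted policy in Brazil",
  "   as both a political stake and a powerful instrument.",
  "ER",
  "EF"
)

out <- extract_ti_ab(sample_lines)
print(out)

# With a real file (both file names are placeholders):
# writeLines(extract_ti_ab(readLines("isi_export.txt")), "titles_abstracts.txt")
```

One caveat with this approach: an unindented continuation line that happens to begin with two capitals and a space (for example "US ") would be mistaken for a new field, so the result is worth spot-checking against the original file.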