On 27/09/2015 7:56 AM, Luigi Marongiu wrote:
> Dear all,
> I am reading a txt file into the R environment to create a data frame,
> however I have noticed that some entries have a truncated version of a
> field, so for instance I get "Astro" instead of "Astro 1-Astro 1" and
> "Sapo" for "Sapo #1-Sapo_1" and "Sapo #2-Sapo_2", but I also get
> "Adeno 40/41 EH-Adeno_40-41_EH" so the problem is not in the spaces
> between the words. The txt file is a simple tab delimited file
> generated from Excel which I read with:
>
> bad.data <- read.table(
>   "test_df.txt",
>   header = TRUE,
>   row.names = 1,
>   dec = ".",
>   sep = "\t",
>   stringsAsFactors = FALSE,
>   fill = TRUE
> )
>
> [the fill = TRUE was introduced because in the real case I got an
> error of a missing line.]

See the "comment.char" argument to read.table. By default the "#"
character marks a comment, as in R code.

Duncan Murdoch

>
> I can recreate this file as follows:
> sample <- c(rep("p.001", 48), rep("p.547", 48))
> target <- c(
>   "Adeno 1-Adeno 1", "Adeno 40/41 EH-AIQJCT3", "Astro 1-Astro 1",
>   "Sapo 1-Sapo 1", "Sapo 2-Sapo 2", "Enterovirus 1-Enterovirus 1",
>   "Parechovirus-Parechovirus", "HEV 1-HEV 1",
>   "IC PDV control-AIRSA0B", "Rotavirus cam-Rotavirus cam",
>   "18S-Hs99999901_s1", "Noro gp II-Noro gp II", "Noro gp 1-Noro gp 1",
>   "Noro gp 1 mod33-Noro gp 1 mod33", "C difficile GDH-AIS086J",
>   "C difficile Tox B-C difficile Tox B", "VTX 1-AIT97CR",
>   "BT control Man-AIVI5IZ", "E. coli vtx 2-E. coli vtx 2",
>   "Campy spp-AIWR3O7", "Salmonella ttr-AIX01VF", "Crypto CP2-AIY9Z1N",
>   "Green Fluorescent Protein-AI0IX7V", "Adeno 2-Adeno 2",
>   "Adeno 40_41 Oly-AI1RWD3", "Astro 2 Liu-AI20UKB",
>   "Giardia lambia 1-AI39SQJ", "Rotavirus Liu-Rotavirus Liu 2",
>   "Enterovirus Bruges-Enterovirus 2 Br", "HAV 1-Hepatitis A 1",
>   "HEV 2-AI5IQWR", "MS2 control-AI6RO2Z", "Rotarix NSP2-AI70M87",
>   "CMV br-CMV br", "IC Rnase P-AI89LFF",
>   "Salmonella hil A-Salmonella hil A", "Shigella ipa H-AIAA0K8",
>   "Enteroagg E. coli-AIBJYRG", "Campy jejuni-AICSWXO",
>   "Campy coli-AID1U3W", "Yersinia enterocolitica-AIFAS94",
>   "Bacterial 16S-Bacterial 16S",
>   "Aeromonas hydrophilia-Aeromonas hydrophilia", "V cholerae-AIGJRGC",
>   "Dientamoeba fragilis-AIHSPMK", "Entamoeba histolytica-AII1NSS",
>   "Crypto 2 J-AIKALY0", "Giardia lambia rev-AILJJ48",
>   "Adeno #1-Adeno_1", "Adeno 40/41 EH-Adeno_40-41_EH",
>   "Astro #1-Astro_1", "Sapo #1-Sapo_1", "Sapo #2-Sapo_2",
>   "Enterovirus #1-Enterovirus_1", "Parechovirus-Parechovirus",
>   "HEV #1-HEV_1", "C coli jejuni Liu-C_coli_jejuni_Li",
>   "Rotavirus cam-Rotavirus_cam", "IC 18s-IC 18s",
>   "Noro gp II-Noro_gp_II", "Noro gp 1-Noro_gp_1",
>   "Noro gp 1 mod33-Noro_gp_1_mod33", "C difficile GDH-C-difficile_GDH",
>   "C difficile Tox B-C_difficile_T_B", "E. coli vtx 1-E_coli_vtx_1",
>   "BT control Man-BT_control_Man", "E. coli vtx 2-E_coli_vtx_2",
>   "Campy spp NEW-Campy_spp_NEW", "Salmonella ttr-Salmonella_ttr",
>   "Cryptosporidium spp CP2-Cryptos_spp_CP2", "C jejuni #2-C_jejuni_2",
>   "Adeno #2-Adeno_2", "Adeno 40/41 Oly-Adeno_40-41_Oly",
>   "Astro Liu #2-Astro_Liu_2", "Giardia lambia #1-Giardia_lambia_1",
>   "Rotavirus Liu #2-Rotavirus_Liu_2",
>   "Enterovirus #2 Br-Enterovirus_2_Br", "Hepatitis A #1-Hepatitis_A_1",
>   "HEV #2-HEV_2", "MS2 control-MS2_control",
>   "Rotarix NSP2 Bris-Rotarix_NSP2_Bri", "CMV br-CMV_br",
>   "Rnase P control-Rnase_P_control",
>   "Salmonella hil A-Salmonella_hil_A", "Shigella ipa H-Shigella_ipa_H",
>   "Enteroagg E. coli-Enteroagg_E_coli",
>   "V parahaemolyticus-V_p_haemolyticus", "Campy coli-Campy_coli",
>   "Yersinia enterocolitica-Y_enterocolitica",
>   "Bacterial 16S-Bacterial_16S",
>   "Aeromonas hydrophilia-Aero_hydrophilia",
>   "Vibrio cholerae-Vibrio_cholerae",
>   "Dientamoeba fragilis-Dien_fragilis",
>   "Entamoeba histolytica-Enta_histolytica",
>   "Cryptosporidium spp #2 J-Crypto_spp_2_J",
>   "Giardia lambia #2 rev-Giardia_lambia_r"
> )
> ct <- c(NA, NA, NA, NA, NA, NA, NA, NA, NA,
>   NA, 18.793, NA, NA, NA, NA, NA, NA, 33.302,
>   NA, 32.388, NA, NA, NA, NA, NA, NA, NA, NA,
>   NA, NA, NA, 31.398, NA, NA, NA, NA, NA,
>   NA, NA, NA, NA, 8.115, NA, NA, NA, NA, NA,
>   NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
>   NA, 21.161, NA, NA, NA, NA, NA, NA, 31.302,
>   NA, 29.785, NA, NA, NA, NA, NA, NA, NA,
>   NA, NA, NA, NA, 31.212, 42.967, NA, 33.503,
>   NA, NA, NA, NA, NA, NA, 9.584, NA, NA, NA,
>   NA, NA, NA)
>
> good.data <- data.frame(sample, target, ct, stringsAsFactors = FALSE)
>
> and the structure of these objects is the same:
>> str(good.data)
> 'data.frame': 96 obs. of 3 variables:
> $ sample: chr "p.001" "p.001" "p.001" "p.001" ...
> $ target: chr "Adeno 1-Adeno 1" "Adeno 40/41 EH-AIQJCT3" "Astro 1-Astro 1" "Sapo 1-Sapo 1" ...
> $ ct : num NA NA NA NA NA NA NA NA NA NA ...
>> str(bad.data)
> 'data.frame': 96 obs. of 3 variables:
> $ Sample: chr "p.001" "p.001" "p.001" "p.001" ...
> $ Target: chr "Adeno 1-Adeno 1" "Adeno 40/41 EH-AIQJCT3" "Astro 1-Astro 1" "Sapo 1-Sapo 1" ...
> $ Ct : num NA NA NA NA NA NA NA NA NA NA ...
>
> however in the good.data case the problem with truncation does not
> occur, so for instance I get the required "Astro #1-Astro_1",
> "Sapo #1-Sapo_1" and "Sapo #2-Sapo_2".
> The problem must therefore be in the format of the txt file and the
> read function, possibly in the # character present in the names.
> Could somebody explain to me what the problem is and how to avoid it?
> Many thanks
> Best regards
> Luigi
>
> ______________________________________________
> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
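
For anyone finding this thread later, the comment.char behaviour pointed to in the reply can be reproduced with a small sketch. The three-row file below is a made-up stand-in for test_df.txt (the "Well" column and the temporary path are assumptions, not taken from the original data); the read.table arguments are otherwise the ones from the question. Setting comment.char = "" switches comment handling off, and quote = "" is an optional extra precaution in case names ever contain quote characters.

```r
# Hypothetical stand-in for the tab-delimited "test_df.txt" in the question.
path <- tempfile(fileext = ".txt")
writeLines(c(
  "Well\tSample\tTarget\tCt",
  "1\tp.547\tAstro #1-Astro_1\t21.161",
  "2\tp.547\tSapo #2-Sapo_2\t",
  "3\tp.547\tAdeno 40/41 EH-Adeno_40-41_EH\t"
), path)

# Default comment.char = "#": everything from "#" to the end of the line is
# discarded, so the Target field is cut short and the Ct field is lost
# (which is why fill = TRUE was needed to get past the error).
truncated <- read.table(path, header = TRUE, row.names = 1, dec = ".",
                        sep = "\t", stringsAsFactors = FALSE, fill = TRUE)

# comment.char = "" disables comment handling, so "#" is ordinary text.
intact <- read.table(path, header = TRUE, row.names = 1, dec = ".",
                     sep = "\t", stringsAsFactors = FALSE, fill = TRUE,
                     comment.char = "", quote = "")

print(truncated$Target)
print(intact$Target)
```

With comment handling off, the rows that used to look short should have their full set of fields again, so it may be worth re-checking whether fill = TRUE is still needed for the real file.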