Re: [R] Fwd: sqldf not joining all the fields

David Winsemius Fri, 12 Mar 2010 11:52:08 -0800


On Mar 12, 2010, at 2:15 PM, Natalie Van Zuydam wrote:

Hi,
I did quote the string and as I told read.table to strip.white Ialso tried
it with no spaces.
"ASPIRIN DISP AAH" %in% tsf_data$name
[1] FALSE

"ASPIRINDISPAAH" %in% tsf_data$name
[1] FALSE
I have looked at the last lines of my y_data object and there is aproblem
with the file that I'm going to try to use perl to sort out


> max(tsf_data$item_code)
[1] 123612
Looking at the Firefox version of for this value I see line starting:
123612|MINIJET ADRENALINE 1.5"|INJ|100MCG/ML|2.7.3
... has unmatched '"'

--
David.

if I y_data[2847:2848,]
RONIC ACID|TABS|5MG|6.6.2\n652903|ALENDRONIC ACID Once
Wee|TABS|70MG|1.1.1\n653000|AZATHIOPRINE|INJ|50MG|8.2.1\n653200|DORZOLAMIDEEYE|DROPS|2%|11.6\n653500|TOPIRAMATE|TABS|50MG|4.8.1\n653510|TOPIRAMATE|TABS|100MG|4.8.1\n653513|TOPIRAMATE|TABS|25MG|4.8.1\n653600|GUAR
GUM
SF|SACH|5G|6.1.2.3\n653700|TACALCITOL|OINT|NULL|13.5.2\n654000|COCODAMOL|CAPS|30/500MG|4.7.1\n654010|COCODAMOL|TABS|12.8/500M|4.7.1\n654020|COCODAMOL|SACH|30/500MG|4.7.1\n654300|ACAMPROSATECALCIUM|TABS|333MG|4.10\n654400|ACECLOFENAC|TABS|100MG|10.1.1\n654500|TILUDRONIC
ACID|TABS|200MG|6.6.2\n654600|TAMSULOSIN HCL
MR|CAPS|400MCG|7.4.1\n654800|PENCICLOVIR|CREAM|1%|13.10.3\......
    ........formulation_code strength bnf_code
2847              INJ   1MG/ML    3.4.3
2848
The output looks like this so it must be the input files and not ansqldf
problem.  I didn't see this in the script editor I was using.

Natalie
On Fri, Mar 12, 2010 at 7:05 PM, Dennis Murphy <[email protected]>wrote:
Hi:
Part of the problem is that your string contains spaces; in yourexample
below, you would need to quote the string, I believe, as in

"ASPIRIN DISP AAH" %in% y$name                        [untested...]

HTH,
Dennis


On Fri, Mar 12, 2010 at 10:59 AM, Natalie Van Zuydam <[email protected]
wrote:
842752|ASPIRIN DISP AAH|TABS|75MG|2.9 is taken directly from myy_data
text
file.
If I search for ASPIRIN DISP AAH %in% y$name I get: FALSE.Despite thefact that it is there in the text file that I loaded into y. Theremust be
a
problem with my input.
tsf_data <- read.table("feb09_267_presc_items_tsf.txt", header =TRUE, sep
=
"|", quote = "\"'",
dec = ".",as.is = TRUE,na.strings = "NA",colClasses= NA,
nrows = 3864284,
              skip = 0, check.names = TRUE,fill=TRUE,
              strip.white = TRUE, blank.lines.skip = TRUE,
comment.char = "#", allowEscapes = FALSE, flush =FALSE,
              fileEncoding = "", encoding = "unknown")
Would fill=TRUE and strip.white=TRUE affect how the item_codecolumns arematched? When I look at the file in a script editor I cannot seeanystrange symbols or formatting? I have had to use fill otherwise Iget the
following error message:

Warning message:
In scan(file, what, nmax, sep, dec, quote, skip, nlines,na.strings, :
number of items read is not a multiple of the number of columns
Not sure if correcting this in read.table is affecting the joinfunction
of
sqldb?

Thanks,
Natalie



I also tried 842752

On Fri, Mar 12, 2010 at 6:05 PM, Gabor Grothendieck <
[email protected]
wrote:
That is not so. 842752 does not exist in y$item_code and ASPIRINhas
a code of 22730.
842752 %in% y$item_code
[1] FALSE
subset(y, name == "ASPIRIN")
  item_code    name formulation_code strength bnf_code
850     22730 ASPIRIN             TABS    300MG    4.7.1
855     22780 ASPIRIN              PDR     NULL    4.7.1
856     22790 ASPIRIN             MIXT        $    4.7.1


On Fri, Mar 12, 2010 at 12:51 PM, Natalie Van Zuydam
<[email protected]> wrote:
---------- Forwarded message ----------
From: Natalie Van Zuydam <[email protected]>
Date: Fri, Mar 12, 2010 at 5:49 PM
Subject: Re: [R] sqldf not joining all the fields
To: David Winsemius <[email protected]>


Dear David

I'm not sure what the problem is as for every item code there is a
corresponding information in the y_data. For example 842752from the
x_data
corresponds to Aspirin in the y_data? Yet when I use sqldf tojoin
the
two
df's I get NA values in the columns from the y_data in z for842752
item
code....is there something wrong with my sqldf code or somethingwrong
with
the way I have inputed the data frames?

Thanks for taking the time to help me,
Natalie



On Fri, Mar 12, 2010 at 5:42 PM, David Winsemius <
[email protected]
wrote:
If I assign the file input to y_data and change you sqldf to
z  <- sqldf("select * from x_data left join y_data using
(item_code)"); z
I can replicate your result. Even after changing the types ofthe twoitem_code fields to match I still get the same result and whenI see
to
what
degree they share values I get:
sum(x_data$item_code %in% y_data$item_code)
[1] 2
sum(y_data$item_code %in% x_data$item_code)
[1] 2


So why are you so sure they are "complete" as you claimed in your
first
email.

--
David.



On Mar 12, 2010, at 12:29 PM, David Winsemius wrote:
You have now given two different assignments to x_data and noneto
y_data:
The str( from the file access offering:
str(x_data)
'data.frame':   2848 obs. of  5 variables:
$ item_code : int 100 110 150 160 161 164 200 210 212220 ...
$ name            : chr  "NEONACLEX K" "NEONACLEX" "MESORB"
"ABSORBENT
CELLULOSE MESO" ...
$ formulation_code: chr  "TABS" "TABS" "DRESS" "DRESS" ...
$ strength : chr "NULL" "5MG" "10CMX10CM""10CMX10CM" ...
$ bnf_code        : chr  "2.2.8" "2.2.1" "20.3.1" "20.3.1" ...

The str from assignment from the dput offering
str(x_data)
'data.frame':   10 obs. of  10 variables:
$ prochi         : chr  "CAO0000713" "CAO0000713" "CAO0000713"
"CAO0000713" ...
$ prescribed_date: chr  "22/06/2001" "28/04/2000" "10/04/2000"
"03/07/2000" ...
$ dataMonth      : chr  "NULL" "NULL" "NULL" "NULL" ...
$ item_code      : chr  "842752" "7800" "842652" "842652" ...
$ res_seqno      : chr  "NULL" "NULL" "NULL" "NULL" ...
$ quantity       : chr  "60" "100G" "60" "60" ...
$ directions     : chr  "1/D" "A/TD" "1/D" "1/D" ...
$ no_of_packs    : chr  "NULL" "NULL" "NULL" "NULL" ...
$ datasource     : chr  "TSF" "TSF" "TSF" "TSF" ...
$ scan_ref_no    : chr  "NULL" "NULL" "NULL" "NULL" ...

This code "worked", but it is not clear that the x-y assignments
were
correct:

x_data <- read.table(file="
http://n4.nabble.com/file/n1590804/feb09_267_presc_items_tsf.txt",
header
= TRUE, sep = "|", quote = "\"'",
dec = ".",as.is = TRUE,na.strings ="NA",colClasses =
NA,
nrows = 3864284,
            skip = 0, check.names = TRUE,fill=TRUE,
            strip.white = TRUE, blank.lines.skip = TRUE,
            comment.char = "#", allowEscapes = FALSE, flush =
FALSE,
            fileEncoding = "", encoding = "unknown")

--
David.

On Mar 12, 2010, at 12:23 PM, Newbie19_02 wrote:
The y_data file has over 9000 rows in it so I thought itwould be
more
practical to give you the file to download....
--
View this message in context:
http://n4.nabble.com/sqldf-not-joining-all-the-fields-tp1590786p1590833.html
Sent from the R help mailing list archive at Nabble.com.

______________________________________________
[email protected] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproduciblecode.
David Winsemius, MD
West Hartford, CT

______________________________________________
[email protected] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproduciblecode.
David Winsemius, MD
West Hartford, CT


David Winsemius, MD
West Hartford, CT

______________________________________________
[email protected] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: [R] Fwd: sqldf not joining all the fields

Reply via email to