Re: [R] Fwd: sqldf not joining all the fields

Natalie Van Zuydam Fri, 12 Mar 2010 11:01:33 -0800

The line 842752|ASPIRIN DISP AAH|TABS|75MG|2.9 is taken directly from my
y_data text file.

If I test "ASPIRIN DISP AAH" %in% y$name I get FALSE, even though the row is
there in the text file that I loaded into y. There must be a problem with my
input.
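
A minimal sketch of how such a lookup could be double-checked is below. The data frame y_check and its values are made up for illustration (only the column names follow the y_data file), and stray whitespace is just one thing to rule out, not a confirmed cause.

```r
# Toy stand-in for the loaded lookup table (hypothetical values).
y_check <- data.frame(
  item_code = c("22730", "842752"),
  name      = c("ASPIRIN", "ASPIRIN DISP AAH "),  # note the trailing space
  stringsAsFactors = FALSE
)

# An exact match fails when stray whitespace is present.
exact_hit <- "ASPIRIN DISP AAH" %in% y_check$name

# Strip leading/trailing whitespace before comparing.
clean_name <- gsub("^[[:space:]]+|[[:space:]]+$", "", y_check$name)
clean_hit  <- "ASPIRIN DISP AAH" %in% clean_name

# A partial match shows whether the text occurs anywhere in the column.
partial_rows <- grep("ASPIRIN DISP", y_check$name, fixed = TRUE)
```

If the partial match finds nothing either, the row most likely never made it into the data frame, which points at the read step rather than the comparison.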

tsf_data <- read.table("feb09_267_presc_items_tsf.txt", header = TRUE,
                sep = "|", quote = "\"'",
                dec = ".", as.is = TRUE, na.strings = "NA", colClasses = NA,
                nrows = 3864284,
                skip = 0, check.names = TRUE, fill = TRUE,
                strip.white = TRUE, blank.lines.skip = TRUE,
                comment.char = "#", allowEscapes = FALSE, flush = FALSE,
                fileEncoding = "", encoding = "unknown")

Would fill=TRUE and strip.white=TRUE affect how the item_code columns are
matched?  When I look at the file in a script editor I cannot see any
strange symbols or formatting.  I have had to use fill=TRUE because otherwise
I get the following message:

Warning message:
In scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  :
  number of items read is not a multiple of the number of columns
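
One possible way to locate the lines behind a message like this is count.fields(), comparing the field counts with and without quote handling. The sketch below uses a small made-up file (only the header follows the real layout) and assumes, without confirmation from the real data, that an apostrophe inside a name is what upsets the parser.

```r
# Write a small pipe-delimited file with apostrophes in two names
# (made-up rows; demo_file is a temporary file, not the real data).
demo_file <- tempfile(fileext = ".txt")
writeLines(c(
  "item_code|name|formulation_code|strength|bnf_code",
  "100|NEONACLEX K|TABS|NULL|2.2.8",
  "200|WOMEN'S TONIC|MIXT|NULL|9.9",
  "842752|ASPIRIN DISP AAH|TABS|75MG|2.9",
  "300|DOCTOR'S CREAM|CRM|NULL|13.1"
), demo_file)

# Fields per line when ' and " are treated as quote characters,
# as in the read.table call above. Lines caught inside a quoted
# string may show up as NA or with an unexpected count.
n_quoted <- count.fields(demo_file, sep = "|", quote = "\"'")

# Fields per line with quoting switched off.
n_plain <- count.fields(demo_file, sep = "|", quote = "")

# Reading with quoting off keeps every row as its own record.
y_plain <- read.table(demo_file, header = TRUE, sep = "|", quote = "",
                      as.is = TRUE, comment.char = "", strip.white = TRUE)
```

If n_quoted and n_plain disagree on the real file, the lines where they differ are the ones worth inspecting, and reading with quote = "" may remove the need for fill=TRUE.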

Not sure if correcting this in read.table is affecting the join function of
sqldf?
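
For reference, a left join returns NA in the right-hand columns for any key that is absent from the right-hand table, however the tables were read. The sketch below shows this with base R merge() as a stand-in for the sqldf left join; x_demo and y_demo are hypothetical miniature tables, not the real data.

```r
# Hypothetical miniature versions of the two tables.
x_demo <- data.frame(item_code = c(842752L, 7800L),
                     quantity  = c("60", "100G"),
                     stringsAsFactors = FALSE)
y_demo <- data.frame(item_code = c(7800L, 22730L),
                     name      = c("EXAMPLE CREAM", "ASPIRIN"),
                     stringsAsFactors = FALSE)

# Left join: keep every row of x_demo, fill unmatched y columns with NA.
z_demo <- merge(x_demo, y_demo, by = "item_code", all.x = TRUE)

# Codes in x with no partner in y are the ones that come back as NA.
missing_codes <- setdiff(x_demo$item_code, y_demo$item_code)
```

Running setdiff() on the real item_code columns would list exactly which codes the join cannot match.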

Thanks,
Natalie



I also tried 842752

On Fri, Mar 12, 2010 at 6:05 PM, Gabor Grothendieck <[email protected]> wrote:

> That is not so.  842752 does not exist in y$item_code and ASPIRIN has
> a code of 22730.
>
> > 842752 %in% y$item_code
> [1] FALSE
>
> > subset(y, name == "ASPIRIN")
>    item_code    name formulation_code strength bnf_code
> 850     22730 ASPIRIN             TABS    300MG    4.7.1
> 855     22780 ASPIRIN              PDR     NULL    4.7.1
> 856     22790 ASPIRIN             MIXT        $    4.7.1
>
>
> On Fri, Mar 12, 2010 at 12:51 PM, Natalie Van Zuydam
> <[email protected]> wrote:
> > ---------- Forwarded message ----------
> > From: Natalie Van Zuydam <[email protected]>
> > Date: Fri, Mar 12, 2010 at 5:49 PM
> > Subject: Re: [R] sqldf not joining all the fields
> > To: David Winsemius <[email protected]>
> >
> >
> > Dear David
> >
> > I'm not sure what the problem is, as for every item code there is
> > corresponding information in the y_data.  For example, 842752 from the
> > x_data corresponds to Aspirin in the y_data.  Yet when I use sqldf to join
> > the two df's I get NA values in the columns from the y_data in z for item
> > code 842752. Is there something wrong with my sqldf code, or something
> > wrong with the way I have input the data frames?
> >
> > Thanks for taking the time to help me,
> > Natalie
> >
> >
> >
> > On Fri, Mar 12, 2010 at 5:42 PM, David Winsemius <[email protected]> wrote:
> >
> >> If I assign the file input to y_data and change your sqldf to
> >>
> >> > z <- sqldf("select * from x_data left join y_data using (item_code)"); z
> >>
> >> I can replicate your result. Even after changing the types of the two
> >> item_code fields to match I still get the same result, and when I check to
> >> what degree they share values I get:
> >>
> >> > sum(x_data$item_code %in% y_data$item_code)
> >> [1] 2
> >> > sum(y_data$item_code %in% x_data$item_code)
> >> [1] 2
> >>
> >>
> >> So why are you so sure they are "complete", as you claimed in your first
> >> email?
> >>
> >> --
> >> David.
> >>
> >>
> >>
> >> On Mar 12, 2010, at 12:29 PM, David Winsemius wrote:
> >>
> >>> You have now given two different assignments to x_data and none to y_data.
> >>>
> >>> The str() output from the file access offering:
> >>>
> >>> > str(x_data)
> >>> 'data.frame':   2848 obs. of  5 variables:
> >>> $ item_code       : int  100 110 150 160 161 164 200 210 212 220 ...
> >>> $ name            : chr  "NEONACLEX K" "NEONACLEX" "MESORB" "ABSORBENT CELLULOSE MESO" ...
> >>> $ formulation_code: chr  "TABS" "TABS" "DRESS" "DRESS" ...
> >>> $ strength        : chr  "NULL" "5MG" "10CMX10CM" "10CMX10CM" ...
> >>> $ bnf_code        : chr  "2.2.8" "2.2.1" "20.3.1" "20.3.1" ...
> >>>
> >>> The str() output from the assignment in the dput offering:
> >>> > str(x_data)
> >>> 'data.frame':   10 obs. of  10 variables:
> >>> $ prochi         : chr  "CAO0000713" "CAO0000713" "CAO0000713" "CAO0000713" ...
> >>> $ prescribed_date: chr  "22/06/2001" "28/04/2000" "10/04/2000" "03/07/2000" ...
> >>> $ dataMonth      : chr  "NULL" "NULL" "NULL" "NULL" ...
> >>> $ item_code      : chr  "842752" "7800" "842652" "842652" ...
> >>> $ res_seqno      : chr  "NULL" "NULL" "NULL" "NULL" ...
> >>> $ quantity       : chr  "60" "100G" "60" "60" ...
> >>> $ directions     : chr  "1/D" "A/TD" "1/D" "1/D" ...
> >>> $ no_of_packs    : chr  "NULL" "NULL" "NULL" "NULL" ...
> >>> $ datasource     : chr  "TSF" "TSF" "TSF" "TSF" ...
> >>> $ scan_ref_no    : chr  "NULL" "NULL" "NULL" "NULL" ...
> >>>
> >>> This code "worked", but it is not clear that the x-y assignments were
> >>> correct:
> >>>
> >>> x_data <- read.table(
> >>>              file = "http://n4.nabble.com/file/n1590804/feb09_267_presc_items_tsf.txt",
> >>>              header = TRUE, sep = "|", quote = "\"'",
> >>>              dec = ".", as.is = TRUE, na.strings = "NA", colClasses = NA,
> >>>              nrows = 3864284,
> >>>              skip = 0, check.names = TRUE, fill = TRUE,
> >>>              strip.white = TRUE, blank.lines.skip = TRUE,
> >>>              comment.char = "#", allowEscapes = FALSE, flush = FALSE,
> >>>              fileEncoding = "", encoding = "unknown")
> >>>
> >>> --
> >>> David.
> >>>
> >>> On Mar 12, 2010, at 12:23 PM, Newbie19_02 wrote:
> >>>
> >>>
> >>>> The y_data file has over 9000 rows in it so I thought it would be more
> >>>> practical to give you the file to download....
> >>>>
> >>>> ______________________________________________
> >>>> [email protected] mailing list
> >>>> https://stat.ethz.ch/mailman/listinfo/r-help
> >>>> PLEASE do read the posting guide
> >>>> http://www.R-project.org/posting-guide.html
> >>>> and provide commented, minimal, self-contained, reproducible code.
> >>>>
> >>>
> >>> David Winsemius, MD
> >>> West Hartford, CT
> >>>
> >>>
> >>
> >> David Winsemius, MD
> >> West Hartford, CT
> >>
> >>
> >
> >
>
