[Rd] spss long labels

2008-07-09 Thread Kurt Van Dijck

Hi all,

I have received no feedback about merging this patch into the source tree.
Am I supposed to do this myself? If so, how (do I have Subversion commit
access)? Is this patch acceptable at all? Is it being tested?

I did get some personal replies to my post, which suggests there is general
interest in getting long labels imported properly from SPSS files.



Kurt Van Dijck wrote:

Hi,

A frequently seen issue with importing SPSS data files is that R does
not import the 'long variable names'.
I built a patch for the R project's foreign package that imports
the 'long variable names' from SPSS (record 7, subtype 13).
To complete the job, I had to expand the "struct variable" definition
to hold 64 + 1 characters. I'm not aware of any side effects.
The sfm-read.c code works fine.
I didn't test on a variety of platforms, as I don't know what is
regarded as sufficient testing. Anyway, I don't expect major trouble there
(no byte-swapping problems, no 32<->64 bit issues), as it's mainly
character processing.

The patch is relative to the foreign directory. It was created
against the trunk of R-project yesterday.

We would appreciate it if you merged this patch into the main tree.

Kind regards,

Kurt Van Dijck (C programmer) & Ilse Laurijssen (R user)
Belgium
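
For readers who want to try the parsing step outside of R, here is a minimal
standalone sketch of what the patch below does with the record contents: it
splits the buffer on tabs, splits each entry on '=', and renames the matching
variable. The names apply_long_names, struct var and NAME_LEN are made up for
this illustration and are not part of the foreign package. The sketch also
uses snprintf rather than strncpy, so the stored name is always
null-terminated; that is a choice made here, not what the patch does.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define NAME_LEN 64             /* 64 + 1 characters, as in the message above */

/* Hypothetical stand-in for the dictionary's "struct variable". */
struct var {
    char name[NAME_LEN + 1];
};

/* Apply a record of the form "SHORT1=Long1\tSHORT2=Long2" to vars[0..nvar).
   The record buffer is modified in place (separators become '\0').
   Returns the number of variables that were renamed. */
static int apply_long_names(char *record, struct var *vars, int nvar)
{
    int renamed = 0;
    char *p = record;

    assert(record != NULL && vars != NULL);

    while (p != NULL) {
        char *next = strchr(p, '\t');   /* end of this key=value entry */
        char *val;
        int i;

        if (next != NULL)
            *next++ = '\0';

        val = strchr(p, '=');
        if (val != NULL) {
            *val++ = '\0';              /* now p is the key, val the long name */
            for (i = 0; i < nvar; i++) {
                if (strcmp(vars[i].name, p) == 0) {
                    /* snprintf truncates if needed and always terminates */
                    snprintf(vars[i].name, sizeof vars[i].name, "%s", val);
                    renamed++;
                    break;
                }
            }
        }
        p = next;                       /* NULL after the last entry */
    }
    return renamed;
}
```

Entries whose short name matches no variable are silently skipped in this
sketch, whereas the patch issues a warning for them.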

Index: src/sfm-read.c
===
--- src/sfm-read.c (revision 5168)
+++ src/sfm-read.c (working copy)
@@ -188,6 +188,8 @@
 static int read_variables (struct file_handle * h, struct variable *** var_by_index);
 static int read_machine_int32_info (struct file_handle * h, int size, int count, int *encoding);
 static int read_machine_flt64_info (struct file_handle * h, int size, int count);
+static int read_long_var_names (struct file_handle * h, struct dictionary *
+, unsigned long size, unsigned int count);
 static int read_documents (struct file_handle * h);

 /* Displays the message X with corrupt_msg, then jumps to the lossage
@@ -418,11 +420,15 @@
       break;

     case 7: /* Multiple-response sets (later versions of SPSS). */
-    case 13: /* long variable names. PSPP now has code for these
-                that could be ported if someone is interested. */
       skip = 1;
       break;

+    case 13: /* long variable names. PSPP now has code for these
+                that could be ported if someone is interested. */
+      if (!read_long_var_names(h, ext->dict, data.size, data.count))
+        goto lossage;
+      break;
+
     case 16: /* See http://www.nabble.com/problem-loading-SPSS-15.0-save-files-t2726500.html */
       skip = 1;
       break;
@@ -584,14 +590,72 @@
   return 0;
 }

+/* Read record type 7, subtype 13.
+ * long variable names
+ */
 static int
+read_long_var_names (struct file_handle * h, struct dictionary * dict
+, unsigned long size, unsigned int count)
+{
+  char * data;
+  unsigned int j;
+  struct variable ** lp;
+  struct variable ** end;
+  char * p;
+  char * endp;
+  char * val;
+  if ((1 != size)||(0 == count)) {
+    warning("%s: strange record info seen, size=%u, count=%u"
+      ", ignoring long variable names"
+      , h->fn, size, count);
+    return 0;
+  }
+  size *= count;
+  data = Calloc (size +1, char);
+  bufread(h, data, size, 0);
+  /* parse */
+  end = &dict->var[dict->nvar];
+  p = data;
+  do {
+    if (0 != (endp = strchr(p, '\t')))
+      *endp = 0; /* put null terminator */
+    if (0 == (val = strchr(p, '='))) {
+      warning("%s: no long variable name for variable '%s'", h->fn, p);
+    } else {
+      *val = 0;
+      ++val;
+      /* now, p is key, val is long name */
+      for (lp = dict->var; lp < end; ++lp) {
+        if (!strcmp(lp[0]->name, p)) {
+          strncpy(lp[0]->name, val, sizeof(lp[0]->name));
+          break;
+        }
+      }
+      if (lp >= end) {
+        warning("%s: long variable name mapping '%s' to '%s'"
+          "for variable which does not exist"
+          , h->fn, p, val);
+      }
+    }
+    p = &endp[1]; /* put to next */
+  } while (endp);
+
+  free(data);
+  return 1;
+
+lossage:
+  free(data);
+  return 0;
+}
+
+static int
 read_header (struct file_handle * h, struct sfm_read_info * inf)
 {
   struct sfm_fhuser_ext *ext = h->ext; /* File extension strcut. */
   struct sysfile_header hdr; /* Disk buffer. */
   struct dictionary *dict; /* File dictionary. */
   char prod_name[sizeof hdr.prod_name + 1]; /* Buffer for product name. */
-  int skip_amt = 0;/* Amount of product name to omit. */
+  int skip_amt = 0;/* Amount of product name to omit. */
   int i;

   /* Create the dictionary. */
@@ -1495,7 +1559,7 @@
 /* Reads one case from system file H into the value array PERM
according to the instructions given in associated dictionary DICT,
which must have the get.* elements appropriately set.  Returns
-   nonzero only if successful.  */
+   n

[Rd] memory leak in sub("[range]",...)

2008-07-09 Thread Bill Dunlap
There is a 2-block memory leak in sub() (and probably in any other
regex-related function) when the pattern argument involves a range
expression, e.g., '[0-9]'.

% R --debugger=valgrind --debugger-args=--leak-check=full --vanilla
==14519== Memcheck, a memory error detector.
==14519== Copyright (C) 2002-2006, and GNU GPL'd, by Julian Seward et al.
==14519== Using LibVEX rev 1658, a library for dynamic binary translation.
==14519== Copyright (C) 2004-2006, and GNU GPL'd, by OpenWorks LLP.
==14519== Using valgrind-3.2.1, a dynamic binary instrumentation framework.
==14519== Copyright (C) 2000-2006, and GNU GPL'd, by Julian Seward et al.
==14519== For more details, rerun with: -v
==14519==

R version 2.8.0 Under development (unstable) (2008-07-07 r46046)
...
> for(i in 1:1000)sub("[a-c]","+","0abcd")
> q()
==32503==
==32503== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 40 from 2)
==32503== malloc/free: in use at exit: 12,603,409 bytes in 7,915 blocks.
==32503== malloc/free: 61,973 allocs, 54,058 frees, 54,494,371 bytes allocated.
==32503== For counts of detected errors, rerun with: -v
==32503== searching for pointers to 7,915 not-freed blocks.
==32503== checked 12,616,568 bytes.
==32503==
==32503== 4 bytes in 1 blocks are possibly lost in loss record 1 of 45
==32503==    at 0x40046EE: malloc (vg_replace_malloc.c:149)
==32503==    by 0x4005B9A: realloc (vg_replace_malloc.c:306)
==32503==    by 0x80A5F92: parse_expression (regex.c:5202)
==32503==    by 0x80A614F: parse_branch (regex.c:4707)
==32503==    by 0x80A621A: parse_reg_exp (regex.c:4666)
==32503==    by 0x80A6618: Rf_regcomp (regex.c:4635)
==32503==    by 0x8110CB4: do_gsub (character.c:1355)
==32503==    by 0x80654A4: do_internal (names.c:1135)
==32503==    by 0x815F0EB: Rf_eval (eval.c:461)
==32503==    by 0x8160DA7: do_begin (eval.c:1174)
==32503==    by 0x815F0EB: Rf_eval (eval.c:461)
==32503==    by 0x8162210: Rf_applyClosure (eval.c:667)
==32503==
... ignore 85 byte/4 block leak in readline ...
==32503== 7,980 bytes in 1,995 blocks are definitely lost in loss record 36 of 45
==32503==    at 0x40046EE: malloc (vg_replace_malloc.c:149)
==32503==    by 0x4005B9A: realloc (vg_replace_malloc.c:306)
==32503==    by 0x80A5F92: parse_expression (regex.c:5202)
==32503==    by 0x80A614F: parse_branch (regex.c:4707)
==32503==    by 0x80A621A: parse_reg_exp (regex.c:4666)
==32503==    by 0x80A6618: Rf_regcomp (regex.c:4635)
==32503==    by 0x8110CB4: do_gsub (character.c:1355)
==32503==    by 0x80654A4: do_internal (names.c:1135)
==32503==    by 0x815F0EB: Rf_eval (eval.c:461)
==32503==    by 0x8160DA7: do_begin (eval.c:1174)
==32503==    by 0x815F0EB: Rf_eval (eval.c:461)
==32503==    by 0x8162210: Rf_applyClosure (eval.c:667)

The leaked blocks are allocated in internal_function build_range_exp() at
   5200 /* Use realloc since mbcset->range_starts and mbcset->range_ends
   5201    are NULL if *range_alloc == 0.  */
   5202 new_array_start = re_realloc (mbcset->range_starts, wchar_t,
   5203                               new_nranges);
   5204 new_array_end = re_realloc (mbcset->range_ends, wchar_t,
   5205                             new_nranges);
...
   5210 mbcset->range_starts = new_array_start;
   5211 mbcset->range_ends = new_array_end;

This file, src/main/regex.c, contains a complicated mess of #ifdef's
but range_starts and range_ends are defined and appear to be used
whether or not _LIBC is defined.  However, they are only freed if _LIBC
is defined.  In my setup (Linux, gcc 3.4.5) _LIBC is not defined so
they don't get freed.

After the following change in free_charset() only the 85 byte/4 block
leak in readline remains.

Index: regex.c
===
--- regex.c (revision 46046)
+++ regex.c (working copy)
@@ -6240,9 +6240,9 @@
 # ifdef _LIBC
   re_free (cset->coll_syms);
   re_free (cset->equiv_classes);
+# endif
   re_free (cset->range_starts);
   re_free (cset->range_ends);
-# endif
   re_free (cset->char_classes);
   re_free (cset);
 }
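
The pattern behind the leak can be reproduced in isolation. The following is
only a sketch, not code from regex.c: struct charset, the counting helpers and
the macro HYPOTHETICAL_LIBC (standing in for _LIBC, and left undefined here)
are invented for the illustration. Two fields are always allocated, but the
guarded destructor frees them only inside the #ifdef, so each object leaks two
blocks, matching the 2-block leak reported above.

```c
#include <assert.h>
#include <stdlib.h>

static int live_blocks = 0;     /* blocks allocated and not yet freed */

static void *counted_alloc(size_t n)
{
    void *p = malloc(n);
    assert(p != NULL);          /* keep the sketch simple: no error path */
    live_blocks++;
    return p;
}

static void counted_free(void *p)
{
    if (p != NULL) {
        live_blocks--;
        free(p);
    }
}

/* Hypothetical, much reduced stand-in for the character-set structure. */
struct charset {
    int  *range_starts;
    int  *range_ends;
    char *char_classes;
};

/* All three fields are allocated regardless of any macro. */
static struct charset *make_charset(void)
{
    struct charset *cs = counted_alloc(sizeof *cs);
    cs->range_starts = counted_alloc(sizeof *cs->range_starts);
    cs->range_ends   = counted_alloc(sizeof *cs->range_ends);
    cs->char_classes = counted_alloc(1);
    return cs;
}

/* Before the change: the range arrays are freed only under the macro. */
static void free_charset_guarded(struct charset *cs)
{
#ifdef HYPOTHETICAL_LIBC
    counted_free(cs->range_starts);
    counted_free(cs->range_ends);
#endif
    counted_free(cs->char_classes);
    counted_free(cs);
}

/* After the change: the range arrays are freed unconditionally. */
static void free_charset_fixed(struct charset *cs)
{
    counted_free(cs->range_starts);
    counted_free(cs->range_ends);
    counted_free(cs->char_classes);
    counted_free(cs);
}
```

Creating a charset and releasing it with free_charset_guarded() leaves
live_blocks at 2, while free_charset_fixed() brings the count back to where it
started.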

[This report may be a duplicate: I tried submitting it via the form in
http://bugs.r-project.org/cgi-bin/R, but I cannot find it there now.]


Bill Dunlap
Insightful Corporation
bill at insightful dot com

 "All statements in this message represent the opinions of the author and do
 not necessarily reflect Insightful Corporation policy or position."

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel