>>>>> C W <tmrs...@gmail.com> >>>>> on Fri, 20 Oct 2017 16:01:06 -0400 writes:
> Subsetting using [] vs. head(), gives different results.
> R code:

>> head(train$data, 5)
> [1] 0 0 1 0 0

The above is surprising ... and points to a bug somewhere.
It is different (and correct) after you do

    require(Matrix)

but I think something like that should happen semi-automatically.

As I just see, it is even worse if you get the data from xgboost
without loading the xgboost package, which you can do (and is
also more efficient !):

If you start R, and then do

    data(agaricus.train, package='xgboost')

    loadedNamespaces() # does not contain "xgboost" nor "Matrix"

so, no wonder

    head(agaricus.train $ data)

does not find head()s "Matrix" method [which _is_ exported by Matrix
via exportMethods(.)].
But even more curiously, even after I do

    loadNamespace("Matrix")

methods(head) now does show the "Matrix" method,
but then head() *still* does not call it.  There's a bug somewhere
and I suspect it's in R's data() or methods package or ?? rather
than in 'Matrix'.
But that will be another thread on R-devel or R's bugzilla.

Martin

>> train$data[1:5, 1:5]
> 5 x 5 sparse Matrix of class "dgCMatrix"
>      cap-shape=bell cap-shape=conical cap-shape=convex
> [1,]              .                 .                1
> [2,]              .                 .                1
> [3,]              1                 .                .
> [4,]              .                 .                1
> [5,]              .                 .                1
>      cap-shape=flat cap-shape=knobbed
> [1,]              .                 .
> [2,]              .                 .
> [3,]              .                 .
> [4,]              .                 .
> [5,]              .                 .

> On Fri, Oct 20, 2017 at 3:51 PM, C W <tmrs...@gmail.com> wrote:

>> Thank you for your responses.
>>
>> I guess I don't feel alone.  I don't find the documentation go into any
>> detail.
>>
>> I also find it surprising that,
>>
>> > object.size(train$data)
>> 1730904 bytes
>>
>> > object.size(as.matrix(train$data))
>> 6575016 bytes
>>
>> the dgCMatrix actually takes less memory, though it *looks* like the
>> opposite.
>>
>> Cheers!
>>
>> On Fri, Oct 20, 2017 at 3:22 PM, David Winsemius <dwinsem...@comcast.net>
>> wrote:
>>
>>>
>>> > On Oct 20, 2017, at 11:11 AM, C W <tmrs...@gmail.com> wrote:
>>> >
>>> > Dear R list,
>>> >
>>> > I came across dgCMatrix.
>>> > I believe this class is associated with sparse matrix.
>>>
>>> Yes. See:
>>>
>>>     help('dgCMatrix-class', pack=Matrix)
>>>
>>> If Martin Maechler happens to respond to this you should listen to him
>>> rather than anything I write. Much of what the Matrix package does appears
>>> to be magical to one such as I.
>>>
>>> >
>>> > I see there are 8 attributes to train$data, I am confused why are there so
>>> > many, some are vectors, what do they do?
>>> >
>>> > Here's the R code:
>>> >
>>> > library(xgboost)
>>> > data(agaricus.train, package='xgboost')
>>> > data(agaricus.test, package='xgboost')
>>> > train <- agaricus.train
>>> > test <- agaricus.test
>>> > attributes(train$data)
>>> >
>>>
>>> I got a bit of an annoying surprise when I did something similar. It
>>> appeared to me that I did not need to load the xgboost library since all
>>> that was being asked was "where is the data" in an object that should be
>>> loaded from that library using the `data` function. The last command asking
>>> for the attributes filled up my console with a 100K length vector (actually
>>> 2 of such vectors). The `str` function returns a more useful result.
>>>
>>> > data(agaricus.train, package='xgboost')
>>> > train <- agaricus.train
>>> > names( attributes(train$data) )
>>> [1] "i"        "p"        "Dim"      "Dimnames" "x"        "factors"  "class"
>>> > str(train$data)
>>> Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
>>>   ..@ i       : int [1:143286] 2 6 8 11 18 20 21 24 28 32 ...
>>>   ..@ p       : int [1:127] 0 369 372 3306 5845 6489 6513 8380 8384 10991 ...
>>>   ..@ Dim     : int [1:2] 6513 126
>>>   ..@ Dimnames:List of 2
>>>   .. ..$ : NULL
>>>   .. ..$ : chr [1:126] "cap-shape=bell" "cap-shape=conical" "cap-shape=convex" "cap-shape=flat" ...
>>>   ..@ x       : num [1:143286] 1 1 1 1 1 1 1 1 1 1 ...
>>>   ..@ factors : list()
>>>
>>> > Where is the data, is it in $p, $i, or $x?
>>>
>>> So the "data" (meaning the values of the sparse matrix) are in the @x leaf.
>>> The values all appear to be the number 1. The @i leaf is the sequence
>>> of row locations for the values entries while the @p items are somehow
>>> connected with the columns (I think, since 127 and 126=number of columns
>>> from the @Dim leaf are only off by 1).
>>>
>>> Doing this
>>>
>>> > colSums(as.matrix(train$data))
>>>        cap-shape=bell    cap-shape=conical
>>>                   369                    3
>>>      cap-shape=convex       cap-shape=flat
>>>                  2934                 2539
>>>     cap-shape=knobbed     cap-shape=sunken
>>>                   644                   24
>>>   cap-surface=fibrous  cap-surface=grooves
>>>                  1867                    4
>>>     cap-surface=scaly   cap-surface=smooth
>>>                  2607                 2035
>>>       cap-color=brown       cap-color=buff
>>>                  1816
>>> # now snipping the rest of that output.
>>>
>>> Now this makes me think that the @p vector gives you the cumulative sum
>>> of number of items per column:
>>>
>>> > all( cumsum( colSums(as.matrix(train$data)) ) == train$data@p[-1] )
>>> [1] TRUE
>>>
>>> >
>>> > Thank you very much!
>>> >
>>> > [[alternative HTML version deleted]]
>>>
>>> Please read the Posting Guide. Your code was not mangled in this
>>> instance, but HTML code often arrives in an unreadable mess.
>>>
>>> >
>>> > ______________________________________________
>>> > R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>> > https://stat.ethz.ch/mailman/listinfo/r-help
>>> > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>> > and provide commented, minimal, self-contained, reproducible code.
>>>
>>> David Winsemius
>>> Alameda, CA, USA
>>>
>>> 'Any technology distinguishable from magic is insufficiently advanced.'
>>> -Gehm's Corollary to Clarke's Third Law
>>>
>>

> [[alternative HTML version deleted]]

> ______________________________________________
> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
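
The question in the thread about what @i, @p and @x hold can be checked on a matrix small enough to read by eye. The sketch below is only an illustration: the 4 x 3 matrix and the names m, nnz_per_col and triplets are made up here and are not the agaricus data. The stored values are deliberately not all 1, to separate "number of stored entries per column" from "column sum".

```r
library(Matrix)

# Hypothetical toy matrix (not the agaricus data), given as
# (row, column, value) triplets; sparseMatrix() returns a dgCMatrix.
m <- sparseMatrix(i = c(1, 3, 2, 4, 4),
                  j = c(1, 1, 2, 2, 3),
                  x = c(5, 7, 2, 9, 4),
                  dims = c(4, 3))

m@i   # zero-based row index of each stored entry, column by column
m@p   # column pointers: ncol + 1 values, starting at 0
m@x   # the stored (non-zero) values, in the same order as @i

# The number of stored entries in each column is the first difference of @p.
nnz_per_col <- diff(m@p)

# Rebuild one-based (row, col, value) triplets from the three slots.
triplets <- data.frame(row   = m@i + 1L,
                       col   = rep(seq_len(ncol(m)), times = nnz_per_col),
                       value = m@x)
print(triplets)

# @p accumulates counts of stored entries, not column sums.
all(cumsum(nnz_per_col) == m@p[-1])   # TRUE by construction
all(cumsum(colSums(m)) == m@p[-1])    # FALSE here, since values are not all 1
```

The check quoted above, cumsum(colSums(...)) == @p[-1], comes out TRUE for train$data because every stored value there is 1, so column sums and per-column counts coincide. In general it is diff(@p) that gives the per-column count.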
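
The head() versus [] difference and the object.size() comparison discussed in this thread can be reproduced without xgboost. This is a rough sketch under stated assumptions: the 1000 x 50 matrix and the names dense and sp are invented stand-ins for train$data, and Matrix is attached with library() first, which is the situation Martin describes as giving the correct result. It does not try to reproduce the dispatch problem seen when Matrix is not attached.

```r
library(Matrix)   # attach Matrix so its head() method is found

set.seed(1)
# Hypothetical stand-in for train$data: 1000 x 50, about 5% of cells set to 1.
dense <- matrix(0, nrow = 1000, ncol = 50)
dense[sample(length(dense), 2500)] <- 1
sp <- Matrix(dense, sparse = TRUE)

# With Matrix attached, head() returns the first rows as a sparse matrix,
# in line with what [ gives.
h <- head(sp, 5)
dim(h)            # 5 rows, 50 columns
sp[1:5, 1:5]

# Only the non-zero entries are stored, so the sparse object is smaller
# even though its printed form looks larger.
object.size(sp)
object.size(dense)
```

If head() on such an object returns a plain vector such as 0 0 1 0 0, that suggests the Matrix methods were not in effect in that session. str() or [ are safer ways to inspect the object in that case.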