(or better, a.dtype.hasobject)
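The two checks discussed in the thread agree for plain object arrays, but `dtype.hasobject` is the stricter test: a structured dtype containing an object field has kind `'V'`, so the `kind != 'O'` check would miss it. A quick NumPy sketch (not part of the original thread) illustrating the difference:

```python
import numpy as np

a = np.array([1.0, 2.0])                # plain numeric array
b = np.array(["x", "y"], dtype=object)  # object array
# structured dtype with an object field: kind is 'V', not 'O',
# so the kind check misses it while hasobject still catches it
c = np.zeros(2, dtype=[("label", object), ("value", "f8")])

print(a.dtype.kind, a.dtype.hasobject)  # f False
print(b.dtype.kind, b.dtype.hasobject)  # O True
print(c.dtype.kind, c.dtype.hasobject)  # V True
```

Any array whose dtype reports `hasobject` as True holds Python object pointers and cannot be memory-mapped, which is exactly the criterion the memmapping code needs.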
On 19 August 2014 16:59, Joel Nothman <[email protected]> wrote:

> You can also modify that line in sklearn/externals/joblib/pool.py in your
> local copy of scikit-learn to include an additional condition:
>
> and a.dtype.kind != 'O'
>
> On 19 August 2014 16:55, Joel Nothman <[email protected]> wrote:
>
>> Oh well. I'm not a very experienced monkey-patcher. There may be a better
>> way to do it (make sure you apply the monkey patch before importing any
>> other scikit-learn modules).
>>
>> On 19 August 2014 16:52, Anders Aagaard <[email protected]> wrote:
>>
>>> It does work with 1 job.
>>>
>>> I tried your monkey patch:
>>>
>>> # joblib.Parallel
>>> functools.partial(<class 'sklearn.externals.joblib.parallel.Parallel'>,
>>> max_nbytes=None)
>>>
>>> I still get the same error though.
>>>
>>> On Tue, Aug 19, 2014 at 8:19 AM, Joel Nothman <[email protected]>
>>> wrote:
>>>
>>>> I suspect this is a bug in joblib, and that you won't get it with
>>>> n_jobs=1. Joblib employs memmap for inter-process communication if the
>>>> array is larger than a fixed size:
>>>> https://github.com/joblib/joblib/blob/master/joblib/pool.py#L203. It
>>>> seems it needs another criterion to ensure that the data is indeed
>>>> memmappable.
>>>>
>>>> You could monkey-patch joblib's Parallel to be constructed with
>>>> max_nbytes=None to disable memmapping (untested):
>>>>
>>>> from sklearn.externals import joblib
>>>> from functools import partial
>>>> joblib.Parallel = partial(joblib.Parallel, max_nbytes=None)
>>>> # now import other scikit-learn modules...
>>>>
>>>> Issue at https://github.com/joblib/joblib/issues/162
>>>>
>>>> On 19 August 2014 05:05, Anders Aagaard <[email protected]> wrote:
>>>>
>>>>> Hi
>>>>>
>>>>> I've got a reasonably large dataset I'm trying to do a grid search on.
>>>>> If I feed in a subset of it, it works fine, but if I feed in the
>>>>> entire file it dies with: "Array can't be memory-mapped: Python
>>>>> objects in dtype."
>>>>>
>>>>> Now I realize what that's telling me, but I seem to remember building
>>>>> pipelines with a CountVectorizer in them plenty of times, and feeding
>>>>> datasets with columns of strings to my grid searches' fit methods.
>>>>> Also, why would this work on a small file but not a large one?
>>>>>
>>>>> I stuck a fake classifier at the top of my pipeline with some print
>>>>> statements to find out if it was my pipeline that was causing it, but
>>>>> I never get there. So it seems to happen before any of the input data
>>>>> is passed to my pipeline.
>>>>>
>>>>> Backtrace: https://gist.github.com/andaag/f8e4c3df2e41fcc1f84f
>>>>>
>>>>> Anyone have any ideas what's going on? This is on scikit-learn 0.15.1.
>>>>> The dtypes are identical on the large file and the smaller one.
>>>>>
>>>>> --
>>>>> Best regards
>>>>> Anders Aagaard
>>>
>>> --
>>> Best regards
>>> Anders Aagaard
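Joel's suggested monkey patch can be sketched with a stand-in class, since the real target (`sklearn.externals.joblib.parallel.Parallel`) is not imported here and its full signature is an assumption. This only illustrates the `functools.partial` rebinding pattern and the intended effect of forcing `max_nbytes=None`:

```python
from functools import partial

# Hypothetical stand-in for sklearn.externals.joblib.parallel.Parallel;
# the real class defaults max_nbytes to a threshold above which arrays
# passed to workers are memmapped.
class Parallel:
    def __init__(self, n_jobs=1, max_nbytes="1M"):
        self.n_jobs = n_jobs
        self.max_nbytes = max_nbytes

# Rebind the name so every later construction forces max_nbytes=None,
# which is what disables memmapping entirely.
Parallel = partial(Parallel, max_nbytes=None)

p = Parallel(n_jobs=2)
print(p.n_jobs, p.max_nbytes)  # 2 None
```

Note this only affects code that looks the name up *after* the rebinding, which is why Joel stresses applying the patch before importing any other scikit-learn modules: modules that have already done `from joblib import Parallel` keep a reference to the original class.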
------------------------------------------------------------------------------
_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
