(or better, a.dtype.hasobject)
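
For context, `dtype.hasobject` is the stricter check: it also flags structured dtypes that merely *contain* object fields, which a `kind != 'O'` test misses. A quick NumPy sketch of the difference:

```python
import numpy as np

# Plain object array: both checks catch it.
a = np.array(["foo", "bar"], dtype=object)
print(a.dtype.kind)       # 'O'
print(a.dtype.hasobject)  # True

# Structured dtype with an embedded object field: kind is 'V' (void),
# so a `kind != 'O'` test would wrongly allow memmapping,
# while `hasobject` still flags it as non-memmappable.
b = np.zeros(2, dtype=[("x", float), ("y", object)])
print(b.dtype.kind)       # 'V'
print(b.dtype.hasobject)  # True
```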

On 19 August 2014 16:59, Joel Nothman <[email protected]> wrote:

> You can also modify that line in sklearn/externals/joblib/pool.py in your
> local copy of scikit-learn to include an additional condition:
> and a.dtype.kind != 'O'
>
>
> On 19 August 2014 16:55, Joel Nothman <[email protected]> wrote:
>
>> Oh well. I'm not a very experienced monkey-patcher. There may be a better
>> way to do it (make sure you apply the monkey patch before importing any
>> other scikit-learn modules).
>>
>>
>> On 19 August 2014 16:52, Anders Aagaard <[email protected]> wrote:
>>
>>> It does work with 1 job.
>>>
>>> I tried your monkey patch:
>>> # joblib.Parallel
>>> functools.partial(<class 'sklearn.externals.joblib.parallel.Parallel'>,
>>> max_nbytes=None)
>>>
>>> I still get the same error though.
>>>
>>>
>>>
>>> On Tue, Aug 19, 2014 at 8:19 AM, Joel Nothman <[email protected]>
>>> wrote:
>>>
>>>> I suspect this is a bug in joblib, and that you won't get it with
>>>> n_jobs=1. Joblib employs memmap for inter-process communication if the
>>>> array is larger than a fixed size:
>>>> https://github.com/joblib/joblib/blob/master/joblib/pool.py#L203. It
>>>> seems it needs another criterion to ensure that the data is indeed
>>>> memmappable.
>>>>
>>>> You could monkey-patch joblib's Parallel to be constructed with
>>>> max_nbytes=None to disable memmapping (untested):
>>>>
>>>> from sklearn.externals import joblib
>>>> from functools import partial
>>>> joblib.Parallel = partial(joblib.Parallel, max_nbytes=None)
>>>> # now import other scikit-learn modules...
>>>>
>>>>
>>>> Issue at https://github.com/joblib/joblib/issues/162
>>>>
>>>>
>>>> On 19 August 2014 05:05, Anders Aagaard <[email protected]> wrote:
>>>>
>>>>> Hi
>>>>>
>>>>> I've got a reasonably large dataset I'm trying to do a grid search on.
>>>>> If I feed in a subset of it, it works fine, but if I feed in the entire
>>>>> file it dies with: "Array can't be memory-mapped: Python objects in
>>>>> dtype.". Now I realize what that's telling me, but I seem to remember
>>>>> building pipelines with a CountVectorizer in them a ton of times, and
>>>>> feeding datasets with columns of strings to my grid searches' fit
>>>>> methods. Also, why would this work on a small file, but not a large one?
>>>>>
>>>>> I stuck a fake classifier in the top of my pipeline with some print
>>>>> statements to find out if it was my pipeline that was causing it, but I
>>>>> never get there. So it seems to be before any of the input data is passed
>>>>> to my pipeline.
>>>>>
>>>>> Backtrace : https://gist.github.com/andaag/f8e4c3df2e41fcc1f84f
>>>>>
>>>>> Anyone have any ideas what's going on? This is on scikit-learn 0.15.1.
>>>>> The dtypes are identical on the large file and the smaller one.
>>>>>
>>>>> --
>>>>> Best regards
>>>>> Anders Aagaard
>>>>>
>>>>>
>>>>> ------------------------------------------------------------------------------
>>>>>
>>>>> _______________________________________________
>>>>> Scikit-learn-general mailing list
>>>>> [email protected]
>>>>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Best regards
>>> Anders Aagaard
>>>
>>>
>>>
>>
>
