On Dec 12, 6:15 pm, "Mark Engelberg" <[email protected]> wrote:
> On Fri, Dec 12, 2008 at 6:37 AM, Rich Hickey <[email protected]> wrote:
> > I appreciate the time you and others have spent on this, and will
> > improve filter, but I'm not sure where you are getting your
> > presumptions about lazy sequences. They are not a magic bullet that
> > makes working with data bigger than memory transparent. They do make
> > it possible in some cases, but that is not their intent. Latent
> > runtime memory consumption is a fact of life of laziness, and is
> > present in Haskell etc. I can make it go away for the specific case of
> > filter and the rest of the library, but not generally.
>
> I think this is a great discussion to have, so let me see if I can
> articulate my thoughts a little better.
>
> First, let's talk about laziness, because I think we're in agreement
> here. Laziness is undoubtedly a very powerful tool. In Haskell,
> laziness is the default. Every single function call is automatically
> lazy. In LISP/Scheme, laziness is optional. It's something you have
> to specifically opt into using delay and force. Which is better?
>
> Peter Van Roy, the designer of the Oz programming language, has argued
> in many places that laziness is not a good default for a
> general-purpose programming language. This is also a point he
> discusses in his excellent book "Concepts, Techniques, and Models of
> Computer Programming". He has probably explained this point better
> than I can, so I won't belabor it here, but in a nutshell, it all
> comes down to the fact that laziness makes the performance (especially
> space performance) of your program much harder to analyze. It is all
> too easy to write a Haskell program that blows up or performs poorly,
> and you have no idea why. You can get back some performance by
> opting-out of the laziness with strictness annotations, but it can be
> difficult to figure out how or where to do this. To put it simply,
> it's easier to write high-performing code when you opt-in to laziness
> when you need it, rather than trying to figure out how to opt out when
> it's wrecking things for you.
>
> I assume that you agree with this, since you have chosen the explicit
> force/delay "opt-in laziness" model for Clojure.
>
> So now let's talk about sequences, specifically the behavior of
> lazily-generated sequences. A similar choice needs to be considered.
> It is easy to imagine two possible implementations of lazy-cons.
> lazy-cons-cached would work exactly as lazy-cons currently does.
> lazy-cons-uncached would be similar, but would not cache its first and
> rest values when computed. Sequences built with either version of
> lazy-cons would both respond to the same sequence interface, and would
> therefore behave the same, producing the same results as long as the
> generating functions don't refer to mutable data, but possibly with
> different performance profiles. lazy-cons-cached sequences might run
> faster when traversing the same lazy list more than once, but also
> might crash your program by consuming too much memory in places where
> lazy-cons-uncached would not. It's a tradeoff, and both versions of
> lazy-cons are potentially useful, but which should be the default, or
> more specifically, which should be the default used by all of
> Clojure's library functions like map, filter, etc.? Should these
> lazily-generated sequences be automatically cached, or not?
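The two variants described above can be sketched concretely. These are hypothetical Python stand-ins (not Clojure's actual classes): the cached cell forces its rest thunk at most once and remembers the result, while the uncached cell re-runs the thunk on every traversal.

```python
class LazyConsCached:
    """Hypothetical cached lazy cons: the rest thunk runs at most once."""
    def __init__(self, first, rest_fn):
        self.first = first
        self._rest_fn = rest_fn
        self._forced = False
        self._rest = None

    def rest(self):
        if not self._forced:
            self._rest = self._rest_fn()  # computed once...
            self._rest_fn = None          # ...then the thunk is dropped
            self._forced = True
        return self._rest                 # later calls hit the cache


class LazyConsUncached:
    """Hypothetical uncached lazy cons: the rest thunk runs every time."""
    def __init__(self, first, rest_fn):
        self.first = first
        self._rest_fn = rest_fn

    def rest(self):
        return self._rest_fn()            # recomputed on every call


# Count how often each variant evaluates its tail.
cached_calls, uncached_calls = [], []
cached = LazyConsCached(0, lambda: cached_calls.append(1) or "tail")
uncached = LazyConsUncached(0, lambda: uncached_calls.append(1) or "tail")

cached.rest(); cached.rest()      # thunk runs once, result cached
uncached.rest(); uncached.rest()  # thunk runs twice
```

Both cells answer `first` and `rest` identically, so code written against the sequence interface cannot tell them apart; only the evaluation counts (and the retained memory) differ.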
>
> The first question is, does it matter? I would say that yes, this
> design decision matters a great deal. I think I remember you saying
> somewhere (in a post here, or possibly in one of your presentations)
> that sequences, especially lazy sequences, have become a far more
> important part of Clojure programming than you envisioned at the
> outset. And it is easy to see why lazy sequences take on a
> significant role in a fairly pure functional programming language:
>
> People coming from imperative languages often ask how you code up
> certain kinds of for/while loops that are used to traverse data
> structures and build up new ones. Let's say you want to build a list
> of all the even squares of whole numbers up to 100, inclusive. One
> literal way to formulate this problem in an imperative language
> (taking Python 2.6 syntax as a sample) would be:
> l = []
> for i in xrange(101):
>     square = i * i
>     if square % 2 == 0:
>         l.append(square)
>
> Now in Clojure, things like this can be written using loop/recur. But
> any newcomer to Clojure will certainly be told that there's a more
> elegant way to express this concept:
>
> (def l (filter even? (map #(* % %) (range 101))))
>
> or the equivalent comprehension.
>
> The newcomer's first reaction is to say, "Whoa, but won't that have
> poor performance? You're generating an awful lot of big, intermediate
> lists." And the answer is usually, "Well, it's lazy, so no, in fact,
> this elegant form behaves like the iterative version. You can have
> your cake and eat it too."
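That equivalence can be checked directly. In Python 3 terms (where `map` and `filter` are themselves lazy iterators), the pipeline and the loop produce identical results without ever materializing an intermediate list of squares:

```python
# Imperative version: build the list of even squares up to 100 inclusive.
imperative = []
for i in range(101):
    square = i * i
    if square % 2 == 0:
        imperative.append(square)

# Lazy pipeline: in Python 3, map and filter yield elements on demand,
# so no intermediate collection of squares is ever built.
pipeline = filter(lambda s: s % 2 == 0, map(lambda i: i * i, range(101)))
pipeline_result = list(pipeline)
```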
>
> You said, "I'm not sure where you are getting your presumptions about
> lazy sequences. They are not a magic bullet that makes working with
> data bigger than memory transparent." Well, actually, I know that
> cached lazy sequences do not always behave like their iterative
> counterparts. But this is where these presumptions are born.
> Iterative stuff is ugly in functional languages. We are told to use
> comprehensions, map, filter, and tricks like the one from Stuart
> Holloway's blog post about writing an indexed function, which builds a
> temporary sequence of index/value pairs to avoid certain imperative
> patterns. And we
> expect the behavior to be similar, because most of the time it is.
> When something blows up from some subtle capturing of a lazy list, we
> feel frustrated and betrayed.
>
> Next, let's consider whether it is easier to opt-in or opt-out of
> caching your sequences. Opting out of sequence caching is very
> difficult. Right now, all the built-in sequence functions use a
> cached lazy-cons, and there is no way to undo that caching. The first
> line of defense is to make sure that you don't give an intermediate
> temporary name to any of the lazy lists you are transforming. In
> other words, you should never do something like:
>
> (def squares (map #(* % %) (range 100)))
> (def even-squares (filter even? squares))
>
> This kind of thing could crash your program with large lists. It's
> irritating to have to be careful about this, because it seems
> intuitive that giving a temporary name to something shouldn't be the
> kind of thing that could crash your program, but okay, let's say you
> learn to be really careful, and to carefully avoid naming your lists.
> But then, sometimes your lists get captured by closures, or other
> subtle corner cases, and it just gets frustrating. Using lazy
> sequences, which are pervasive in Clojure, should not be so difficult.
> Right now the only solution is to rewrite one's code to use
> loop/recur, and it sounds like eventually there will be an alternative
> "generators" system which you say you're envisioning more as an I/O
> tool with a special interface, rather than as a general-purpose form
> of sequences.
>
> On the other hand, opting in to sequence caching is very easy. You
> only gain the benefits of a cached sequence when you use it twice,
> which inherently means that you have to name it or bind it to
> something in order to refer to it multiple times. So if I'm going to
> be traversing a sequence multiple times, I just opt in by calling a
> function to turn it into a cached sequence. (Let's call it
> "cached-seq". I know there is such a function in the Clojure API
> already, but I haven't really verified that it works the way I mean
> here. So if I'm using the term in a different way from the way that
> cached-seq actually works in the current API, please bear with me for
> the sake of argument.)
>
> So with an opt-in system, where range, map and filter use
> lazy-cons-uncached, you could easily write something like:
> (def squares (map #(* % %) (range 100)))
> (def even-squares (cached-seq (filter even? squares)))
>
> This means that squares behaves exactly like its iterative
> counterpart, despite the fact that I named it, and even-squares I'm
> setting up with caching because I intend to use it repeatedly later in
> my program.
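In Python terms, a generator is the closest stand-in for an uncached sequence (though it is one-shot rather than re-traversable, so it illustrates only the no-retention half of the story): naming it is harmless, because a traversal caches nothing.

```python
# Naming this generator does not pin a hundred thousand squares in
# memory: elements are produced one at a time and dropped as soon as
# they are consumed.
squares = (i * i for i in range(10**5))

total = sum(squares)      # single traversal in O(1) memory
leftover = list(squares)  # generator is exhausted; nothing was cached
```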
>
> Is the opt-out or opt-in system more intuitive, and which is easier to
> analyze from a correctness and performance standpoint? I'd argue that
> the opt-out system has serious problems with respect to being
> intuitive, and ease of confirming correctness and performance bounds.
> Programmers really want these things (most of the time) to behave like
> the iterative loops they are replacing. It can be very subtle to
> detect the kinds of things that can cause an explosion of memory
> consumption. The filter-map explosion issue is a great case in point.
> You can test it on a wide variety of inputs, even big inputs, and it
> seems to work fine. But then when you pass it a combination of filter
> and map that results in a large "window" between filtered elements in
> the mapped list, it blows up. Even if you patch the specific behavior
> of filter, this is indicative of a larger problem. filter is
> currently written in the most natural way using lazy-cons. Other
> programmers are going to write similar functions using lazy-cons, and
> these programs will all be brittle and prone to unpredictable failure
> on certain kinds of large inputs.
>
> On the other hand, the opt-in system is fairly straightforward to
> understand. If a sequence is not explicitly cached, you can expect no
> memory to be consumed by a traversal, and the traversal time will be
> the same every time you execute it. Caching a sequence becomes
> explicit, and is then clearly identified in the code as such.
>
>
>
> > Not caching would lead to excessive recalculation in many contexts, and
> > then people would be complaining about performance.
>
> Let's talk about performance. First, I would claim that the vast
> majority of lazy sequences (especially the really big ones), are used
> as one-off temporary sequences to more elegantly express something
> that would be expressed via looping in an imperative language. So for
> these cases (which I believe to be the most common case), you gain
> nothing by caching, and in fact, you lose something (a small amount of
> increased memory consumption / garbage collection, and increased
> brittleness and unpredictability).
>
> Then there are some sequences which are traversed more than once, but
> the computation is fairly trivial (e.g., the output of range). In
> these situations, it's probably a wash. I remember reading a paper
> not long ago (can't remember exactly which one, sorry) that pointed
> out that most programmers' intuitions about the need to cache
> intermediate results are often wrong, and that simple computations should
> just be recomputed. This is because chips have gotten really fast,
> and one of the biggest performance hits these days is a cache miss, so
> if you store something, and it requires the program to go off and look
> in the "slow part of memory" to find it, you're much worse off than if
> you had just recomputed.
>
> Sometimes, if you know you are going to be traversing a sequence more
> than once, you would be better off converting it to an explicitly
> realized list or putting the data in a vector.
>
> So this leaves what I believe to be a small set of cases where you
> genuinely need a cached-but-lazy list. Yes, make this a possibility;
> I don't deny that it is useful. But I think your fears about people
> complaining about performance if you change the default behavior are
> unfounded. It's relatively easy to say to people, "If you need better
> performance when traversing a lazy sequence multiple times, you may
> benefit from explicitly realizing the result of the intermediate lazy
> computations, or using a cached lazy sequence if that's what you
> need." On the other hand, if you keep things as they are, I can
> pretty much guarantee that you will be faced with ongoing posts to the
> list of, "Help! I've got this program that is giving me an out of
> memory error, and I can't figure out why."
>
> One issue is that you'd want to make sure that cached-seq behaves
> intelligently when someone tries to cache something that's essentially
> already cached. That way people could more easily write generic code
> that safely calls cached-seq on various collections to guarantee
> certain kinds of time-performance bounds with repeat traversals. For
> example, calling cached-seq on a vector shouldn't really do anything.
> But there are already plenty of precedents and facilities in Clojure
> for handling this sort of thing. We already expect seq to "do the
> right thing" and not add extra layers of "sequifying" to something
> that already is a sequence. One idea is to have a meta tag of :cached
> that applies to things like vectors, sets, maps, lists, but not lazy
> lists built with the lazy-cons-uncached variant I've proposed in this
> post. cached-seq acts like an ordinary seq on things with the :cached
> tag. cached-seq of a lazy-cons-uncached cell would essentially just
> construct a lazy-cons-cached cell with the same first and rest
> function, but with the cache fields. In the general case, cached-seq
> would generate a lazy list built with lazy-cons-cached. Of course,
> lazy-cons-cached cells would be marked with a :cached meta tag, and
> would guarantee that the rest, when realized, is also made into a
> cached sequence.
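That proposal can be sketched roughly as follows. This is a hypothetical Python illustration (names and details are invented, not the actual Clojure cached-seq): already-realized collections pass through unchanged, while a one-shot source is wrapped so that each element is pulled at most once and remembered.

```python
def cached_seq(xs):
    # Already-realized collections (the ":cached" case) pass through
    # unchanged: caching them again would add nothing.
    if isinstance(xs, (list, tuple)):
        return xs

    cache = []
    it = iter(xs)

    class Cached:
        # Every traversal serves elements from the shared cache, pulling
        # from the underlying iterator only the first time an element is
        # needed (once-and-only-once evaluation).
        def __iter__(self):
            i = 0
            while True:
                if i == len(cache):
                    try:
                        cache.append(next(it))
                    except StopIteration:
                        return
                yield cache[i]
                i += 1

    return Cached()


squares = cached_seq(i * i for i in range(5))
first_pass = list(squares)     # pulls from the generator, filling the cache
second_pass = list(squares)    # served entirely from the cache
vec = [1, 2, 3]
passthrough = cached_seq(vec)  # realized collections get no extra wrapping
```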
>
> > There are many
> > benefits of the thread-safe once-and-only-once evaluation promise of
> > lazy-cons that you may not be considering. It is certainly not a bad
> > default.
>
> Well, I've tried as best as I can to articulate my reasoning that
> lazy-cons-cached is in fact a bad default. It is potentially
> unintuitive, hard to verify correctness and performance bounds, and
> difficult to opt out of. The Scala programming language offers both
> Streams (uncached lazy lists) and LazyLists (cached lazy lists), and
> generally uses Streams as the default. F# offers both Seqs (uncached
> lazy lists) and LazyLists (cached lazy lists), and generally uses Seqs
> as the default. Both of these languages are among the closest
> comparisons to Clojure, in terms of meshing practical functional
> programming with an existing imperative VM, and both have made this
> design choice to favor uncached lazy lists. And in fact, it turns out
> that in those languages, cached lazy lists end up rarely used. And of
> course,
> languages like Python get along just fine with "generators" and no
> built-in cached variant at all, other than explicitly realizing to a
> list. Haskell, of course, is the one exception. They use
> cached-lazy-lists by default, but then again, they really have to,
> because it's the only sensible thing in a language where everything is
> lazy by default.
>
> > There will soon be a streams/generator library, intended for i/o, that
> > will do one-pass non-cached iteration. It will offer a different set
> > of tradeoffs and benefits, including reduced ease-of use - not being
> > persistent removes a lot of things like first/rest/nth, destructuring
> > etc.
>
> Right, I'm talking about making something that works just like any
> other sequence, supporting first/rest/nth and destructuring.
>
>
>
> > But the important thing is tradeoffs/benefits. They always come
> > together.
>
> Yes, but hopefully I can convince you that in this case, the tradeoffs
> fall clearly on the side of defaulting to uncached lazy lists.
>
I'd appreciate it if you could make your arguments more succinct.
The argument you've made, however, is theoretical.
I've tried what you said. In fact, it's sitting there in the
implementation right now - just uncomment lazy-seq. cached-seq is
there, as is a non-caching LazySeq class implementing ISeq, and all of
the library functions can be defined in terms of it. I also did that,
and then tried it on some common code.
You should try it too, and report back your actual experience.
My experience was:
Everything with side-effects breaks. Lots of performance problems,
because it is not as simple as you say - not just clear multiple
traversals, but often casual multiple local references to (rest x) and
other parts of the seq, causing multiple evaluations and corresponding
runtime multipliers. In short, an entirely different set of things you
need to be careful of, and far more often. I became so concerned about
having to support it that I commented out lazy-seq.
The short of it is, the problem you encountered is not, in fact,
common. But I encourage you to try your idea, as I said, I've already
implemented it and you can try it right now.
Rich
You received this message because you are subscribed to the Google Groups
"Clojure" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to
[email protected]
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en