Re: [heka] State and future of Heka

Dieter Plaetinck Fri, 06 May 2016 18:47:16 -0700

On Fri, 6 May 2016 10:51:01 -0700
Rob Miller <[email protected]> wrote:


> Hi everyone,
> 
> I'm loooong overdue in sending out an update about the current state of 
> and plans for Heka. Unfortunately, what I have to share here will 
> probably be disappointing for many of you, and it might impact whether 
> or not you want to continue using it, as all signs point to Heka getting 
> less support and fewer updates moving forward.
> 
> The short version is that Heka has some design flaws that make it hard 
> to incrementally improve it enough to meet the high throughput and 
> reliability goals that we were hoping to achieve. While it would be 
> possible to do a major overhaul of the code to resolve most of these 
> issues, I don't have the personal bandwidth to do that work, since most 
> of my time is consumed working on Mozilla's immediate data processing 
> needs rather than general purpose tools these days. Hindsight 
> (https://github.com/trink/hindsight), built around the same Lua sandbox 
> technology as Heka, doesn't have these issues, and internally we're 
> using it more and more instead of Heka, so there's no organizational 
> imperative for me (or anyone else) to spend the time required to 
> overhaul the Go code base.
> 
> Heka is still in use here, though, especially on our edge nodes, so it 
> will see a bit more improvement and at least a couple more releases. 
> Most notably, it's on my list to switch to using the most recent Lua 
> sandbox code, which will move most of the protobuf processing to custom 
> C code, and will likely improve performance as well as remove a lot of 
> the problematic cgo code, which is what's currently keeping us from 
> being able to upgrade to a recent Go version.
> 
> Beyond that, however, Heka's future is uncertain. The code that's there 
> will still work, of course, but I may not be doing any further 
> improvements, and my ability to keep up with support requests and PRs, 
> already on the decline, will likely continue to wane.
> 
> So what are the options? If you're using a significant amount of Lua 
> based functionality, you might consider transitioning to Hindsight. Any 
> Lua code that works in Heka will work in Hindsight. Hindsight is a much 
> leaner and more solid foundation. Hindsight has far fewer i/o plugins 
> than Heka, though, so for many it won't be a simple transition.
> 
> Also, if there's someone out there (an organization, most likely) that 
> has a strong interest in keeping Heka's codebase alive, through funding 
> or coding contributions, I'd be happy to support that endeavor. Some 
> restrictions apply, however; the work that needs to be done to improve 
> Heka's foundation is not beginner level work, and my time to help is 
> very limited, so I'm only willing to support folks who demonstrate that 
> they are up to the task. Please contact me off-list if you or your 
> organization is interested.
> 
> Anyone casually following along can probably stop reading here. Those of 
> you interested in the gory details can read on to hear more about what 
> the issues are and how they might be resolved.
> 
> First, I'll say that I think there's a lot that Heka got right. The 
> basic composition of the pipeline (input -> split -> decode -> route -> 
> process -> encode -> output) seems to hit a sweet spot for composability 
> and reuse. The Lua sandbox, and especially the use of LPEG for text 
> parsing and transformation, has proven to be extremely efficient and 
> powerful; it's the most important and valuable part of the Heka stack. 
> The routing infrastructure is efficient and solid. And, perhaps most 
> importantly, Heka is useful; there are a lot of you out there using it 
> to get work done.
> 
> There was one fundamental mistake made, however, which is that we 
> shouldn't have used channels. There are many competing opinions about Go 
> channels. I'm not going to get into whether or not they're *ever* a 
> good idea, but I will say unequivocally that their use as the means of 
> pushing messages through the Heka pipeline was a mistake, for a number 
> of reasons.
> 
> First, they don't perform well enough. While Heka performs many tasks 
> faster than some other popular tools, we've consistently hit a 
> throughput ceiling thanks to all of the synchronization that channels 
> require. And this ceiling, sadly, is generally lower than is acceptable 
> for the amount of data that we at Mozilla want to push through our 
> aggregators on a single system.
> 
> Second, they make it very hard to prevent message loss. If unbuffered 
> channels are used everywhere, performance plummets unacceptably due to 
> context-switching costs. But using buffered channels means that many 
> messages are in flight at a time, most of which are sitting in channels 
> waiting to be processed. Keeping track of which messages have made it 
> all the way through the pipeline requires complicated coordination 
> between chunks of code that are conceptually quite far away from each other.
> 
> Third, the buffered channels mean that Heka consumes much more RAM than 
> would be otherwise needed, since we have to pre-allocate a pool of 
> messages. If the pool size is too small, then Heka becomes susceptible 
> to deadlocks, with all of the available packs sitting in channel queues, 
> unable to be processed because some plugin is blocked on waiting for an 
> available pack. But cranking up the pool size causes Heka to use more 
> memory, even when it's idle.
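
To make the pool-exhaustion failure mode above concrete, here is a minimal Go sketch. It is not Heka's code: the `pack` type, `newPool`, `tryAcquire` and `starved` are hypothetical names made up for illustration, and a timeout stands in for what would be a permanent block in a real pipeline.

```go
package main

import (
	"fmt"
	"time"
)

// pack stands in for a pipeline message container (hypothetical type).
type pack struct{ payload string }

// newPool pre-allocates n packs, so their memory is held even when idle.
func newPool(n int) chan *pack {
	pool := make(chan *pack, n)
	for i := 0; i < n; i++ {
		pool <- &pack{}
	}
	return pool
}

// tryAcquire takes a pack from the pool, giving up after timeout so that
// this demo reports starvation instead of hanging forever.
func tryAcquire(pool chan *pack, timeout time.Duration) (*pack, bool) {
	select {
	case p := <-pool:
		return p, true
	case <-time.After(timeout):
		return nil, false
	}
}

// starved reports whether a plugin that needs one more pack is stuck
// while `queued` packs sit in a buffered downstream channel.
func starved(poolSize, queued int) bool {
	pool := newPool(poolSize)
	downstream := make(chan *pack, poolSize)
	for i := 0; i < queued; i++ {
		p, ok := tryAcquire(pool, 10*time.Millisecond)
		if !ok {
			return true // pool ran dry while filling the queue
		}
		p.payload = fmt.Sprintf("msg-%d", i)
		downstream <- p
	}
	// The plugin draining downstream needs a fresh pack for its own
	// output before it can release any of the queued ones.
	_, ok := tryAcquire(pool, 10*time.Millisecond)
	return !ok
}

func main() {
	fmt.Println(starved(2, 2)) // true: every pack is stuck in the queue
	fmt.Println(starved(3, 2)) // false: one spare pack is still available
}
```

A bigger pool avoids the stall in this toy case, at the cost of more memory allocated up front, which is the tradeoff described above.
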
> 
> Hindsight avoids all of these problems by using disk queues instead of 
> RAM buffers between all of the processing stages. It's a bit 
> counterintuitive, but at high throughput performance is actually better 
> than with RAM buffers, because a) there's no need for synchronization 
> locks and b) the data is typically read quickly enough after it's 
> written that it stays in the disk cache.
> 
> There's much less chance of message loss, because every plugin is 
> holding on to only one message in memory at a time, while using a 
> written-to-disk cursor file to track the current position in the disk 
> buffer. If the plug is pulled mid-process, some messages that were 
> already processed might be processed again, but nothing will be lost, 
> and there's no need for complex coordination between different stages of 
> the pipeline.
> 
> Finally, there's no need for a pool of messages. Each plugin is holding 
> some small number of packs (possibly as few as one) in its own memory 
> space, and those packs never escape that plugin's ownership. RAM usage 
> doesn't grow, and pool exhaustion related deadlocks are a thing of the past.
> 
> For Heka to have a viable future, it would basically need to be updated 
> to work almost exactly like Hindsight. First, all of the APIs would need 
> to be changed to no longer refer to channels. (The fact that we exposed 
> channels to the APIs is another mistake we made... it's now generally 
> frowned upon in Go land to expose channels as part of your public APIs.) 
> There's already a non-channel based API for filters and outputs, but 
> most of the plugins haven't yet been updated to use the new API, which 
> would need to happen.
> 
> Then the hard work would start; a major overhaul of Heka's internals, to 
> switch from channel based message passing to disk queue based message 
> passing. The work that's been done to support disk buffering for filters 
> and outputs is useful, but not quite enough, because it's not scalable 
> for each plugin to have its own queue; the number of open file 
> descriptors would grow very quickly. Instead it would need to work like 
> Hindsight, where there's one queue that all of the inputs write to, and 
> another that filters write to. Each plugin reads through its specified 
> input queue, looking for messages that match its message matcher, 
> writing its location in the queue back to the shared cursors file.
> 
> There would also be some complexity in reconciling Heka's breakdown of 
> the input stage into input/splitter/decoder with Hindsight's 
> encapsulation of all of these stages into a single sandbox.
> 
> Ultimately I think this would be at least 2-3 months full time work for 
> me. I'm not the fastest coder around, but I know where the bodies are 
> buried, so I'd guess it would take anyone else at least as long, 
> possibly longer if they're not already familiar with how everything is 
> put together.
> 
> And that's about it. If you've gotten this far, thanks for reading. 
> Also, thanks to everyone who's contributed to Heka in any way, be it by 
> code, doc fixes, bug reports, or even just appreciation. I'm sorry for 
> those of you using it regularly that there's not a more stable future.
> 
> Regards,
> 
> -r
> _______________________________________________
> Heka mailing list
> [email protected]
> https://mail.mozilla.org/listinfo/heka



Thanks for sharing, Rob.
Agreed with your points about channels.
I think the "use channels" mantra is too strong, as channels are often a bad
choice in high-performance programs, and I wish the Go team did a better job of
exposing that nuance, or even of explaining the tradeoffs and implementation
details. BTW, wasn't there an upcoming channels refactor? I remember Dmitry
Vyukov made a proposal, but I'm not sure what the latest is.

FWIW the Hindsight design you described sounds a lot like the upcoming NSQ
refactoring, which is based on a single disk-backed write-ahead log with a
bunch of readers, each keeping a cursor into the file on disk. See:
* https://github.com/nsqio/nsq/pull/625
* https://github.com/mreiferson/wal (highly recommended for anyone interested;
  a nice piece of code for a WAL that could serve as a starting point)
* http://dieter.plaetinck.be/post/interview-matt-reiferson-nsq/

Of course NSQ, being just a messaging system, doesn't provide any of the
input/output/processing plugins that Heka and Hindsight do, but it shows you're
not the only one realizing this is the path forward, and checking out their
code could prove valuable when refactoring Heka.
(In the past I've found it surprisingly easy to rip out their diskqueue code
and use it in one of my projects; see
http://dieter.plaetinck.be/post/transplanting-go-packages-for-fun-and-profit/)

Ironically, those folks also haven't had much time to push the WAL refactor
forward.

anyway,
take care!


