To be blunt, all of the alternative hardware ideas people have tried for
memcached have had practicality issues. Cost, complexity, or communication
with the main CPU tends to kill them.

It's fun to toy with, and at some point someone will make something usable,
I hope. The Microsoft change is neat, Solarflare cards have been
interesting, and Amazon F1 is interesting (but not connected to the network
so far as I can tell).

Honestly it's probably more practical for building low-power cache
systems, or low-power medium-usage systems with high-speed interconnects
to flash storage. You need to implement the TCP stack, the entire daemon,
etc., away from the x86 machine, which is a lot of work.

Hybrid approaches (like hot key offloading) seem alright, but without fast
access to main memory any sort of scanning workload won't work. You can
get close with offloading by having the FPGA handle networking and DMA
buffers to/from userland network stacks, or data structures for attaching
item data to a response. I.e., an FPGA would have 32-64GB of memory
directly attached and manage the stack, connection buffers, hash table,
and object headers (plus, likely, small values stored directly). Larger
values could be held in main memory, and data could be bulk-requested to
manage high rates.
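
As a very rough sketch of what that split could look like (hypothetical
structure, nothing implemented):

    /* Sketch only: item headers (and small values) live in FPGA-attached
     * RAM; large values stay in host memory and get bulk-fetched over
     * DMA while building a response. All names here are made up. */
    #include <stdint.h>

    #define SMALL_VALUE_MAX 128          /* assumed inline-value cutoff */

    struct fpga_item {
        uint32_t hash_next;              /* next in chain (FPGA RAM offset) */
        uint32_t exptime;                /* expiration time */
        uint32_t value_len;
        uint8_t  key_len;
        uint8_t  flags;
        union {
            uint8_t  small[SMALL_VALUE_MAX]; /* small values stored inline */
            uint64_t host_addr;          /* DMA address of large value */
        } v;
        uint8_t key[];                   /* key bytes follow the header */
    };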

It's either that, or the FPGA gets pools of RAM or flash via bank
switching or something similar, to deal with the cache memory directly.
This is how the exotic Tilera-style machines worked, with lots of NUMA
banks (at a high level, anyway).

Unless something new has happened that I don't know about :) PCI has a lot
of bandwidth, but transaction latency is still a limiter.

In a practical sense, memcached can saturate 20-40Gbps easily if the time
spent in the kernel is minimized. You get there quickly by pipelining. You
can do it today by sticking to multigets, or by stacking sets/gets via
proxies and using binprot or ASCII noreply.
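
For example, the ASCII protocol lets you stack several commands into a
single write; noreply suppresses the set responses, and the gets collapse
into one multiget, so many operations share one packet each way:

    set foo 0 0 3 noreply
    bar
    set baz 0 0 3 noreply
    qux
    get key1 key2 key3

The server reads all of that in order and only the multiget generates a
response.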

Some advances should be coming to memcached's frontend to help narrow the
gap and allow all types of requests to pipeline/batch. Then FPGAs aren't
quite as relevant for getting high performance/low latency.

Then you could have, say, a proxy on each client machine that gathers
requests together, pipelining them to memcached servers (the spymemcached
client has done this internally forever, but the current implementation of
binprot generates too many packets at large sizes).
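
A minimal sketch of that sort of client-side batching (hypothetical names,
plain ASCII protocol assumed):

    /* Queue commands into one buffer, then flush with a single write()
     * so many requests share one syscall and, ideally, one packet. */
    #include <string.h>
    #include <unistd.h>

    struct batch {
        int    fd;                /* connected memcached socket */
        size_t used;
        char   buf[16384];
    };

    /* Append one ASCII command; caller includes the trailing \r\n. */
    static int batch_queue(struct batch *b, const char *cmd, size_t len) {
        if (b->used + len > sizeof(b->buf))
            return -1;            /* full: flush before queueing more */
        memcpy(b->buf + b->used, cmd, len);
        b->used += len;
        return 0;
    }

    /* Ship everything queued so far in one syscall. */
    static ssize_t batch_flush(struct batch *b) {
        ssize_t n = write(b->fd, b->buf, b->used);
        if (n == (ssize_t)b->used)
            b->used = 0;
        return n;
    }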

Give me a month or two, maybe? I just merged some fixes for the frontend
and have been giving it more thought.

On Tue, 24 Jan 2017, 'Scott Mansfield' via memcached wrote:

> A colleague recently forwarded this 2014 paper to me:
> https://www.cs.princeton.edu/courses/archive/spring16/cos598F/06560058.pdf
> It's an interesting read. I believe the speedup was based on being able
> to serve hits for hot keys effectively out of the FPGA, which would
> otherwise forward the request to the main process. This would require
> your FPGA to be in the hot path from NIC to CPU, though, so that may or
> may not work for you.
>
> IMO this won't work well for small things (e.g. hashing) because the
> overhead of data transfer alone would be slower than the action
> performed.
>
> Not directly related, but I'd hope you're aware of the in-line network
> acceleration Microsoft has done in their datacenters. It's some really
> cool stuff and could enlighten you on techniques to use for an inline
> accelerator as it relates to parsing network data:
> https://www.microsoft.com/en-us/research/publication/configurable-cloud-acceleration/
>
>
> Scott Mansfield
> Product > Consumer Science Eng > EVCache > Sr. Software Eng
> {
>   M: 352-514-9452
>   E: [email protected]
>   K: {M: mobile, E: email, K: key}
> }
>
> On Tue, Jan 24, 2017 at 8:11 AM, Ravikiran Gummaluri
> <[email protected]> wrote:
>       Hi,
>       We are trying to offload some of the functionality of memcached to
>       an FPGA to accelerate it. We are exploring possible software
>       bottlenecks and accelerating them using FPGAs. If anyone has
>       already done some profiling and can help us understand which
>       functionalities would improve performance, any suggestions are
>       welcome.
>
>       Thanks & Regards
>       Ravi G
>
>       From: Scott Mansfield [mailto:[email protected]]
>       Sent: Tuesday, January 24, 2017 7:40 AM
>       To: memcached <[email protected]>
>       Cc: Ravikiran Gummaluri <[email protected]>; Venkata Ravi Shankar
>       Jonnalagadda <[email protected]>; Sunita Jain <[email protected]>
>       Subject: Re: Ordering of commands per connection
>
>       I'm actually also very interested to see anything you can share
>       about your project.
>
>       On Monday, January 23, 2017 at 12:50:03 PM UTC-8, Dormando wrote:
>       Hey,
>
>       I've always wanted to try implementing a server with a Xilinx
>       chip. Seems like you folks would be more qualified to do that :)
>
>       The short answer is that the server does guarantee order right
>       now. The ASCII protocol doesn't work very well if you reorder the
>       results, but more importantly, all clients will have been written
>       with that assumption in mind.
>
>       The longer answer is that the binary protocol can technically
>       allow reordering, but it's unclear if any clients support that.
>       Binprot uses opaques, or returns keys, to match responses to
>       requests.
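>
>       For reference, the binprot request header carries a 32-bit opaque
>       that the server echoes back verbatim, which is what would let a
>       client match reordered responses. A simplified sketch of the
>       24-byte header (fields are big-endian on the wire):
>
>           #include <stdint.h>
>
>           struct bin_req_header {   /* simplified from protocol_binary.h */
>               uint8_t  magic;       /* 0x80 = request, 0x81 = response */
>               uint8_t  opcode;      /* e.g. 0x00 = GET */
>               uint16_t key_len;
>               uint8_t  extras_len;
>               uint8_t  data_type;
>               uint16_t vbucket;     /* status field in responses */
>               uint32_t body_len;    /* extras + key + value lengths */
>               uint32_t opaque;      /* echoed back untouched */
>               uint64_t cas;
>           };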
>
>       You can still parallelize an ordered ASCII multiget (i.e., "get
>       key1 key2 key3") by creating the iovec structures ahead of time,
>       doing the hashing/lookups in parallel, and filling in the results
>       before sending the response.
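>
>       A rough sketch of that approach (hash_lookup and fill_value_iovs
>       are hypothetical stand-ins for the real lookup and response
>       assembly):
>
>           #include <string.h>
>           #include <sys/uio.h>
>
>           #define SLOTS_PER_KEY 3   /* "VALUE ..." header, data, "\r\n" */
>
>           typedef struct item item;  /* opaque item header */
>           extern item *hash_lookup(const char *key);
>           extern void fill_value_iovs(struct iovec *iov, item *it);
>
>           void respond_multiget(int fd, char **keys, int nkeys) {
>               struct iovec iov[nkeys * SLOTS_PER_KEY + 1];
>               memset(iov, 0, sizeof(iov));  /* misses stay zero-length */
>               /* each key owns fixed slots, so lookups can run in
>                * parallel without reordering the final response */
>               for (int i = 0; i < nkeys; i++) {
>                   item *it = hash_lookup(keys[i]);
>                   if (it)
>                       fill_value_iovs(&iov[i * SLOTS_PER_KEY], it);
>               }
>               iov[nkeys * SLOTS_PER_KEY] =
>                   (struct iovec){ (void *)"END\r\n", 5 };
>               writev(fd, iov, nkeys * SLOTS_PER_KEY + 1);
>           }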
>
>       With binprot, each get/response is independently packaged, so it's
>       a bit easier, although the protocol bloat makes it less useful at
>       high rates.
>
>       People have already written papers on implementing memcached with
>       FPGAs or highly parallel microprocessors (Tilera, MIT's Tilera
>       precursor, etc.). Hopefully you're familiar with them before
>       diving into this.
>
>       May I ask if you can share any other details of this project? Is
>       it a proof of concept or some kind of product?
>
>       have fun,
>       -Dormando
>
>       On Mon, 23 Jan 2017, Ravi Kiran wrote:
>
>       > Hi,
>       > We are planning to use the memcached software and accelerate it
>       > with hardware offload. We would like to know: from a protocol
>       > perspective, must each connection maintain the order in which it
>       > receives commands when sending responses back?
>       > For example: if we receive GET1 GET2 SET1 GET3, do we need to
>       > send the responses in that same order (GET1 GET2 SET1 GET3)? Can
>       > we parallelize commands and send responses out of order?
>       >
>       > Thanks & Regards
>       > Ravi G
