http://www.fragmentationneeded.net/2011/12/pricing-and-trading-networks-down-is-up.html
Pricing and Trading Networks: Down is Up, Left is Right

My introduction to enterprise networking was a little backward. I started out supporting trading floors, backend pricing systems, low-latency algorithmic trading systems, etc... I got there because I'd been responsible for UNIX systems producing and consuming multicast data at several large financial firms. Inevitably, the firms' network admin folks weren't up to speed on matters of performance tuning, multicast configuration and QoS, so that's where I focused my attention. One of these firms offered me a job with the word "network" in the title, and I was off to the races.

It amazes me how little I knew in those days. I was doing PIM and MSDP designs before the phrases "link state" and "distance vector" were in my vocabulary! I had no idea what was populating the unicast routing table of my switches, but I knew that the table was populated, and I knew what PIM was going to do with that data.

More incredible is how my ignorance of "normal" ways of doing things (AVVID, SONA, Cisco Enterprise Architecture, multi-tier designs, etc...) gave me an advantage over folks who had been properly indoctrinated. My designs worked well for these applications, but looked crazy to the rest of the network staff (whose underperforming traditional designs I was replacing).

The trading floor is a weird place, with funny requirements. In this post I'm going to go over some of the things that make trading floor networking... interesting.

Redundant Application Flows

The first thing to know about pricing systems is that you generally have two copies of any pricing data flowing through the environment at any time. Ideally, these two sets originate from different head-end systems, get transit from different wide area service providers, ride different physical infrastructure into opposite sides of your data center, and terminate on different NICs in the receiving servers.

If you're getting data directly from an exchange, that data will probably be arriving as multicast flows. Redundant multicast flows. The same data arrives at your edge from two different sources, using two different multicast groups. If you're buying data from a value-add aggregator (Reuters, Bloomberg, etc...), then it probably arrives via TCP from at least two different sources. The data may be duplicate copies (redundancy), or be distributed among the flows with an N+1 load-sharing scheme.

Losing One Packet Is Bad

Most application flows have no problem with packet loss. High performance trading systems are not in this category.

Think of the state of the pricing data like a spreadsheet. Each row represents a security -- something that traders buy and sell. The columns represent attributes of that security: bid price, ask price, daily high and low, last trade price, last trade exchange, etc... Our spreadsheet has around 100 columns and 200,000 rows. That's 20 million cells. Every message that rolls in from a multicast feed updates one of those cells.

You just lost a packet. Which cell is wrong? Easy answer: all of them. If a trader can't trust his data, he can't trade.
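Here's roughly how the receiving side treats those redundant flows. This is a minimal sketch, not any particular vendor's implementation: it assumes each message carries a per-line sequence number shared by the "A" and "B" copies, arbitrates between the two lines, and marks the whole cached image suspect the moment a sequence number goes missing on both. All the names (FeedArbiter, on_message) are hypothetical.

```python
# Minimal sketch of A/B line arbitration and gap detection.
# Assumes each feed message carries a monotonically increasing sequence
# number shared by both the "A" and "B" copies of the stream.
# All class/method names here are hypothetical, for illustration only.

class FeedArbiter:
    def __init__(self):
        self.expected_seq = 1          # next sequence number we need
        self.image = {}                # (symbol, field) -> value, the "spreadsheet"
        self.image_trusted = True      # one gap and the whole image is suspect

    def on_message(self, line, seq, symbol, field, value):
        """Handle one update arriving on line 'A' or line 'B'."""
        if seq < self.expected_seq:
            return                     # already applied from the other line; drop the duplicate
        if seq > self.expected_seq:
            # Neither line delivered the sequence numbers in between.
            # We don't know which of the ~20M cells those updates touched,
            # so no cell can be trusted until a repair/refresh completes.
            self.image_trusted = False
            self.expected_seq = seq    # keep going; repair happens out of band
        self.image[(symbol, field)] = value
        self.expected_seq += 1

arb = FeedArbiter()
arb.on_message('A', 1, 'CSCO', 'bid', 18.01)
arb.on_message('B', 1, 'CSCO', 'bid', 18.01)   # duplicate from the other line: ignored
arb.on_message('B', 3, 'CSCO', 'ask', 18.03)   # seq 2 never arrived on either line
print(arb.image_trusted)                       # False -- every cell is now suspect
```

The interesting branch is the gap: the receiver can keep applying fresh updates, but it has no way to know which stale cells the missing messages would have touched, so the whole image is suspect until some repair mechanism runs.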
These applications have repair mechanisms, but they're generally slow and/or clunky. Some of them even involve touch tone. Really:

    The Securities Industry Automation Corporation (SIAC) provides a retransmission capability for the output data from host systems. As part of this service, SIAC provides the AutoLink facility to assist vendors with requesting retransmissions by submitting requests over a touch-tone telephone set.

Reconvergence Is Bad

Because we've got two copies of the data coming in, there's no reason to fix a single failure. If something breaks, you can let it stay broken until the end of the day.

What's that? You think it's worth fixing things with a dynamic routing protocol? Okay cool, route around the problem. Just so long as you can guarantee that "flow A" and "flow B" never traverse the same core router. Why am I paying for two copies of this data if you're going to push it through a single device? You just told me that the device is so fragile that you feel compelled to route around failures!

Don't Cluster the Firewalls

The same reason we don't let routing reconverge applies here. If there are two pricing firewalls, don't tell them about each other. Run them as standalone units. Put them in separate rooms, even. We can afford to lose half of a redundant feed. We cannot afford to lose both feeds, even for the few milliseconds required for the standby firewall to take over. Two clusters (four firewalls) would be okay, just keep the "A" and "B" feeds separate!

Don't Team the Server NICs

The flow-splitting logic applies all the way down to the servers. If they've got two NICs available for incoming pricing data, these NICs should be dedicated per-flow. Even if there are NICs-a-plenty, the teaming schemes are all bad news because, like flows, application components are also disposable. It's okay to lose one. Getting one back? That's sometimes worse. Keep reading...

Recovery Can Kill You

Most of these pricing systems include a mechanism for data receivers to request retransmission of lost data, but the recovery can be a problem. With few exceptions, the network applications in use on the trading floor don't do any sort of flow control. It's like they're trying to hurt you.

Imagine a university lecture where a sleeping student wakes up, asks the lecturer to repeat the last 30 minutes, and the lecturer complies. That's kind of how these systems work. Except that the lecturer complies at wire speed, and the whole lecture hall full of students is compelled to continue taking notes. Why should every other receiver be penalized because one system screwed up? I've got trades to clear!

The following snapshot is from the Cisco CVD for trading systems. It shows how aggressive these systems can be. A nominal 5Mb/s trading application regularly hits wire speed (100Mb/s) in this case. The graph shows a small network when things are working right. A big trading backend at a large financial services firm can easily push that green line into the multi-gigabit range. Make things interesting by breaking stuff and you'll over-run even your best 10Gb/s switch buffers (6716 cards have 90MB per port) easily.

Slow Servers Are Good

Lots of networks run with clients deliberately connected at slower speeds than their servers. Maybe you have 10/100 ports in the wiring closet and gigabit-attached servers. Pricing networks require exactly the opposite.

The lecturer in my analogy isn't just a single lecturer. It's a team of lecturers. They all go into wire-speed mode when the sleeping student wakes up. How will you deliver multiple simultaneous gigabit-ish multicast streams to your access ports? You can't. I've fixed more than one trading system by setting server interfaces down to 100Mb/s or even 10Mb/s. Fast clients, slow servers is where you want to be. Slowing down the servers can turn N*1Gb/s worth of data into N*100Mb/s -- something we can actually handle.
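To put rough numbers on that, here's my own back-of-envelope arithmetic using the figures mentioned above (5Mb/s nominal feeds that burst to wire speed, 90MB of buffer per port). The four-server scenario is an assumption for illustration, not a measurement from any real network.

```python
# Back-of-envelope: how long a 90MB port buffer (6716-style) survives when
# recovery traffic exceeds the egress line rate. Illustrative numbers only.

def buffer_survival_ms(ingress_gbps, egress_gbps, buffer_mb=90):
    """Milliseconds until the egress buffer overflows under sustained overload."""
    overload_gbps = ingress_gbps - egress_gbps
    if overload_gbps <= 0:
        return float('inf')            # not oversubscribed; the buffer never fills
    overload_bytes_per_ms = overload_gbps * 1e9 / 8 / 1000
    return buffer_mb * 1e6 / overload_bytes_per_ms

# Four gigabit-attached servers replaying at wire speed toward one 1Gb/s client port:
print(round(buffer_survival_ms(ingress_gbps=4.0, egress_gbps=1.0), 1))   # ~240 ms

# Same four servers with their switch ports forced down to 100Mb/s, so the
# aggregate burst is only 400Mb/s against a 1Gb/s client port:
print(buffer_survival_ms(ingress_gbps=0.4, egress_gbps=1.0))             # inf -- no overflow
```

The exact figures matter less than the shape of the problem: recovery bursts are sized by the length of the gap being repaired, not by the nominal feed rate, so the headroom you think you have disappears quickly.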
Bad Apple Syndrome

The sleeping student example is actually pretty common. It's amazing to see the impact that can arise from things like:

- a clock update on a workstation
- ripping a CD with iTunes
- briefly closing the lid on a laptop

The trading floor is usually a population of Windows machines with users sitting behind them. Keeping these things from killing each other is a daunting task. One bad apple will truly spoil the bunch.

How Fast Is It?

System performance is usually measured in terms of stuff per interval. That's meaningless on the trading floor. The opening bell at NYSE is like turning on a fire hose. The only metrics that matter are the answers to these questions: Did you spill even one drop of water? How close were you to the limit? Will you make it through tomorrow's trading day too? I read on Twitter that Ben Bernanke got a bad piece of fish for dinner. How confident are you now?

Performance of these systems is binary. You either survived or you did not. There is no "system is running slow" in this world.

Routing Is Upside Down

While not unique to trading floors, we do lots of multicast here. Multicast is funny because it relies on routing traffic away from the source, rather than routing it toward the destination. Getting into and staying in this mindset can be a challenge. I started out with no idea how routing worked, so I had no problem getting into the multicast mindset :-)

NACK not ACK

Almost every network protocol relies on data receivers ACKnowledging their receipt of data. But not here. Pricing systems only speak up when something goes missing.

QoS Isn't The Answer

QoS might seem like the answer to make sure that we get through the day smoothly, but it's not. In fact, it can be counterproductive. QoS is about managed un-fairness... choosing which packets to drop. But pricing systems are usually deployed on dedicated systems with dedicated switches. Every packet is critical, and there's probably more of them than we can handle. There's nothing we can drop.

Making matters worse, enabling QoS on many switching platforms reduces the buffers available to our critical pricing flows, because the buffers necessarily get carved so that they can be allocated to different kinds of traffic. It's counterintuitive, but 'no mls qos' is sometimes the right thing to do.

Load Balancing Ain't All It's Cracked Up To Be

By default, CEF doesn't load balance multicast flows. CEF load balancing of multicast can be enabled and enhanced, but it doesn't happen out of the box. We can get screwed on EtherChannel links too: sometimes these quirky applications intermingle unicast data with the multicast stream.

Perhaps a latecomer to the trading floor wants to start watching Cisco's stock price. Before he can begin, he needs all 100 cells associated with CSCO. This is sometimes called the "Initial Image." He ignores updates for CSCO until he's got that starting point loaded up. CSCO has updated 9000 times today, so the server unicasts the initial image: "Here are all 100 cells for CSCO as of update #9000: blah blah blah..." Then the price changes, and the server multicasts update #9001 to all receivers.

If there's a load balanced path (either CEF or an aggregate link) between the server and client, then our new client could get update #9001 (multicast) before the initial image (unicast) shows up. The client will discard update #9001 because he's expecting a full record, not an update to a single cell. Next, the initial image shows up, and the client knows he's got everything through update #9000. Then update #9002 arrives. Hey, what happened to #9001?

Post-mortem analysis of these kinds of incidents will boil down to the software folks saying: we put the messages on the wire in the correct order; they were delivered by the network in the wrong order.
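A sketch of that client-side logic shows why the reordering hurts: the decision to drop the early multicast update is made before the client has any way to know it will need it. The message framing and names here are hypothetical, not any particular vendor's API.

```python
# Sketch of the "initial image, then sequenced updates" client described above.
# Field names, update contents and the framing are invented for illustration.

class SymbolWatcher:
    def __init__(self, symbol):
        self.symbol = symbol
        self.cells = None          # no initial image yet
        self.last_seq = None       # last update reflected in our copy

    def on_initial_image(self, seq, cells):
        self.cells = dict(cells)   # "everything through update #seq"
        self.last_seq = seq

    def on_multicast_update(self, seq, field, value):
        if self.cells is None:
            # No initial image yet: the client expects a full record,
            # not a single-cell update, so the update is simply discarded.
            return
        if seq != self.last_seq + 1:
            print(f"gap: never saw updates {self.last_seq + 1}..{seq - 1}")
        self.cells[field] = value
        self.last_seq = seq

w = SymbolWatcher('CSCO')
w.on_multicast_update(9001, 'last', 18.02)     # beats the unicast image through the load-balanced path: dropped
w.on_initial_image(9000, {'bid': 18.01, 'ask': 18.03})
w.on_multicast_update(9002, 'last', 18.04)     # prints: gap: never saw updates 9001..9001
```

One obvious mitigation is for the client to buffer updates that arrive before the image and replay the ones newer than the image's sequence number, but that only helps if the software was written expecting reordering in the first place.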
ARP Times Out

NACK-based applications sit quietly until there's a problem. So quietly that they might forget the hardware address associated with their gateway or with a neighbor. No problem, right? ARP will figure it out... eventually.

Because these are generally UDP-based applications without flow control, the system doesn't fire off a single packet, then sit and wait like it might when talking TCP. No, these systems can suddenly kick off a whole bunch of UDP datagrams destined for a system they haven't talked to in hours. The lower layers in the IP stack need to hold onto these packets until the ARP resolution process is complete. But the packets keep rolling down the stack! The outstanding ARP queue is only one packet deep in many implementations. The queue overflows and data is lost. It's not strictly a network problem, but don't worry. Your phone will ring.

Losing Data Causes You to Lose Data

There's a nasty failure mode underlying the NACK-based scheme. Lost data will be retransmitted. If you couldn't handle the data flow the first time around, why expect to handle wire-speed retransmission of that data on top of the data that's coming in the next instant?

If the data loss was caused by a Bad Apple receiver, then all his peers suffer the consequences. You may have many bad apples in a moment. One Bad Apple will spoil the bunch. If the data loss was caused by an overloaded network component, then you're rewarded with compounding increases in packet rate. The exchanges don't stop trading, and the data sources have a large queue of data to re-send. TCP applications slow down in the face of congestion. Pricing applications speed up.

Packet Decodes Aren't Available

Some of the wire formats you'll be dealing with are closed-source secrets. Others are published standards for which no Wireshark decodes are publicly available. Either way, you're pretty much on your own when it comes to analysis.
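In practice that means hand-rolling the decode yourself. Here's a minimal sketch of the kind of throwaway tooling involved; the 12-byte header layout (sequence number, message type, symbol) is entirely invented, standing in for whatever your feed's developer guide actually specifies.

```python
# Throwaway decoder for a hypothetical pricing-feed header.
# The real layout comes from the feed's developer guide; this 12-byte
# header (sequence, message type, padded symbol) is invented for illustration.
import struct

HEADER = struct.Struct('!I H 6s')   # seq (uint32), msg type (uint16), symbol (6 bytes)

def decode(payload: bytes):
    seq, msg_type, raw_symbol = HEADER.unpack_from(payload)
    body = payload[HEADER.size:]
    return {
        'seq': seq,
        'type': msg_type,
        'symbol': raw_symbol.rstrip(b'\x00').decode('ascii'),
        'body': body,
    }

# Feed it captured UDP payloads (e.g. exported from a pcap) and you can at
# least spot sequence gaps, even without a proper Wireshark dissector.
sample = HEADER.pack(9001, 1, b'CSCO\x00\x00') + b'\x00\x01\x02'
print(decode(sample))
```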
Updates

Responding to Will's question about data sources: The streams come from the various exchanges (NASDAQ, NYSE, FTSE, etc...). Because each of these exchanges uses its own data format, there are usually some layers of processing required to get them into a common format for application consumption. This processing can happen at a value-add data distributor (Reuters, Bloomberg, Activ), or it can be done in-house by the end user. Local processing has the advantage of lower latency because you don't have to have the data shipped from the exchange to a middleman before you see it.

Other streams come from application components within the company. There are usually some layers of processing (between 2 and 12) between a pricing update first hitting your equipment and when that update is consumed by a trader. The processing can include format changes, addition of custom fields, delay engines (delayed data can be given away for free), vendor-switch systems (I don't trust data vendor "A", switch me to "B"), etc... Most of those layers are going to be multicast, and they're going to be the really dangerous ones, because the sources can clobber you with LAN speeds, rather than WAN speeds.

As far as getting the data goes, you can move your servers into the exchange's facility for low-latency access (some exchanges actually provision the same length of fiber to each colocated customer, so that nobody can claim a latency disadvantage), you can provision your own point-to-point circuit for data access, you can buy a fat local loop from a financial network provider like BT/Radianz (probably MPLS on the back end, so that one local loop can get you to all your pricing and clearing partners), or you can buy the data from a value-add aggregator like Reuters or Bloomberg.

Responding to Will's question about SSM: I've never seen an SSM pricing component. They may be out there, but they might not be a super good fit. Here's why:

Everything in these setups is redundant, all the way down to the software components. It's redundant in ways we're not used to seeing in enterprises. No load balancer required here. The software components collaborate and share workload dynamically. If one ticker plant fails, his partner knows what update was successfully transmitted by the dead peer, and takes over from that point. Consuming systems don't know who the servers are, and don't care. A server could be replaced at any moment.

In fact, it's not just downstream pricing data that's multicast. Many of these systems use a model where the clients don't know who the data sources are. Instead of sending requests to a server, they multicast their requests for data, and the servers multicast the replies back. Instead of:

    <handshake> hello server, nice to meet you. I'd like such-and-such.

it's actually:

    hello? servers? I'd like such-and-such! I'm ready, so go ahead and send it whenever...

Not knowing who your server is kind of runs counter to the SSM ideal. It could be done with a pool of servers, I've just never seen it.
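In socket terms, the request/reply pattern looks something like the sketch below. It's a bare-bones illustration with made-up group addresses and none of the sequencing, arbitration or recovery machinery discussed earlier; the point is just that the client binds to groups, not to servers, which is exactly what makes SSM's know-your-source model awkward here.

```python
# Bare-bones multicast request/reply client. Group addresses and the wire
# format are invented for illustration; real systems layer sequencing,
# arbitration and recovery on top of something like this.
import socket
import struct

REQUEST_GROUP = ('239.1.1.1', 5000)   # hypothetical: where clients shout requests
REPLY_GROUP   = ('239.1.1.2', 5001)   # hypothetical: where any server answers

# Join the reply group -- note that we never learn or configure a server address.
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
sock.bind(('', REPLY_GROUP[1]))
mreq = struct.pack('4s4s', socket.inet_aton(REPLY_GROUP[0]), socket.inet_aton('0.0.0.0'))
sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)

# "hello? servers? I'd like such-and-such!" -- multicast the request.
sock.sendto(b'REQ CSCO', REQUEST_GROUP)

# Whichever server owns that data at this instant multicasts the reply.
data, sender = sock.recvfrom(65535)
print(sender, data)   # the client only discovers a source address after the fact
```

Because the reply could come from any member of the server pool at any moment, the receiver has no stable source to name in an SSM subscription, which is the mismatch described above.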
The exchanges are particularly slow-moving when it comes to changing things. The modern exchange feeds, particularly ones like the "touch tone" example I cited, are literally ticker-tape punch signals wrapped up in IP multicast headers. The old school scheme was to have a ticker tape machine hooked to a "line" from the exchange. Maybe you'd have two of them (A and B again). There would be a third one for retransmits.

Ticker machine run out of paper? Call the exchange, and here's more-or-less what happens:

1. Cut the chunk of paper containing the updates you missed out of their spool of tape. Scissors are involved here.
2. Grab a bit of header tape that says: "this is retransmit data for XYZ Bank".
3. Tape these two pieces of paper together, and feed them through a reader that's attached to the "retransmit line".
4. Every bank in New York will get the retransmits, but they'll know to ignore them.
5. XYZ Bank clips the retransmit data out of the retransmit ticker machine, and pastes it into place on the end where the machine ran out of paper.

These terms ("tick", "line", "retransmit", etc...) all still apply with modern IP-based systems. I've read the developer guides for these systems (to write Wireshark decodes), and it's like a trip back in time. Some of these systems are still so closely coupled to the paper-punch system that you get chads all over the floor and paper cuts all over your hands just from reading the API guide :-)