Sunday, April 20, 2014

Buffer Cache Makes Slow Disks Seem Fast, Till You Need Them.

Linux has this wonderful thing called the buffer cache (for more detail read here). In summary, it uses all your free RAM as a cache for file access. Because of the buffer cache you can easily get sub-millisecond response times.
However, this sets a lot of people up for a trap. Imagine you buy a “database server” with a 5400 RPM hard disk at Best Buy, and while you’re there you pick up an extra 8 gigs of RAM. After loading the latest Ubuntu and restoring a 1 gig customer database backup, you check to see how much RAM you’re using and you have 2 gigs free. You test the new server out and records are coming off it at an unbelievable speed. You’re happy, your boss is happy, you look like a genius.
Later that month your company acquires a new firm, and you write a script to load their 3 gigs worth of customer data into the same customer database. The next thing you know your server is behind, your website is timing out, and your data takes forever to come back; yet when you access a common page you use for testing, the responses are instant. What gives?
Before, your files were all effectively being accessed from RAM. Maybe on the initial restart of the server things were slow, but eventually all your data was cached in RAM and the workload never exceeded the latencies RAM could provide. Once you could only cache part of your frequently accessed data, you entered a weird world where some data comes back in 1 millisecond and some data comes back in 2 seconds, because the disk can never catch up, nor would it ever have kept up with your normal workload if you had no buffer cache. However, you had never encountered this scenario before, so you never realized your server could not hope to keep up without a ton of RAM to throw at the problem. You call your original Linux mentor, he has you buy some good SSDs, you install them in your server, and once you restore from backup everything runs fine. Not as fast as before, but no one can tell the difference. Why the big difference?
Because the buffer cache was hiding the bad disk configuration from the get-go, and once that little 5400 RPM hard disk had to do some real work it quickly fell behind and was never able to catch up.
This happens more frequently than you’d think. I’ve seen people who are super happy with their 6-figure SANs until they run an application whose working set exceeds the buffer cache, and they quickly find, to their horror, that their expensive SAN is terrible for latency-sensitive workloads, which most databases are (a good background lesson on the importance of latency is here).


The lesson here is that if you ever want to benchmark how a system will do under stress, start with a dataset that can’t fit into the buffer cache, so you’ll know how the system performs when it has to use the disk directly.

Reads and the perils of index tables.

I frequently see index tables used in Cassandra to preserve a single source of truth. It’s important to remember when designing a truly distributed system that relational algebra really doesn’t scale, and in-memory joins will only get you so far (very little, really). So, as we often do in relational systems when a dataset is too expensive to calculate on the fly, we create a materialized view. In Cassandra it’s helpful to think this way about every dataset that would require joins in a relational system.
Since Cassandra is write optimized, let’s take a look at a typical social networking pattern, a “user’s stream”, and see how we’d model it with traditional normalization versus how it looks in Cassandra with denormalized data.
[Figure: user_stream_modeling, the normalized tables versus the denormalized Cassandra table]
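Since the diagram itself isn’t reproduced here, a minimal CQL sketch of the denormalized side might look like the following. The table and column names are illustrative assumptions rather than the original model: every post by someone a user follows is written into that user’s partition, clustered newest-first by post date.

-- Hypothetical denormalized stream table: one partition per follower,
-- with the posts of everyone they follow copied into it.
CREATE TABLE user_stream (
    username text,        -- the follower whose stream this is
    post_date timestamp,  -- when the post was made
    post_id timeuuid,     -- tie-breaker for posts with the same timestamp
    author text,          -- who wrote the post
    post_title text,
    post_body text,
    PRIMARY KEY (username, post_date, post_id)
) WITH CLUSTERING ORDER BY (post_date DESC, post_id DESC);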
Instead of many, many round trips to the database, querying each index table and then querying the results of that index, pulling data across many nodes into one result, which could take on the order of seconds, we can pull from one table and get results back in under 1 millisecond. Furthermore, we get ordering on post date within that user’s partition for free, so things come back very quickly indeed. We’d have to order the comments client side, but even on the biggest comment threads that would be a fast operation (if this won’t work, there are other modeling options from here).
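Against the sketch above, fetching a user’s stream is then a single-partition read (the names, again, are illustrative):

-- One partition read, already ordered by the clustering key:
-- no joins, no fan-out across index tables.
SELECT post_date, author, post_title
FROM user_stream
WHERE username = 'rsvihla'
LIMIT 50;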
So about now you’re probably screaming bloody murder about how much work this is on updates, and that an update to a post with 1,000 followers will result in 1,000 writes to the database. But I say: so what? Cassandra is extremely write optimized, and while there is a level at which this modeling may become expensive and require other concessions, it will get you light years farther than the normalized approach, where the cost of reads will drag you down long before writing that much will (remember that most workloads read far more than they write). But what about consistency? What if my update process only writes the post to 990 followers and then fails? Do I need batch processes to do consistency checks later?

Consistency through BATCH.

Cassandra offers the BATCH keyword. Logged batches are atomic: either every statement in the batch eventually takes effect or none of them do. Before the batch executes, a copy is written to a batchlog on other nodes, so if the coordinator or the client falls down mid-update (or both do), those nodes can pick the batch up and finish applying it.
Assume a blog model where I want to display posts by tag and by username. I can update both tables every time a post_title changes in one go, assuming I have the full post information at hand, which is why the save path for posts is the perfect place for this to live.
[Figure: post_modeling_example]
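A plausible shape for the two tables, inferred from the batch statement below, is something like this (the exact columns of the original model may differ):

-- Inferred for illustration from the UPDATE statements that follow.
CREATE TABLE posts_by_username (
    username text,
    post_id uuid,
    post_title text,
    PRIMARY KEY (username, post_id)
);

CREATE TABLE posts_by_tag (
    tag text,
    post_id uuid,
    post_title text,
    PRIMARY KEY (tag, post_id)
);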
BEGIN BATCH
  UPDATE posts_by_username SET post_title = 'Cassandra Consistency' WHERE username = 'rsvihla' AND post_id = 5f148a02-ccec-4c98-944d-6431cd86ad5c;
  UPDATE posts_by_tag SET post_title = 'Cassandra Consistency' WHERE tag = 'Scaling' AND post_id = 5f148a02-ccec-4c98-944d-6431cd86ad5c;
  UPDATE posts_by_tag SET post_title = 'Cassandra Consistency' WHERE tag = 'Cassandra' AND post_id = 5f148a02-ccec-4c98-944d-6431cd86ad5c;
APPLY BATCH;
There is a practical size limit on batches, so once you have to update more than about 100 rows at a time you’ll want to consider batches of batches. You lose the easy consistency of one atomic batch across the whole update, but you still get it within each individual batch.

Summary



This was a quick tour of data modeling at scale with Cassandra. There are many more use cases and variants of this, but that’s the basic idea.

How to become a polyglot in 5 hard steps.

In today’s world of programming languages, where many languages are better at certain tasks than others, you’ll find it useful to learn multiple languages over the course of your career (as well as to keep your skill set current).
Here are some tips I’ve had to learn the hard way:

Step 1: Your second language should be similar to your first

Your brain will confuse a lot of language design decisions with computational necessities. If your second language is too different, the warts of your first language will show more, and the new syntax will make your head explode.

Step 2: Compare and contrast common task-based libraries to get the gist of the differences.

Namely: ORMs, web frameworks, unit testing libraries, XML reading/writing, CSV reading/writing, HTTP/REST clients, templating languages, email sending, and I’m sure a few more I’m forgetting. Doing this will not only teach you language-specific idioms quickly, it’ll also give your brain a chance to see the similarities and what bits of data are really necessary to do task X or Y.

Step 3: Get involved with the community

Get on mailing lists or, better still, go to user groups in your area. See what the programmers in that community are obsessed with (I’m looking at you, Python, and your giant PEP 8 discussions about style), and ask foolish questions. Learning a language is a lot about fitting into a community. You may be determined to bring in some concepts from your mother language, but first learn to treat your new language as a second mother tongue.

Step 4: Write several simplistic projects that replace the common big frameworks.

This is an extension of the last couple of steps. Write a unit test library, an ORM, a web framework, a REST library, etc. Make them simple enough to just barely work, but focus heavily on what you think a good client API is. They will suck; have someone who’s good at that language tell you why they suck.

Step 5: Learn a third language and a fourth, fifth…

The third should be totally, earth-shatteringly different, and ideally solve some task for you that is hard in your other languages (this is a great time to learn functional programming). You’ll learn a lot more about programming in general this way. Repeat steps 2 through 4.