First Impressions from NoSQL Live

Today I drove up to Boston for the day to attend NoSQL Live. My experience with NoSQL so far has been limited to three things: what we've built in-house at Disney and ESPN over the past decade to solve our scaling issues; more recently, ESPN's use of WebSphere eXtreme Scale; and, latest of all, my own experimentation with HBase, which hasn't gotten much further than setting up a four-node cluster. I've read a little about Cassandra, memcached, and Tokyo Cabinet, and that's about it. So before the sandman wipes away most of my first impressions of the technologies discussed today, I wanted to record my thoughts for posterity or, at the very least, tomorrow.


Cassandra
Cassandra seems to be the hottest NoSQL solution this month, with press about both Twitter and Digg running implementations. My impression: I'm wary of "eventual consistency". I don't feel I understand the risks and ramifications well enough to design a system properly. When Jonathan Ellis of Rackspace Cloud mentioned that Digg needed to implement Zookeeper-based locking on top of Cassandra so that diggs get recorded correctly, I realized how poorly I understand eventual consistency and how risky it could be. But my impression of Cassandra isn't all negative: it definitely seems to have less baggage than HBase by not being built on top of HDFS. I'll get into what that means a little later.
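To make the worry concrete, here is a toy sketch of how a read-modify-write counter can lose an update when two replicas reconcile by "last write wins", and how serializing the updates avoids it. Everything here is an assumption for illustration: the `Replica` class, the `sync` function and the in-process `threading.Lock` (standing in for a distributed lock) are hypothetical, and none of it is Cassandra's API or Digg's actual implementation.

```python
import threading


class Replica:
    """Hypothetical replica holding one timestamped value (last write wins)."""

    def __init__(self):
        self.value, self.timestamp = 0, 0

    def read(self):
        return self.value

    def write(self, value, timestamp):
        # Keep a write only if it is newer than what we already have.
        if timestamp > self.timestamp:
            self.value, self.timestamp = value, timestamp


def sync(a, b):
    """Exchange state so both replicas converge on the newest write."""
    a.write(b.value, b.timestamp)
    b.write(a.value, a.timestamp)


def unsafe_diggs():
    r1, r2 = Replica(), Replica()
    # Two users digg the same story at once, each through a different replica.
    seen_by_1 = r1.read()
    seen_by_2 = r2.read()
    r1.write(seen_by_1 + 1, timestamp=1)
    r2.write(seen_by_2 + 1, timestamp=2)
    sync(r1, r2)  # replicas converge, but on a count that missed one digg
    return r1.read(), r2.read()


def locked_diggs():
    r1, r2 = Replica(), Replica()
    lock = threading.Lock()  # stand-in for a distributed lock service
    clock = 0
    for replica in (r1, r2):
        with lock:
            sync(r1, r2)  # see the latest count before incrementing it
            clock += 1
            replica.write(replica.read() + 1, timestamp=clock)
    sync(r1, r2)
    return r1.read(), r2.read()


print("without locking:", unsafe_diggs())
print("with locking:   ", locked_diggs())
```

In the unlocked case both replicas agree in the end, which is all "eventual consistency" promises in this toy model; the agreed value is simply wrong for a counter, because one increment was overwritten rather than merged.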

Memcached
Unfortunately, the speaker who 'represented' memcached gave off a vibe that really turned me off the product. I know that's incredibly shallow, but these are first impressions after all, not perfectly evaluated impressions. Mark Atwood sat on the first panel of the day, "Scaling with NoSQL", and his whole attitude seemed to say "memcached is all you'll ever need and these guys next to me are just overdesigning hacks". His answers were short and his tone was quite condescending, even when addressing audience questions. Not a very good first impression of him. Luckily, today wasn't my first impression of memcached, as I was pointed in its direction just last week by a Disney colleague. My research before today has me intrigued about using it as a replacement for ehcache as the second-level cache provider for Hibernate, which we use as an ORM in one system at ESPN right now.

Document Oriented Databases (Riak, CouchDB, MongoDB)
Wow, this is a subgroup of NoSQL technology that I had heard of in passing, but I was really unaware of what problem it was trying to solve. Riak had the best answers for scaling and operational ease. With homogeneous nodes and consistent hashing, Riak promises that adding and removing nodes is seamless. CouchDB and MongoDB sounded like a 'me too' answer, so I'm interested to find out what that really means for each or, better yet, what it doesn't mean. But the concept of document-oriented databases meshes really well with ESPN's current fantasy user database. Our fantasy user profiles are stored in a traditional RDBMS as serialized maps of maps, one row for each user. Since each profile is serialized to a BLOB column, it's completely opaque to reporting and analytics. To keep that model but have vendor support for divining information from it and having transparency into it sounds exciting. I really need to look into these. Riak definitely won this round of first impressions.
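For anyone else new to the consistent hashing idea mentioned above, here is a minimal sketch of the general technique, not Riak's actual implementation. The `HashRing` class, the node names, the choice of MD5 and the `vnodes` parameter are all assumptions made for illustration. Each node is placed at many points on a ring, and a key belongs to the first node point found clockwise from the key's hash.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Map a string to a point on the ring (MD5 chosen only for illustration)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Hypothetical consistent-hash ring with virtual nodes."""

    def __init__(self, nodes=(), vnodes=64):
        self.vnodes = vnodes
        self._points = []  # sorted ring positions
        self._owner = {}   # ring position -> node name
        for node in nodes:
            self.add(node)

    def add(self, node):
        # Place the node at several points so load spreads evenly.
        for i in range(self.vnodes):
            point = _hash(f"{node}#{i}")
            self._owner[point] = node
            bisect.insort(self._points, point)

    def remove(self, node):
        self._points = [p for p in self._points if self._owner[p] != node]
        self._owner = {p: n for p, n in self._owner.items() if n != node}

    def node_for(self, key):
        if not self._points:
            raise LookupError("ring is empty")
        # First ring position clockwise from the key, wrapping around at the end.
        idx = bisect.bisect_right(self._points, _hash(key)) % len(self._points)
        return self._owner[self._points[idx]]


keys = [f"user-{i}" for i in range(1000)]
ring = HashRing(["node-a", "node-b", "node-c"])
before = {k: ring.node_for(k) for k in keys}

ring.add("node-d")
after = {k: ring.node_for(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
print(f"{len(moved)} of {len(keys)} keys moved, all of them to node-d")
```

The property that matters operationally is that adding a node only pulls keys onto the new node; no key is shuffled between the existing nodes. That is a plausible reading of why adding and removing nodes can be made to look seamless.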

Tokyo Cabinet
This was a technology a colleague referred me to, and I read through its site last week. I was far from impressed then, since it seems much too low-level for my taste, similar to my impressions of Carbonado, which we use at ESPN. The lightning talk by Flinn Mueller got me a little more interested. He seems to be doing interesting things, but from an analytics and reporting perspective. He was vague on how he loads the data from his primary store and what the scale of the data is, so my first impression: it's a toy. I'm sure that's an unfair characterization, but I'm not trying to be fair tonight. But honestly, Tokyo Cabinet makes no bones about punting on horizontal scaling, which is the deal breaker for me.

Hypertable
I looked at Hypertable (as in, read their website) about 18 months ago at the suggestion of a colleague when we were discussing HBase. This conference didn't change my opinion, which is "It's HBase but written in C++". It doesn't seem to bring anything else to the table, which to me is a blocker. JVM implementations are available for all the operating systems I use, so I don't like the idea of needing to find the right binary to download for a given box. When it comes to Java vs. C++, I choose Java, but I'm also extremely biased as I've been a Java developer nearly my entire career.

Full-stack JavaScript
This was my favorite of the lightning talks, and possibly my favorite of the conference. It felt a little tangential to the NoSQL topic, mainly because Jim Wilson covered more than just data storage. The idea: what if you could use JavaScript on your server and in your client, and use JSON both for talking between the layers and as the storage format? Crazy, right? I say brilliant. His few slides were mildly embarrassing: they dissected each of today's popular stacks by how many languages you need to learn (Java, XML, etc.) as well as the various impedance mismatches between layers (ORM, object marshaling to JSON or XML, etc.). "ORM is an antipattern" was an enlightening take on something I've accepted as necessary. Full-stack JavaScript is something I'll be lusting after for a long time, especially since he made it sound so attainable with node.js, Rhino and MongoDB. As soon as his slides are online I'll be linking to them as well as passing them around the office.

HBase
Well, I saved HBase for last. It's the one I've had the most experience with, though that experience can still be measured in hours. As I hinted at earlier, this conference gave me the first impression that HDFS is a weight around the neck of HBase. I was surprised to get that feeling from the room, since my impression had been purely positive until now. HDFS is also getting a lot of flak for the 'single point of failure' problem associated with the Name Node in its current architecture. Apparently performance is a dog, since it was "only" designed to be highly distributed, with no promise of when you'll get your data. This burden seems to carry over to HBase. But after talking to Ryan Rawson one-on-one at the end of the HBase lab, it's clear he is of the strong opinion that it's getting a bad rap. He also makes very convincing arguments about the scale of what HBase is currently doing in real production environments vs. competitors like Cassandra. It's very persuasive, and you can read more of the details in a very active thread I kicked off on the HBase user group earlier this week.

Conclusion
HBase is still the front-runner among my personal candidates for a NoSQL option for ESPN, as it has been for a long time. Cassandra's design choice of eventual consistency is a little scary to me because I don't yet know how to design for it, not because it is an inherently bad choice. Document-oriented databases just made a big blip on my radar. Memcached is interesting if I want to stick with a traditional ORM-based architecture. Tokyo Cabinet and Hypertable are all but off my radar. And the lusty vixen of them all is a full-stack JavaScript architecture.

Disclaimer: Though I mention my employer ESPN in this post, these are my own personal opinions and don't represent the opinions of the company. The final decision on this stuff is "above my pay-grade" as they say.

Comments

Anonymous said…
The low level characterization of Tokyo Cabinet is fair but don't let that shy you away from looking at it seriously. I wouldn't consider it a toy. It's a real tool that helps us avoid classic locking, join and speed issues associated with certain types of data problems. I don't (at the moment) use it as our primary store only because to make the switch would be an entire rewrite of a legacy system.
Flinn, small world, right? Hope you didn't take my tongue-in-cheek commentary the wrong way! :-) But the low-level API isn't a deal-breaker for me; it's the punting on horizontal scaling that is. Ideally I don't want to have to reinvent the 'scaling' wheel yet again. I want a performant online store that also allows ad-hoc reporting queries against it while dynamically and transparently scaling linearly. Call me a dreamer! :-)
Unknown said…
Brian, I met you briefly at NoSQL Live. My buddy's name at ESPN is Jeff Mahoney, but it sounds like you guys have a pretty big campus, so it's not likely you'd ever see him.

Anyway, I was scribbling on my whiteboard for 4 days straight this past summer trying to wrap my head around eventual consistency, multiple entity versions, and vector clocks.

These articles and talks by the CTO at Amazon were helpful:

http://www.allthingsdistributed.com/2008/12/eventually_consistent.html

http://www.infoq.com/presentations/availability-consistency

And then, somewhat unrelated, I really liked this article about the Big Table type of architectures.

http://highscalability.com/how-i-learned-stop-worrying-and-love-using-lot-disk-space-scale

In the end, we decided to prototype and build on both the eventual consistency architecture for availability and the Big Table architecture for ACID requirements, trading away consistency and availability respectively. We're planning to expose both of these systems to our middleware through a unified interface. Data will be handled by our own db management system which will choose the appropriate data model architecture based on the specifications given to it by the client code.

Incidentally, we are not directly running any of the systems featured at NoSQL Live. We are indirectly using them through platform services at Google, Amazon, and Rackspace.

We'll see how it goes, but prototyping and testing has shown really great results. At the very least, it's been a lot of fun.

Also, we are implementing a large portion of our stack in JavaScript, and I could not be happier with it. It feels strange to publicly admit how much I like JavaScript, but it is really a great overall tool.

We use Narwhal (http://github.com/280north/narwhal) and try to follow the new CommonJS specs as much as possible. (http://commonjs.org/)

Kris Walker
fireworksfactory.blogspot.com
Unknown said…
Good article, Brian. I went to one overview session on NoSQL at a conference last year, but it also included things like object-oriented DBs rather than ORM software. Definitely interesting stuff, especially in how many big-name companies have abandoned SQL for alternative solutions (Amazon, Google, Facebook if I'm not mistaken, and the list goes on).
