Big Data Workshop, April 23, 2010
Title: Transitioning to Cloud Datastores (NoSQL)
Convener: Nika Jones / Fred Sauer
Notes-taker: Rob Brackett
Are there public benchmarks for cloud storage?
* No, it’s all too different to compare well
* Yahoo has a key/value benchmark that came out recently
* Paper: http://research.yahoo.com/files/ycsb.pdf
* Summary: http://www.slideshare.net/kevinhan/yahoo-cloud-serving-benchmark
Most people have a SQL background → people need guidance with these new solutions
Fundamental differences between NoSQL and SQL?
* Distributedness? → Debatable. Oracle rep argues SQL has nothing to do with implementation. Joins can be distributed.
* ACID compliance? → Absolutely. This has to be given up in order to be distributed, which almost all of these new solutions are.
New tools for new problems — you can do different things with NoSQL tools.
So where is AppEngine better, for example, than Oracle?
* Elasticity (non-regular usage)
* What about basic storage/retrieval… how is it better than, say, Oracle? Aren’t the NoSQL guys just reinventing all the same stuff [eventually]?
* Query response time is independent of data set size
Guy from Craigslist talked about how they faced a lot of issues trying to scale with MySQL
* Split databases up by use cases (i.e. the different sections of the site: jobs, for sale, etc.)
* Problem came when trying to identify spam
* Queries became too big and spanned too many data sources to work in real time
* Used MySQL as a key/value store to track each heuristic and decide at run time when those needed to be consulted
* Looked at Memcached and then Redis as solutions
* Also tried a product called “Despam,” which uses MySQL, but mostly in a key/value style, as in the home-grown solution above.
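The key/value approach to spam heuristics described above might look roughly like the sketch below. It is not Craigslist's actual code: it assumes a single two-column table and one row per (posting, heuristic) pair. Python's built-in sqlite3 stands in for MySQL so the example runs as-is. The table name, key format, heuristic names, and scores are all hypothetical.

```python
import sqlite3

# In-memory SQLite stands in for MySQL here (assumption, to keep the sketch runnable).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE kv (k TEXT PRIMARY KEY, v TEXT NOT NULL)")


def put(key, value):
    """Upsert one key/value pair; no joins and no secondary indexes."""
    conn.execute("INSERT OR REPLACE INTO kv (k, v) VALUES (?, ?)", (key, value))


def get(key, default=None):
    """Fetch one value by primary key, or return the default if absent."""
    row = conn.execute("SELECT v FROM kv WHERE k = ?", (key,)).fetchone()
    return row[0] if row else default


def spam_score(posting_id, heuristics):
    """Consult only the heuristics chosen at run time and sum their scores."""
    total = 0.0
    for name in heuristics:
        # Hypothetical key format: "<posting id>:<heuristic name>".
        total += float(get(f"{posting_id}:{name}", "0"))
    return total


# Hypothetical heuristic results recorded for one posting.
put("post42:duplicate_text", "0.6")
put("post42:suspicious_links", "0.3")

# "new_account" was never recorded, so it contributes nothing.
score = spam_score("post42", ["duplicate_text", "suspicious_links", "new_account"])
```

Every lookup is a single primary-key read, so the cost of scoring a posting depends on how many heuristics are consulted rather than on how much data the table holds.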
With SQL solutions, the general answer to scaling tends to be sharding, but you still have to manage those shards yourself, replicate for redundancy, etc.
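To make "managing shards yourself" concrete, here is a minimal sketch of application-level sharding. It assumes simple hash-modulo routing and in-process dictionaries standing in for database servers. The class and method names are hypothetical and do not come from any product discussed in the session.

```python
import hashlib


class ShardedStore:
    """Toy application-level sharding: the caller owns routing and replication."""

    def __init__(self, num_shards, replicas=2):
        # Each shard is a list of replica dicts standing in for database servers.
        self.shards = [[{} for _ in range(replicas)] for _ in range(num_shards)]

    def _shard_for(self, key):
        # Use a stable hash so a key always routes to the same shard.
        digest = hashlib.md5(key.encode("utf-8")).hexdigest()
        return self.shards[int(digest, 16) % len(self.shards)]

    def put(self, key, value):
        # Write to every replica of the owning shard for redundancy.
        for replica in self._shard_for(key):
            replica[key] = value

    def get(self, key):
        # Read from the first replica that holds the key.
        for replica in self._shard_for(key):
            if key in replica:
                return replica[key]
        return None


store = ShardedStore(num_shards=4)
store.put("listing:1001", {"section": "jobs"})
store.put("listing:1002", {"section": "for sale"})
```

With modulo routing like this, changing `num_shards` remaps most keys, so resharding, failover, and cross-shard queries all become the application's problem.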
NoSQL can be viewed as fundamentally sharded, which makes that approach elementary
* You go the opposite direction and figure out how to put all the little pieces together, instead of breaking them down and splitting them up
* Sharding is, then, no longer a hard thing
* Isn’t there a lot of mental effort/human time required to rethink for this approach?
* Yes, but only initially—it’s a one-time, up-front cost
* And only because people know SQL better right now (see early points)
* This is fine for small business; much harder for enterprise
* AppEngine is well proven for small companies and startups, but not widely used by enterprise
* Hard to choose a solution—all the NoSQL approaches are optimized for very different things and it can be hard to compare (see early points on benchmarks)
* Oracle guys did some tests. Were any technologies harder to code against?
* Amazon’s stuff was a bit more complex to get set up, but not necessarily harder
* AppEngine rep (Fred) thinks there need to be, and will be, standards for cloud/distributed storage. They will all standardize and become commodities.
* AppEngine is optimized for small reads and writes, not big batch jobs, analytics, etc.
* Serving these other needs is Google’s next big task with AppEngine
* Other NoSQL solutions are similarly broadening the range of solutions, e.g. Cassandra and HBase are rapidly approaching each other from opposite directions.
How can NoSQL approaches better handle ad-hoc queries?
* Map/Reduce is the go-to answer, but doesn’t scale well for maintaining a wide variety of queries. Craigslist staff couldn’t have handled programming thousands of different map/reduce jobs.
* Josh (Craigslist) built a complex system for joining across different data sources
* What about standard ETL tools?
* Seems lots of people don’t know about them (e.g. Craigslist didn’t at the time)
* Pig? (SQL-like interface to Hadoop)
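To illustrate why map/reduce gets unwieldy for ad-hoc queries, here is a single-process sketch of the pattern. It is not Hadoop or any real framework, and the posting fields and the two queries are hypothetical. Note that each new question needs its own mapper/reducer pair.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Run map, group the emitted values by key, then reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return {key: reducer(key, values) for key, values in groups.items()}


# Hypothetical postings; field names are illustrative only.
postings = [
    {"section": "jobs", "city": "sf", "price": 0},
    {"section": "for sale", "city": "sf", "price": 40},
    {"section": "for sale", "city": "nyc", "price": 60},
]


# Query 1: number of postings per section.
def count_mapper(posting):
    yield posting["section"], 1


def count_reducer(key, values):
    return sum(values)


# Query 2: average price per city. A different question needs new functions.
def price_mapper(posting):
    yield posting["city"], posting["price"]


def avg_reducer(key, values):
    return sum(values) / len(values)


counts = map_reduce(postings, count_mapper, count_reducer)
avg_price = map_reduce(postings, price_mapper, avg_reducer)
```

In SQL both queries would be one-line `GROUP BY` statements, which is roughly the gap that SQL-like layers over Hadoop aim to close.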