Session: Migrating From Small Data to Large – What Grows Well and How?

Big Data Workshop, April 23, 2010
Session 5B

Title: Migrating From Small Data to Large – What Grows Well and How?
Convener: Rob Brackett
Notes-taker: Ashley Frank

Notes:
The session was formed out of one attendee's desire to discuss options for migrating an analysis-heavy MySQL workload to Big Data technology.

The view was expressed, and supported, that some kind of standards need to evolve to provide interoperability between Big Data implementations, but it was acknowledged that, because there is little consistency in contracts and expectations, such standards would be tough to form.

Comparisons between Amazon’s EC2 and Google’s App Engine were discussed: App Engine supports Java and Python, while EC2 is a virtualized environment for running full operating systems.

Comparisons between Microsoft’s Azure and EC2 were discussed. Azure has two storage options: Azure Storage (Big Data style, with a REST interface) and Azure SQL Server.

For new projects, the recommendation was to start with the new technology rather than prototyping on old relational technology and migrating later.

Why use the new technology?
If you have elastic demand, pricing is better. If you will need to scale large, it is easy. It also solves some licensing and hardware headaches.

Other Thoughts:
Data needs to live near the applications that use it; when transfers are large, data transfer costs rise.

Personal Observations:
If you have ever tried to implement the data warehousing strategy for relational databases, where you denormalize your fact tables and horizontally partition them, you soon realize that you have broken your ability to join or use indexes the way you used to, and have pretty much abandoned the query optimizer. Your fact tables can only be queried one way: any analytics or other joining must happen after filtering on the only selective key that works (the leading column(s) of your partitions). You may end up with redundant data, not just in a single view but in multiple fact tables, to express different views.

The result is that you are moving toward this new approach we are calling Big Data, but with all the baggage of relational and none of the innovation of Big Data solutions. In the new Big Data world, however, it seems that dimension support and analysis are handled by ‘other’ tools or custom solutions. Relational vendors could document the path that large data requires in the evolution toward denormalizing and partitioning the fact table, and at some point provide the option to migrate the fact table to a Big Data technology, supplying the glue for interacting with the Big Data using relational vocabulary and dimensional support, as well as the other infrastructure for using Big Data new school.
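
As a rough illustration of the pattern described above, here is a minimal Python sketch with hypothetical table and column names, using the standard library’s sqlite3 as a stand-in for a relational database: each month’s facts live in their own physical table, every query must route on the partition key first, and cross-partition analytics get stitched together in application code rather than by the optimizer.

```python
import sqlite3

# Hypothetical, minimal sketch: a fact table horizontally partitioned by
# month. Each partition is a separate physical table, so the database's
# query optimizer can no longer plan joins or use indexes across them.
conn = sqlite3.connect(":memory:")
for month in ("2010_03", "2010_04"):
    conn.execute(f"CREATE TABLE sales_{month} (customer_id INTEGER, amount REAL)")
conn.execute("INSERT INTO sales_2010_03 VALUES (1, 10.0), (2, 25.0)")
conn.execute("INSERT INTO sales_2010_04 VALUES (1, 5.0)")

def query_partition(month, customer_id):
    """Every query must lead with the partition key (the month) just to
    find the right table -- the 'only selective key that works'."""
    cur = conn.execute(
        f"SELECT SUM(amount) FROM sales_{month} WHERE customer_id = ?",
        (customer_id,),
    )
    return cur.fetchone()[0]

# Cross-partition analytics are stitched together in application code,
# outside the optimizer -- effectively a hand-rolled map/reduce.
total = sum(query_partition(m, 1) or 0 for m in ("2010_03", "2010_04"))
print(total)  # 15.0
```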

Session: Transitioning to Cloud Datastores (No SQL)

Big Data Workshop, April 23, 2010
Session 3I

Title: Transitioning to Cloud Datastores (No SQL)
Convener: Nika Jones / Fred Sauer
Notes-taker: Rob Brackett

Notes:

Are there public benchmarks for cloud storage?
* No, it’s all too different to compare well
* Yahoo has a key/value benchmark that came out recently
* Paper: http://research.yahoo.com/files/ycsb.pdf
* Summary: http://www.slideshare.net/kevinhan/yahoo-cloud-serving-benchmark

Most people have a SQL background → people need guidance with these new solutions

Fundamental differences between NoSQL and SQL?
* Distributedness? → Debatable. Oracle rep argues SQL has nothing to do with implementation. Joins can be distributed.
* ACID compliance? → Absolutely. ACID has to be given up to be distributed, which almost all of these new solutions are.

New tools for new problems — you can do different things with NoSQL tools.

So where is AppEngine better, for example, than Oracle?
* Elasticity (non-regular usage)
* What about basic storage/retrieval… how is it better than, say, Oracle? Aren’t the NoSQL guys just reinventing all the same stuff [eventually]?
* Query response time is independent of data set size

A guy from Craigslist talked about the issues they faced trying to scale with MySQL:
* Split databases up by use cases (i.e. the different sections of the site: jobs, for sale, etc.)
* Problem came when trying to identify spam
* Queries became too big and spanned too many data sources to work in real time
* Used MySQL as a key/value store to track each heuristic and decide at run time when those needed to be consulted (see the sketch after this list)
* Looked at Memcached and then Redis as solutions
* Also tried a product called “Despam,” which uses MySQL, but mostly in a key/value style, as in the home-grown solution above.
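
The notes don’t include Craigslist’s actual schema, but a minimal sketch of the key/value style described above might look like this (hypothetical names; sqlite3 stands in for MySQL so the example is self-contained): one row per (posting, heuristic) key, no joins, consulted at run time.

```python
import sqlite3

# Hypothetical sketch of the key/value pattern: one row per
# (posting, heuristic) key, looked up directly rather than joined.
db = sqlite3.connect(":memory:")
db.execute(
    """CREATE TABLE spam_heuristics (
           posting_id INTEGER,
           heuristic  TEXT,
           score      REAL,
           PRIMARY KEY (posting_id, heuristic)
       )"""
)

def put(posting_id, heuristic, score):
    db.execute(
        "INSERT OR REPLACE INTO spam_heuristics VALUES (?, ?, ?)",
        (posting_id, heuristic, score),
    )

def get(posting_id, heuristic):
    row = db.execute(
        "SELECT score FROM spam_heuristics WHERE posting_id = ? AND heuristic = ?",
        (posting_id, heuristic),
    ).fetchone()
    return row[0] if row else None

put(42, "duplicate_body", 0.9)
put(42, "bad_phone_number", 0.2)

# At run time, consult only the heuristics relevant to this posting
# instead of running one giant query spanning many data sources.
is_spam = (get(42, "duplicate_body") or 0) > 0.8
print(is_spam)  # True
```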

With SQL solutions, the general answer to scaling tends to be sharding, but you still have to manage those shards yourself, replicate for redundancy, etc. (a sketch of this hand-rolled routing follows the list below).

NoSQL can be viewed as fundamentally sharded, which makes that approach elementary
* You go the opposite direction and figure out how to put all the little pieces together, instead of breaking them down and splitting them up
* Sharding is, then, no longer a hard thing
* Isn’t there a lot of mental effort/human time required to rethink for this approach?
* Yes, but only initially—it’s a one-time, up-front cost
* And only because people know SQL better right now (see early points)
* This is fine for small business; much harder for enterprise
* AppEngine is well proven for small companies and startups, but not widely used by enterprise
* Hard to choose a solution—all the NoSQL approaches are optimized for very different things and it can be hard to compare (see early points on benchmarks)
* Oracle guys did some tests. Were any technologies harder to code against?
* Amazon’s stuff was a bit more complex to get set up, but not necessarily harder
* AppEngine rep (Fred) thinks there needs to be and will be standards for cloud/distributed storage. They will all standardize and become commodities.
* AppEngine is optimized for small reads and writes, not big batch jobs, analytics, etc.
* Serving these other needs is Google’s next big task with AppEngine
* Other NoSQL solutions are similarly broadening the range of solutions, e.g. Cassandra and HBase are rapidly approaching each other from opposite directions.
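
As a rough sketch of the shard management that SQL deployments end up hand-rolling (hypothetical shard names; this uses naive hash-mod routing, whereas production systems usually prefer consistent hashing so that adding a shard doesn’t remap every key), the application, not the database, decides which shard owns each key and where its replicas live. A NoSQL store does this routing internally.

```python
import hashlib

# Minimal, hypothetical sketch of hand-rolled shard routing: the
# application decides which shard owns each key and where replicas live.
SHARDS = ["db-shard-0", "db-shard-1", "db-shard-2", "db-shard-3"]
REPLICAS = 2  # extra copies to keep for redundancy

def shard_for(key: str) -> str:
    """Hash the key to pick an owning shard (what a NoSQL store does
    internally on every read and write)."""
    digest = hashlib.md5(key.encode()).digest()
    return SHARDS[int.from_bytes(digest[:4], "big") % len(SHARDS)]

def replicas_for(key: str) -> list[str]:
    """The next REPLICAS shards in the ring hold redundant copies."""
    primary = SHARDS.index(shard_for(key))
    return [SHARDS[(primary + i) % len(SHARDS)] for i in range(1, REPLICAS + 1)]

key = "user:12345"
print(shard_for(key), replicas_for(key))
```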

How can NoSQL approaches better handle ad-hoc queries?
* Map/Reduce is the go-to answer, but doesn’t scale well for maintaining a wide variety of queries. Craigslist staff couldn’t have handled programming thousands of different map/reduce jobs (see the sketch after this list).
* Josh (Craigslist) built a complex system for joining across different data sources
* What about standard ETL tools?
* Seems lots of people don’t know about them (e.g. Craigslist didn’t at the time)
* PIG? (SQL-like interface to Hadoop)
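
As a toy illustration of the point above about ad-hoc queries (entirely in-process, with hypothetical data): even a simple “count postings per category”, a one-line GROUP BY in SQL, becomes a hand-written map/reduce job, and every new question needs another one.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical data: a few postings with a category field.
postings = [
    {"id": 1, "category": "jobs"},
    {"id": 2, "category": "for sale"},
    {"id": 3, "category": "jobs"},
]

def map_phase(record):
    # Emit (key, value) pairs -- here (category, 1).
    yield record["category"], 1

def reduce_phase(key, values):
    # Combine all values seen for one key.
    return key, sum(values)

# Shuffle: group the mapped pairs by key before reducing.
pairs = sorted((kv for rec in postings for kv in map_phase(rec)), key=itemgetter(0))
results = [reduce_phase(k, (v for _, v in g)) for k, g in groupby(pairs, key=itemgetter(0))]
print(results)  # [('for sale', 1), ('jobs', 2)]
```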
