Business Intelligence SIG: Analytics: SQL or NoSQL

6:30 PM – 9:00 PM September 21, 2010
SAP – Southern Cross Room
3410 Hillview Avenue
Palo Alto, CA 94304
Title: Analytics: SQL or NoSQL

Change is in the air. The relational database that has long been the mainstay of data warehouses has a new challenger. The NoSQL movement has implemented new systems for storing and processing data, mostly based on the map data structure. On the surface, the difference between a map and a relation is just the difference between a key-value pair (a 2-tuple) and an n-tuple. In practice, we use them in different ways, as we will explore.
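
As a rough illustration of that distinction (a sketch only; the sales data and all names are invented here and do not come from the talk), a map holds key/value pairs and is read by key, while a relation is a set of n-tuples that can be filtered or grouped on any column:

```python
from collections import defaultdict

# Hypothetical data, invented for illustration only.
# A map: each entry is a (key, value) pair, and lookup is by key.
sales_by_order = {
    "order-1": {"region": "west", "amount": 120},
    "order-2": {"region": "east", "amount": 80},
    "order-3": {"region": "west", "amount": 50},
}

# A relation: a set of n-tuples sharing one schema (order_id, region, amount).
sales_relation = {
    ("order-1", "west", 120),
    ("order-2", "east", 80),
    ("order-3", "west", 50),
}


def total_by_region(relation):
    """Roughly: SELECT region, SUM(amount) FROM sales GROUP BY region."""
    totals = defaultdict(int)
    for _order_id, region, amount in relation:
        totals[region] += amount
    return dict(totals)


lookup = sales_by_order["order-2"]["amount"]   # direct access by key
totals = total_by_region(sales_relation)       # aggregate over any column
```

The map answers "what is stored under this key?" in one step; the relation makes it natural to ask questions that cut across all rows.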

The presentation starts by looking at the relational database and SQL. We look at the ideas on which relational databases are based to understand why we use them, and go through an example of how they are used for analytics. Then we will look at tuple-store-based NoSQL systems and again understand how they are used and what they are used for. Finally, we will compare the two, see that each has its place in an analytics architecture, and develop the criteria for deciding which one to use and where.

Speaker: Richard Taylor

Richard has been a chair of the SDForum Business Intelligence SIG for 10 years, and has spoken to the SIG on BI-related topics a number of times.

Over the last 20 years, he has been involved in building database systems on distributed and multiprocessor hardware at DEC, Data-Cache, RedBrick Systems, Informix, IBM and currently at SenSage Inc. Prior to that he worked on research projects in parallel and distributed computing.

Richard has a PhD in Computer Science from Cambridge University and a BSc in Computer Science from Manchester University in the UK.

Session: Bigger Than The Cloud

Convener: J Chris Anderson http://jchrisa.net
Notes-taker: Everyone

We started by introducing ourselves. We had a lot of people from the tech side (from back-end analytics to web developers). We also had a few analysts and business / VC folks. Good crowd.

Then I handed everyone a sheet of yellow paper and a sheet of green paper. The instructions were to write drawbacks or warning signs about the cloud on the yellow paper, and good things, or “the promise of the cloud” / what users want from the cloud, on the green paper.

Here are the notes everyone wrote:

The main limits of the cloud were:

* Few players owning everything
* User trust of cloud operators
* User lock-in to particular services
* Network unreliability / latency
* Non-portability of applications on a particular platform
* No single identity model

The biggest promises of the cloud were:

* Easy and fun communication
* Standard interface (the web browser)
* Scalable programming models
* Efficient resource usage (green)
* Users’ data is safe from loss

Then I presented a way of thinking about the cloud that could turn our expectations upside down. Imagine a device that allows you to carry an entire copy of the internet around with you. Then it doesn’t matter if your connectivity is interrupted (you wouldn’t notice). You could also intentionally turn off your connection, so you can focus on what you already know, rather than being constantly interrupted with new information.

The discussion turned to filtering mechanisms for this cloud. Since you can’t actually carry the whole internet in your pocket, how do you know which subset to carry?

There is also a relevant scaling law: the amount of data people are using on a daily basis is growing faster than connection speeds. This means that you’ve got to have some mechanism to work with data in an offline manner, because you can’t always shove it over the wire in an on-demand way.

Session: Graph Databases

Big Data Workshop, April 23, 2010
Graph Databases – 3A

Title: Graph Databases
Convener: Johannes Ernst
Notes-taker: Dylan Clendenis

Notes:

Different representations:
Relational

  • tables
  • rows, columns
  • SQL

Hierarchical

  • parent/child

Graph

  • nodes, edges, directions

Graph Primitives

  • basic manipulation: set/get properties, “bless/unbless” relationship between nodes
  • traversal: given a node, get set of neighbor nodes, subset by type or property value

It is important to make the conceptual shift from “querying” to traversing.
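
The primitives listed above can be sketched as a tiny in-memory property graph. This is only an illustration: the `Graph` class, its method names, and the sample nodes are assumptions made here, not the API of any product discussed in the session.

```python
class Graph:
    """Tiny in-memory property graph; an illustrative sketch only."""

    def __init__(self):
        self.props = {}   # node id -> {property name: value}
        self.edges = {}   # node id -> {(relationship type, target node id)}

    def add_node(self, node, **props):
        self.props[node] = dict(props)
        self.edges.setdefault(node, set())

    def set_property(self, node, key, value):
        self.props[node][key] = value

    def get_property(self, node, key):
        return self.props[node].get(key)

    def bless(self, source, rel_type, target):
        """Create a directed relationship source -> target."""
        self.edges[source].add((rel_type, target))

    def unbless(self, source, rel_type, target):
        """Remove the relationship if it exists."""
        self.edges[source].discard((rel_type, target))

    def neighbors(self, node, rel_type=None, **where):
        """Traverse one hop, optionally subsetting by type or property value."""
        result = set()
        for edge_type, target in self.edges[node]:
            if rel_type is not None and edge_type != rel_type:
                continue
            if all(self.props[target].get(k) == v for k, v in where.items()):
                result.add(target)
        return result


# Hypothetical sample data.
g = Graph()
g.add_node("alice", kind="person")
g.add_node("bob", kind="person")
g.add_node("acme", kind="company")
g.bless("alice", "knows", "bob")
g.bless("alice", "works_at", "acme")
```

Note that `neighbors` starts from a known node and follows edges outward, rather than scanning a whole table for matching rows; that is the shift from querying to traversing.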

Session: Data Processing Model Besides Map/Reduce

Big Data Workshop, April 23, 2010
Session 5A

Title: Data Processing Model Besides M/R
Convener: Stanley Poon
Notes-taker: Stanley
Notes:

  • M/R is relatively new
  • Limited in the ways a problem can be partitioned
  • Latency is high
  • Not enough parallelism: map has to finish before reduce can start
  • IBM has a tool to reduce many existing algorithms to map/reduce. Project name unknown at the time.
  • CloudComp 2009 has some papers comparing map/reduce with MPI
  • Graph processing as a more general model, e.g. Pregel from Google
  • An example using M/R to process video streams: streams are processed by mappers and fed to reducers. Each reducer puts the frames into sequence. The frame boundary is a natural demarcation for breaking down a stream.
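
The video-stream example can be sketched in plain Python. This is a single-process illustration under stated assumptions, not Hadoop code: the record layout, the function names, and the `upper()` call standing in for per-frame processing are all invented here.

```python
from collections import defaultdict

# Hypothetical input: (stream_id, frame_number, payload) records, out of order.
frames = [
    ("cam-1", 2, "c1-f2"),
    ("cam-2", 1, "c2-f1"),
    ("cam-1", 1, "c1-f1"),
    ("cam-2", 2, "c2-f2"),
]


def map_frame(record):
    """Mapper: process one frame independently and key it by its stream."""
    stream_id, frame_number, payload = record
    processed = payload.upper()  # stand-in for real per-frame work
    return stream_id, (frame_number, processed)


def reduce_stream(stream_id, values):
    """Reducer: put one stream's frames back into sequence."""
    return stream_id, [result for _number, result in sorted(values)]


def run_job(records):
    # Shuffle: group mapper output by key. Reducers cannot start until every
    # mapper has finished, which is the barrier noted in the list above.
    grouped = defaultdict(list)
    for key, value in map(map_frame, records):
        grouped[key].append(value)
    return dict(reduce_stream(k, v) for k, v in grouped.items())


result = run_job(frames)
```

Because each frame is self-contained, the mappers need no shared state, and only the reducer has to know about ordering.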

Session: Scalable Search

Big Data Workshop, April 23, 2010
Session 3C

Convener: David Hardtke
Notes-taker: David Hardtke

Notes:
Our main topic was the tradeoff between search latency, data latency and cash when it comes to scalable search solutions. Search latency is the time to execute a query. Data latency is the time to make new data searchable. Cash is the cost of a solution. The goal was to identify scalable search techniques for big data that do not involve caching of the search index.

Session: New Apps Enabled by Scalable Database

Big Data Workshop, April 23, 2010
Session 1G

Title: New Apps Enabled by Scalable Database
Convener: Doug Judd / Andy Lee
Notes-taker(s): Matthew Gonzales

Notes:

  • Social Apps are most popular with those using app engine
  • Observation – Loud in room G session 1 with construction noise in the background
  • Geographical location based games are enabled and popular with Scalable Databases
  • Gov’t, Medical…want decision engine and not as interested in storing data
  • What does it mean to say “big data”? What size is considered big?
  • Observation – Hard to know who has what experience while the discussion is under way. Should have started with introductions.
  • Pluto is no longer a planet

Session: Migrating From Small Data to Large – What Grows Well and How?

Big Data Workshop, April 23, 2010
Session 5B

Title: Migrating From Small Data to Large – What Grows Well and How?
Convener: Rob Brackett
Notes-taker: Ashley Frank

Notes:
The session was formed out of one participant’s desire to discuss possibilities for migrating from MySQL with heavy analysis to Big Data technology.

The view was expressed, and supported, that some kind of standards need to evolve to allow interoperability between Big Data implementations. It was also recognized that, because there is not much consistency in the contracts and expectations across implementations, such standards would be tough to form.

Amazon’s EC2 and Google’s App Engine were compared: App Engine supports Java and Python, while EC2 is a virtualized environment for full operating systems.

Microsoft’s Azure and EC2 were also compared. Azure has two storage options: Azure Storage (big data style) with a REST interface, and Azure SQL Server.

When starting a new project, it is recommended to begin with the new technology rather than prototyping on old relational technology and migrating.

Why use the new technology?
If you have elastic demand, pricing is better. If you will need to scale large, it is easy. It solves some licensing and hardware headaches.

Other Thoughts:
Data needs to be near the applications that use it if the data transfer is large or data transfer costs rise.

Personal Observations:
If you have ever tried to implement the data warehousing strategy for relational databases where you denormalize your fact tables and horizontally partition them, you soon realize that you have broken your ability to join or use indexes in the way you used to, and have pretty much abandoned the query optimizer. Your fact tables can only be queried one way, and any analytics or other joining must be done after the only selective key that works (the leading column(s) of your partitions) is used. You may end up with redundant data not just in a single view but in multiple fact tables that express different views. The result is that you are moving toward this new approach we are calling Big Data with all the baggage of relational but none of the innovation of Big Data solutions.

However, it seems that in the new Big Data world, dimension support and analysis are done by ‘other’ tools or custom solutions. Relational vendors could document the path that large data takes as it evolves toward denormalization and partitioning of the fact table and, at some point, provide the option to migrate the fact table to a Big Data technology, along with the glue for interacting with the Big Data using relational vocabulary and dimensional support, as well as other infrastructure for using Big Data the new-school way.
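
The "can only be queried one way" point in these observations can be illustrated with a small sketch. Everything here is hypothetical (the rows, the column layout, and the function names); it only shows why a query on the partition key touches one partition while any other predicate has to scan them all.

```python
from collections import defaultdict

# Hypothetical denormalized fact rows: (sale_date, store, product, amount).
rows = [
    ("2010-04-01", "palo-alto", "widget", 10),
    ("2010-04-01", "san-jose", "gadget", 25),
    ("2010-04-02", "palo-alto", "gadget", 5),
]

# Horizontal partitioning on the leading column (sale_date).
partitions = defaultdict(list)
for row in rows:
    partitions[row[0]].append(row)


def sales_on(date):
    """Uses the partition key: touches exactly one partition."""
    return partitions.get(date, []), 1


def sales_for_store(store):
    """No partition key in the predicate: every partition is scanned."""
    hits = [r for part in partitions.values() for r in part if r[1] == store]
    return hits, len(partitions)


by_date, touched_by_date = sales_on("2010-04-01")
by_store, touched_by_store = sales_for_store("palo-alto")
```

Supporting the store query efficiently would mean a second copy of the data partitioned by store, which is the redundancy across multiple fact tables mentioned above.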

Session: Is the File System Dead? if so what replaces it?

Big Data Workshop, April 23, 2010
Session 3D

Title: Is the File System Dead? if so what replaces it?
Convener: Rich Ramos
Notes-taker: Rich Ramos

In our increasingly distributed, mobile, BIG DATA world, has the traditional file system outlived its usefulness?

File System use cases:
1) Users’ storage of fixed content & unstructured data
2) Application data store
3) Disk block management

Does the file system work well for any of those use cases anymore, or are there better ways?

1) Users:
The virtual concept of “Folders/Directories” grew out of the physical world equivalents; however, in the brave new world of digital data, folders are no longer very useful. Things like “tags”, “labels” and search are better for these purposes.

2) Applications:
Applications that are NOT DBMSes would prefer to use relational databases and, increasingly, key-value stores. DBMSes/OSes/VMMs would rather manage storage themselves than go through a file system.

3) Disk block management:
Without the first two, the third (disk block management) goes away.

Summary: of course file systems don’t go away anytime soon, but you can see how their usefulness might greatly decrease over time.



Session: Transitioning to Cloud Datastores (No SQL)

Big Data Workshop, April 23, 2010
Session 3I

Title: Transitioning to Cloud Datastores (No SQL)
Convener: Nika Jones / Fred Sauer
Notes-taker: Rob Brackett

Notes:

Are there public benchmarks for cloud storage?
* No, it’s all too different to compare well
* Yahoo has a key/value benchmark that came out recently
* Paper: http://research.yahoo.com/files/ycsb.pdf
* Summary: http://www.slideshare.net/kevinhan/yahoo-cloud-serving-benchmark

Most people have a SQL background → people need guidance with these new solutions

Fundamental differences between NoSQL and SQL?
* Distributedness? → Debatable. Oracle rep argues SQL has nothing to do with implementation. Joins can be distributed.
* ACID compliance? → Absolutely. This has to be given up to be distributed, which almost all these new solutions are.

New tools for new problems — you can do different things with NoSQL tools.

So where is AppEngine better, for example, than Oracle?
* Elasticity (non-regular usage)
* What about basic storage/retrieval… how is it better than, say, Oracle? Aren’t the NoSQL guys just reinventing all the same stuff [eventually]?
* Query response time is independent of data set size

Guy from Craigslist talked about how they faced a lot of issues trying to scale with MySQL
* Split databases up by use cases (i.e. the different sections of the site: jobs, for sale, etc.)
* Problem came when trying to identify spam
* Queries became too big and spanned too many data sources to work in real time
* Used MySQL as a key/value store to track each heuristic and decide at run time when those needed to be consulted
* Looked at Memcached and then Redis as solutions
* Also tried a product called “Despam,” which uses MySQL, but mostly in a key/value style, as in the home-grown solution above.

With SQL solutions, the general answer to scaling tends to be sharding, but you still have to manage those shards yourself, replicate for redundancy, etc.

NoSQL can be viewed as fundamentally sharded, which makes that approach elementary
* You go the opposite direction and figure out how to put all the little pieces together, instead of breaking them down and splitting them up
* Sharding is, then, no longer a hard thing
* Isn’t there a lot of mental effort/human time required to rethink for this approach?
* Yes, but only initially—it’s a one-time, up-front cost
* And only because people know SQL better right now (see earlier points)
* This is fine for small business; much harder for enterprise
* AppEngine is well proved for small companies and startups, but not widely used by enterprise
* Hard to choose a solution: all the NoSQL approaches are optimized for very different things and can be hard to compare (see earlier points on benchmarks)
* Oracle guys did some tests. Were any technologies harder to code against?
* Amazon’s stuff was a bit more complex to get set up, but not necessarily harder
* AppEngine rep (Fred) thinks there need to be, and will be, standards for cloud/distributed storage. They will all standardize and become commodities.
* AppEngine is optimized for small reads and writes, not big batch jobs, analytics, etc.
* Serving these other needs is Google’s next big task with AppEngine
* Other NoSQL solutions are similarly broadening the range of solutions, e.g. Cassandra and HBase are rapidly approaching each other from opposite directions.
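
To make the "fundamentally sharded" idea concrete, here is a minimal sketch of a key/value store in which every key is routed to one shard by a hash. It is an illustration only: `ShardedStore`, the shard count, and the sample keys are assumptions made here, and real systems add replication, rebalancing, and failure handling that this omits.

```python
import hashlib


class ShardedStore:
    """Key/value store where each key lives on exactly one shard (sketch only)."""

    def __init__(self, shard_count):
        # Each dict stands in for a separate machine.
        self.shards = [{} for _ in range(shard_count)]

    def _shard_for(self, key):
        # Stable hash so a key always maps to the same shard.
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        return self.shards[int(digest, 16) % len(self.shards)]

    def put(self, key, value):
        self._shard_for(key)[key] = value

    def get(self, key, default=None):
        return self._shard_for(key).get(key, default)


# Hypothetical usage.
store = ShardedStore(shard_count=4)
store.put("posting:123", {"section": "jobs"})
store.put("posting:456", {"section": "for sale"})
```

The caller never names a shard; routing is part of every read and write. The cost is that anything spanning many keys (joins, ad-hoc queries) has to be assembled from the pieces, which is the subject of the next question.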

How can NoSQL approaches better handle ad-hoc queries?
* Map/Reduce is the go-to answer, but doesn’t scale well for maintaining a wide variety of queries. Craigslist staff couldn’t have handled programming thousands of different map/reduce jobs.
* Josh (Craigslist) built a complex system for joining across different data sources
* What about standard ETL tools?
* Seems lots of people don’t know about them (e.g. Craigslist didn’t at the time)
* PIG? (SQL-like interface to Hadoop)

Session: Adding Structure With Hadoop and Cassandra at Rackspace

Big Data Workshop, April 23, 2010
Session 3G

Title: Adding Structure With Hadoop and Cassandra at Rackspace
Convener: @stuhood

Presentation:
