Session: Graph Databases

Big Data Workshop, April 23, 2010
Graph Databases – 3A

Title: Graph Databases
Convener: Johannes Ernst
Notes-taker: Dylan Clendenis


Different representations:

  • relational: tables, rows, columns, SQL
  • hierarchical: parent/child
  • graph: nodes, edges, directions

Graph Primitives

  • basic manipulation: set/get properties, “bless/unbless” relationships between nodes
  • traversal: given a node, get the set of neighbor nodes, subset by type or property value

It is important to make the conceptual shift from “querying” to traversing.
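A minimal sketch of these primitives, assuming a simple in-memory adjacency-list store (the class and method names are illustrative, not from any particular graph database):

```python
# Minimal property-graph sketch: nodes with properties, typed directed
# edges, and a traversal primitive that filters neighbors by edge type.
class Graph:
    def __init__(self):
        self.props = {}   # node -> {property: value}
        self.edges = {}   # node -> list of (edge_type, neighbor)

    def set_prop(self, node, key, value):
        self.props.setdefault(node, {})[key] = value

    def get_prop(self, node, key):
        return self.props.get(node, {}).get(key)

    def bless(self, src, edge_type, dst):
        """Create ("bless") a typed, directed relationship."""
        self.edges.setdefault(src, []).append((edge_type, dst))

    def unbless(self, src, edge_type, dst):
        """Remove ("unbless") the relationship if present."""
        self.edges.get(src, []).remove((edge_type, dst))

    def neighbors(self, node, edge_type=None):
        """Traversal primitive: neighbors of a node, optionally by type."""
        return [dst for (t, dst) in self.edges.get(node, [])
                if edge_type is None or t == edge_type]

g = Graph()
g.set_prop("alice", "age", 34)
g.bless("alice", "knows", "bob")
g.bless("alice", "works_at", "acme")
g.neighbors("alice", "knows")   # traversal, not a query: ["bob"]
```

Note how `neighbors()` starts from a node you already hold, rather than scanning a table, which is the querying-to-traversing shift in miniature.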

Session: Data Processing Model Besides Map/Reduce

Big Data Workshop, April 23, 2010
Session 5A

Title: Data Processing Model Besides M/R
Convener: Stanley Poon
Notes-taker: Stanley

  • M/R is relatively new
  • Limited in the ways you can partition a problem
  • High latency
  • Not enough parallelism: map has to finish before reduce
  • IBM has a tool that reduces many existing algorithms to map/reduce. Project name unknown at the time.
  • CloudComp 2009 has some papers comparing map/reduce with MPI
  • Graph processing models are more general, e.g. Pregel from Google
  • An example of using M/R to process video streams: streams are processed by mappers and fed to reducers. Each reducer puts the frames into sequence. Frame boundaries are a natural demarcation for breaking down a stream.
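The video-stream example can be sketched as a toy map/reduce run, assuming frames arrive tagged with a stream id and frame number (all names here are illustrative):

```python
# Toy map/reduce over "video frames": mappers tag each frame with its
# stream id; the reducer for a stream re-sequences frames by frame number.
# Note the limitation discussed above: all map output must be grouped
# (shuffled/sorted) before any reduce can run.
from itertools import groupby

def mapper(frame):
    stream_id, frame_no, data = frame
    return (stream_id, (frame_no, data))

def reducer(stream_id, tagged_frames):
    # Put the frames back into sequence for this stream.
    ordered = [d for _, d in sorted(tagged_frames)]
    return (stream_id, ordered)

def run_mapreduce(frames):
    pairs = sorted(mapper(f) for f in frames)        # shuffle/sort phase
    return [reducer(k, [v for _, v in grp])
            for k, grp in groupby(pairs, key=lambda kv: kv[0])]

frames = [("s1", 2, "b"), ("s1", 1, "a"), ("s2", 1, "x"), ("s1", 3, "c")]
run_mapreduce(frames)   # [("s1", ["a", "b", "c"]), ("s2", ["x"])]
```

The sequential `sorted` step between map and reduce is exactly the barrier the session's "map has to finish before reduce" criticism points at.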

Session: Scalable Search

Big Data Workshop, April 23, 2010
Session 3C

Convener: David Hardtke
Notes-taker: David Hardtke

Our main topic was the tradeoff between search latency, data latency, and cost when it comes to scalable search solutions.  Search latency is the time to execute a query.  Data latency is the time to make new data searchable.  Cost is the price of a solution.  The goal was to identify scalable search techniques for big data that do not involve caching of the search index.
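The two latencies can be made concrete with a toy inverted index (illustrative, not any real engine): data latency is the time spent in `index()`, search latency the time spent in `search()`.

```python
# Toy inverted index. "Data latency" is the cost of index() -- making a new
# document searchable -- while "search latency" is the cost of search().
from collections import defaultdict

class TinyIndex:
    def __init__(self):
        self.postings = defaultdict(set)   # term -> set of doc ids

    def index(self, doc_id, text):
        # Data latency: new data becomes searchable only after this runs.
        for term in text.lower().split():
            self.postings[term].add(doc_id)

    def search(self, query):
        # Search latency: intersect posting lists for the query terms.
        terms = query.lower().split()
        if not terms:
            return set()
        result = set(self.postings.get(terms[0], set()))
        for t in terms[1:]:
            result &= self.postings.get(t, set())
        return result

idx = TinyIndex()
idx.index(1, "big data workshop")
idx.index(2, "scalable search for big data")
idx.search("big data")   # {1, 2}
```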

Session: New Apps Enabled by Scalable Database

Big Data Workshop, April 23, 2010
Session 1G

Title: New Apps Enabled by Scalable Database
Convener: Doug Judd / Andy Lee
Notes-taker(s): Matthew Gonzales


  • Social Apps are most popular with those using app engine
  • Observation – Loud in room G session 1 with construction noise in the background
  • Geographical location based games are enabled and popular with Scalable Databases
  • Gov’t, Medical… want a decision engine and are not as interested in storing data
  • What does it mean to say “big data”? What size is considered big?
  • Observation – Hard to know who has what experience while the discussion is going on. Should have started with introductions first.
  • Pluto is no longer a planet

Session: Migrating From Small Data to Large – What Grows Well and How?

Big Data Workshop, April 23, 2010
Session 5B

Title: Migrating From Small Data to Large – What Grows Well and How?
Convener: Rob Brackett
Notes-taker: Ashley Frank

The session was formed out of one participant’s desire to discuss the possibilities of migrating from MySQL with heavy analysis to Big Data technology.

The view was expressed and supported that some kind of standards need to evolve to give some interoperability between Big Data implementations, but it was realized that because there is not much consistency in the contracts and expectations, these standards would be tough to form.

Comparisons were made between Amazon’s EC2 and Google’s App Engine: App Engine supports Java and Python, while EC2 is a virtualized environment for full operating systems.

Comparisons between Microsoft’s Azure and EC2 were discussed. Azure has 2 storage options, Azure Storage (big data style) with REST interface and Azure SQL Server.

When starting new projects, it is recommended to begin with the new technology rather than prototyping on old relational technology and migrating.

Why use the new technology?
If you have elastic demand, pricing is better. If you will need to scale large, it is easy. It solves some licensing and hardware headaches.

Other Thoughts:
Data needs to be near the applications that use it if the data transfer is large or data transfer costs rise.

Personal Observations:
If you have ever tried to implement the data warehousing strategy for relational databases where you denormalize your fact tables and horizontally partition them, you soon realize that you have broken your ability to join or use indexes in the way you used to, and have pretty much abandoned the query optimizer. Your fact tables can only be queried one way, and any analytics or other joining must be done after the only selective key that works (the leading column(s) of your partitions) is used. You may end up with redundant data not just in a single view but in multiple fact tables that express different views.

The result is that you are moving toward this new approach we are calling Big Data, with all the baggage of Relational but none of the innovation of Big Data solutions. However, it seems that dimension support and analysis are done by ‘other’ tools or custom solutions in the new Big Data world.

Relational vendors could document the path that large data requires in the evolution toward denormalization and partitioning of the fact table, and at some point provide the option to migrate the fact table to a Big Data technology, providing the glue for interacting with the Big Data using relational vocabulary and dimensional support, as well as other infrastructure for using Big Data new school.

Session: Is the File System Dead? if so what replaces it?

Big Data Workshop, April 23, 2010
Session 3D

Title: Is the File System Dead? if so what replaces it?
Convener: Rich Ramos
Notes-taker: Rich Ramos

In our increasingly distributed, mobile, BIG DATA world, has the traditional file system outlived its usefulness?

File system use cases:
1) Users’ storage of fixed content and unstructured data
2) Application data store
3) Disk block management

Does the file system still work well for any of those use cases, or are there better ways?

1) Users:
The virtual concept of “Folders/Directories” grew out of its physical-world equivalents; however, in the brave new world of digital data, folders are no longer very useful. Things like “tags”, “labels”, and search are better for these purposes.
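The tags-over-folders idea can be sketched briefly: a file lives under exactly one folder path, but any number of tags can point at it, and lookup becomes a set intersection (names and paths here are illustrative).

```python
# Tags vs. folders: any number of tags can point at the same file,
# and finding files is an intersection of tag sets, not a path walk.
from collections import defaultdict

tag_index = defaultdict(set)   # tag -> set of file paths

def tag(path, *tags):
    for t in tags:
        tag_index[t].add(path)

def find(*tags):
    sets = [tag_index[t] for t in tags]
    return set.intersection(*sets) if sets else set()

tag("/data/report.pdf", "work", "2010", "pdf")
tag("/data/photo.jpg", "2010", "vacation")
find("2010", "work")   # {"/data/report.pdf"}
```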

2) Applications:
Applications that are NOT DBMSes would prefer to use relational databases and, increasingly, key-value stores. DBMSes/OSes/VMMs would rather do the management themselves rather than go through a file system.

Without the first two use cases, the third, disk block management, goes away.

Summary: of course file systems won’t go away anytime soon, but you can see how their usefulness might greatly decrease over time.

Session: Transitioning to Cloud Datastores (No SQL)

Big Data Workshop, April 23, 2010
Session 3I

Title: Transitioning to Cloud Datastores (No SQL)
Convener: Nika Jones / Fred Sauer
Notes-taker: Rob Brackett


Are there public benchmarks for cloud storage?
* No, it’s all too different to compare well
* Yahoo has a key/value benchmark that came out recently
* Paper:
* Summary:

Most people have a SQL background → people need guidance with these new solutions

Fundamental differences between NoSQL and SQL?
* Distributedness? → Debatable. An Oracle rep argues SQL has nothing to do with implementation; joins can be distributed.
* ACID compliance? → Absolutely. This has to be given up to be distributed, which almost all of these new solutions are.

New tools for new problems — you can do different things with NoSQL tools.

So where is AppEngine better, for example, than Oracle?
* Elasticity (non-regular usage)
* What about basic storage/retrieval… how is it better than, say, Oracle? Aren’t the NoSQL guys just reinventing all the same stuff [eventually]?
* Query response time is independent of data set size

A guy from Craigslist talked about how they faced a lot of issues trying to scale with MySQL
* Split databases up by use cases (i.e. the different sections of the site: jobs, for sale, etc.)
* Problem came when trying to identify spam
* Queries became too big and spanned too many data sources to work in real time
* Used MySQL as a key/value store to track each heuristic and decided at run time when those needed to be consulted
* Looked at Memcached and then Redis as solutions
* Also tried a product called “Despam,” which uses MySQL, but mostly in a key/value style, as in the home-grown solution above.
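The "MySQL as a key/value store" pattern from the points above can be sketched as follows (using Python's built-in sqlite3 as a stand-in engine; the single-table, primary-key-only access pattern is the point, not the database):

```python
# Key/value over a relational engine: one table, primary-key lookups only,
# no joins -- the access style the Craigslist discussion describes.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE kv (k TEXT PRIMARY KEY, v TEXT)")

def put(k, v):
    db.execute("INSERT OR REPLACE INTO kv (k, v) VALUES (?, ?)", (k, v))

def get(k):
    row = db.execute("SELECT v FROM kv WHERE k = ?", (k,)).fetchone()
    return row[0] if row else None

# Hypothetical spam-heuristic key, for illustration only.
put("spam_score:post:123", "0.87")
get("spam_score:post:123")   # "0.87"
```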

With SQL solutions, the general answer to scaling tends to be sharding, but you still have to manage those shards yourself, replicate for redundancy, etc.

NoSQL can be viewed as fundamentally sharded, which makes that approach elementary
* You go in the opposite direction and figure out how to put all the little pieces together, instead of breaking them down and splitting them up
* Sharding is, then, no longer a hard thing
* Isn’t there a lot of mental effort/human time required to rethink for this approach?
* Yes, but only initially—it’s a one-time, up-front cost
* And only because people know SQL better right now (see early points)
* This is fine for small business; much harder for enterprise
* AppEngine is well proved for small companies and startups, but not widely used by enterprise
* Hard to choose a solution—all the NoSQL approaches are optimized for very different things and it can be hard to compare (see early points on benchmarks)
* Oracle guys did some tests. Were any technologies harder to code against?
* Amazon’s stuff was a bit more complex to get set up, but not necessarily harder
* AppEngine rep (Fred) thinks there needs to be and will be standards for cloud/distributed storage. They will all standardize and become commodities.
* AppEngine is optimized for small reads and writes, not big batch jobs, analytics, etc.
* Serving these other needs is Google’s next big task with AppEngine
* Other NoSQL solutions are similarly broadening the range of solutions, e.g. Cassandra and HBase are rapidly approaching each other from opposite directions.
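The "fundamentally sharded" point above can be sketched with simple hash partitioning (illustrative only; real systems add replication, rebalancing, and consistent hashing):

```python
# Hash sharding: the store decides where a key lives, so clients never
# manage shards by hand -- the property NoSQL systems build in up front.
import hashlib

class ShardedStore:
    def __init__(self, n_shards=4):
        self.shards = [dict() for _ in range(n_shards)]

    def _shard(self, key):
        # Deterministic hash of the key picks the shard.
        h = hashlib.md5(key.encode()).hexdigest()
        return self.shards[int(h, 16) % len(self.shards)]

    def put(self, key, value):
        self._shard(key)[key] = value

    def get(self, key):
        return self._shard(key).get(key)

s = ShardedStore()
s.put("user:42", {"name": "Ada"})
s.get("user:42")   # {"name": "Ada"}
```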

How can NoSQL approaches better handle ad-hoc queries?
* Map/Reduce is the go-to answer, but doesn’t scale well for maintaining a wide variety of queries. Craigslist staff couldn’t have handled programming thousands of different map/reduce jobs.
* Josh (Craigslist) built a complex system for joining across different data sources
* What about standard ETL tools?
* Seems lots of people don’t know about them (e.g. Craigslist didn’t at the time)
* Pig? (SQL-like interface to Hadoop)

Session: Adding Structure With Hadoop and Cassandra at Rackspace

Big Data Workshop, April 23, 2010
Session 3G

Title: Adding Structure With Hadoop and Cassandra at Rackspace
Convener: @stuhood


Session: HBase – What? Why? Where does it fit?

Big Data Workshop, April 23, 2010
Session 2C
Title: HBase – What? Why? Where does it fit?

Convener: Jon Gray    Notes-taker: Mason Ng

  • Jon’s background: his startup used Postgres and had problems. He then started to use HBase. Joined Facebook.
  • What is HBase: a distributed, column-oriented, scalable database
  • Coming from a relational-world perspective
  • HBase does not use local filesystems; it only uses HDFS
  • Cassandra uses local filesystems and then replicates itself. One read could be pulling from 3 nodes. AP (Available, Partition-tolerant) system.
  • HBase uses HDFS replication. CA (Consistent, Available) system.
  • No transactions across rows. No ACID.
  • Within a row, transactions are atomic.
  • ACID property: row -> column family – c1 -> [versions…], c2 -> [versions…]
  • Each column family has its own file(s)
  • Relational does random reads. HBase/HDFS does sequential writes/reads.
  • Buffering writes: inserts buffer in memory and then batch-write (flush) to HDFS. Tables break into shards. Each shard lives on only one node.
  • rowA–rowD in memory flushes 64MB to disk
  • Updates by versioning. Does not actually delete/update in place.
  • How Bigtable and HBase differ: HBase is in Java, Bigtable is in C++. Lots of cruft to manage with Java and therefore with HBase. Uses ZooKeeper.
  • Supports random and sequential access. Supports RDF.
  • HBase background tasks: compaction takes lots of 64MB shards and compacts them into one big chunk (about 3 shards); split redistributes a–b and b–d to different shards
  • Data model
  • Processing with Hadoop needs random access. Could not scale relational.
  • Blog / RSS aggregators. Source, stamp, itemid.
  • Sources a, b, c insert in random order. Ends up with a large btree. Merge 2000 sources where each source has 10000 items. Relational then delegates to the query engine.
  • Want “source A has this list of items” – column-oriented data. Items are sorted/mapped. Item stamp -> id. Source (key) A is a large row across sequential (3) 64MB chunks.
  • The schema is not fixed; it delegates to the application to create/extend schemas.
  • Time is last. Want to get the latest copies. May not be efficient because of the need to skip.
  • FB is not using HBase but is evaluating it.
  • FB is a Hadoop shop with HDFS committers. Using Hive. Would use HBase for incremental-update needs.
  • HBase vs. Cassandra: if you need consistency rather than eventual consistency, use HBase. Log-shipping replication for colo replication. HBase provides audit trails for slave replication.
  • HBase is slower than Cassandra on random access due to the HDFS layer.
  • HBase and Cassandra should converge over time: Cassandra will get better at range scans, HBase will get better at random access.
  • 100 nodes per cluster. Use of segmenting. StumbleUpon has a cluster for the website and copies data to another offline cluster for map/reduce processing.
  • A special table lives on one node. If there is a hot row, the special table could be a bottleneck. Use ZooKeeper for the special table, since read replication with ZooKeeper distributes the reads.
  • The Hadoop namenode is a single point of failure. GFS2 uses Bigtable to store the metadata: GFS1 -> Bigtable -> GFS2 -> bigger table. How to scale the namenode and make it highly available?
  • HBase vs. Hypertable: Hypertable is written in C++. HBase’s community is backed by a strong Apache community.
  • Subject to GC performance due to the Java dependency of Hadoop/HBase. Total throughput depends on GC performance. The G1 garbage collector is coming and defragments memory.
  • HBase does not need to run with Hadoop; it can run on a local filesystem on one node.
  • There is data loss until the next HBase release due to HDFS limitations such as append writes to the log file: if a writer dies before a close(), data loss can occur.
  • At FB: Hive and HBase integration. Also looking at HBQL, an ORM built on top of HBase. A Hadoop + HBase stack can provide rollup archives.
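The buffered-write and update-by-versioning bullets above can be sketched in miniature (flush threshold shrunk from 64MB to a few cells; all names are illustrative, not HBase's actual API):

```python
# HBase-style write path in miniature: writes buffer in memory and flush
# to immutable "files" in batches; updates never overwrite in place,
# they add a new timestamped version.
import time

class MiniStore:
    def __init__(self, flush_at=3):
        self.memstore = {}     # (row, col) -> list of (ts, value)
        self.files = []        # flushed, immutable sorted snapshots
        self.flush_at = flush_at

    def put(self, row, col, value, ts=None):
        ts = ts if ts is not None else time.time()
        self.memstore.setdefault((row, col), []).append((ts, value))
        if len(self.memstore) >= self.flush_at:
            self.flush()

    def flush(self):
        # Batch-write the sorted buffer (stand-in for a 64MB HDFS flush).
        self.files.append(dict(sorted(self.memstore.items())))
        self.memstore = {}

    def get(self, row, col):
        # Latest version wins; old versions remain until compaction.
        versions = list(self.memstore.get((row, col), []))
        for f in self.files:
            versions += f.get((row, col), [])
        return max(versions)[1] if versions else None

s = MiniStore()
s.put("rowA", "c1", "v1", ts=1)
s.put("rowA", "c1", "v2", ts=2)   # a new version, not an in-place update
s.get("rowA", "c1")   # "v2"
```

Compaction, not shown, would merge the accumulated files and drop superseded versions, matching the background-task bullet above.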

Session: Introduction to Hypertable  Q&A

Big Data Workshop, April 23, 2010
Session 3F

Title: Introduction to Hypertable  Q&A
Convener: Doug Judd
Notes-taker: Matthew Gonzales


  • Hypertable is an open-source implementation of Google’s Bigtable
  • Hyperspace is equivalent to Google’s Chubby system
  • Hypertable performance features include: implemented in C++, query cache, block cache, Bloom filter
  • Hypertable has been around for 3 years; 1.0 release due in July
  • Large Hypertable deployments include Baidu and Rediff
  • How does Hypertable compare to HBase?  Ans:  C++ vs. Java; Hypertable chose C++ for performance reasons. Considering offering Hypertable as a hosted solution in the cloud.
  • Does Hypertable simulate nodes or do you have to have multiple servers?  Ans:  You can run it on a laptop
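One of the performance features listed above, the Bloom filter, can be sketched in a few lines (a toy with two salted hashes; real implementations tune bit-array size and hash count to a target false-positive rate):

```python
# Toy Bloom filter: answers "definitely not present" or "maybe present"
# without touching disk -- how Bigtable-style stores skip needless reads.
import hashlib

class BloomFilter:
    def __init__(self, size=1024):
        self.size = size
        self.bits = [False] * size

    def _positions(self, key):
        for salt in (b"a", b"b"):   # two hash functions via salting
            h = hashlib.md5(salt + key.encode()).hexdigest()
            yield int(h, 16) % self.size

    def add(self, key):
        for p in self._positions(key):
            self.bits[p] = True

    def might_contain(self, key):
        # False means the key was never added; True means "maybe".
        return all(self.bits[p] for p in self._positions(key))

bf = BloomFilter()
bf.add("row-123")
bf.might_contain("row-123")   # True
bf.might_contain("row-999")   # almost certainly False
```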
