Business Intelligence SIG: Analytics: SQL or NoSQL
6:30 PM – 9:00 PM September 21, 2010
SAP – Southern Cross Room
3410 Hillview Avenue
Palo Alto, CA 94304
Title: Analytics: SQL or NoSQL
Change is in the air. The relational database that has long been the mainstay of data warehouses has a new challenger. The NoSQL movement has implemented new systems for storing and processing data, mostly based on the map data structure. On the surface, the difference between a map and a relation is just the difference between a tuple and an n-tuple; in practice, we use them in quite different ways, as we will explore.
The presentation starts by looking at the relational database and SQL. We examine the ideas on which relational databases are based to understand why we use them, and go through an example of how they are used for analytics. Then we will look at tuple-store-based NoSQL systems and again understand how they are used and what they are used for. Finally, we will compare the two, see that each has its place in an analytics architecture, and develop criteria for deciding which one to use and where.
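To make the tuple-versus-map distinction concrete, here is a toy sketch in Python using a hypothetical page-view dataset (the table and numbers are invented for illustration). The relation is a set of fixed-schema n-tuples that a query engine can aggregate freely; the map stores one value per key, so the access path is baked into the key, as is typical of NoSQL tuple stores.

```python
# Relational form: rows of (page, day, views) with a fixed schema.
relation = [
    ("home",  "2010-09-20", 120),
    ("home",  "2010-09-21", 150),
    ("about", "2010-09-21", 30),
]

# SQL-style aggregate: SELECT page, SUM(views) FROM relation GROUP BY page
totals = {}
for page, day, views in relation:
    totals[page] = totals.get(page, 0) + views

# Map form: the same data keyed by (page, day). Lookups by key are
# fast, but any other access path requires designing a new key.
kv = {
    ("home", "2010-09-20"): 120,
    ("home", "2010-09-21"): 150,
    ("about", "2010-09-21"): 30,
}

print(totals)                        # {'home': 270, 'about': 30}
print(kv[("home", "2010-09-21")])    # 150
```

The relation supports ad-hoc queries over any column; the map answers exactly the question its key encodes, which is the trade-off the talk explores.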
Speaker: Richard Taylor
Richard has been a chair of the SDForum Business Intelligence SIG for 10 years, and has spoken to the SIG on BI-related topics a number of times.
Over the last 20 years, he has been involved in building database systems on distributed and multiprocessor hardware at DEC, Data-Cache, RedBrick Systems, Informix, IBM and currently at SenSage Inc. Prior to that he worked on research projects in parallel and distributed computing.
Richard has a PhD in Computer Science from Cambridge University and a BSc in Computer Science from Manchester University in the UK.
Big Data Workshop, April 23, 2010
Title: Hbase – What? Why? Where does it fit?
Convener: Jon Gray Notes-taker: Mason Ng
- Jon’s background – his startup used Postgres and ran into problems, so he moved to HBase. He has since joined Facebook.
- List of things to discuss: What is HBase? A distributed, column-oriented, scalable database.
- Coming from relational world perspective
- HBase does not use local filesystems; it uses only HDFS.
- Cassandra uses local filesystems and replicates itself; one read could pull from 3 nodes. It is an AP (Available, Partition-tolerant) system.
- HBase uses HDFS replication. It is a CA (Consistent, Available) system.
- No transactions across rows; no full ACID.
- Within a row, operations are atomic.
- Data model: row → column family → columns, each column holding a list of versions: c1 → [versions…], c2 → [versions…]
- Column family has its own file(s)
- Relational databases do random reads; HBase/HDFS favors sequential writes and reads.
- Buffering writes: inserts accumulate in memory and are batch-written (flushed) to HDFS. Tables break into shards; each shard lives on only one node.
- e.g. rows A–D accumulate in memory, then flush to disk in ~64 MB chunks.
- Updates work by versioning; HBase does not actually delete or update in place.
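The data model and versioning bullets above can be sketched as a toy Python model (this is not the HBase API, just an illustration of the structure row → column family → column → versions, where a put appends a new timestamped version and a get returns the newest one):

```python
from collections import defaultdict

class Row:
    """Toy model of one HBase row: families -> qualifiers -> versions."""

    def __init__(self):
        # family -> qualifier -> list of (timestamp, value), newest first
        self.families = defaultdict(lambda: defaultdict(list))

    def put(self, family, qualifier, timestamp, value):
        # Updates append a new version rather than overwriting in place;
        # a delete would likewise just insert a tombstone marker (omitted).
        versions = self.families[family][qualifier]
        versions.append((timestamp, value))
        versions.sort(reverse=True)  # keep newest version first

    def get(self, family, qualifier):
        # Default read returns only the latest version.
        versions = self.families[family][qualifier]
        return versions[0][1] if versions else None

row = Row()
row.put("cf1", "title", 1, "draft")
row.put("cf1", "title", 2, "final")
print(row.get("cf1", "title"))  # final
```

Because old versions are retained until compaction, reads stay consistent within a row without any cross-row locking, which matches the "atomic within a row, no ACID across rows" notes above.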
- How Bigtable and HBase differ: HBase is in Java, Bigtable in C++. Java brings a lot of management overhead for HBase.
- Uses ZooKeeper for coordination.
- Supports random and sequential access. Supports RDF.
- HBase background tasks: compaction takes lots of ~64 MB shards and compacts them into one big chunk (roughly 3 shards); splits redistribute key ranges (e.g. a–b and b–d) to different shards.
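A toy sketch of the compaction step just described: each flushed shard is a sorted run of (key, value) pairs, and compaction merges several small runs into one big sorted run, keeping only the newest value per key. Here, later shards in the list are assumed to be newer (an assumption of this sketch, not a statement about HBase internals):

```python
def compact(shards):
    """Merge sorted shards into one sorted run, newest value wins per key."""
    merged = []
    # Tag each entry with the negated shard index so that, for equal keys,
    # entries from newer (higher-index) shards sort first.
    tagged = (
        ((key, -i), value)
        for i, shard in enumerate(shards)
        for key, value in shard
    )
    last_key = object()  # sentinel that equals no real key
    for (key, _neg_i), value in sorted(tagged):
        if key != last_key:      # first (i.e. newest) entry for this key
            merged.append((key, value))
            last_key = key
    return merged

old_shard = [("a", 1), ("c", 3)]
new_shard = [("a", 2), ("b", 5)]
print(compact([old_shard, new_shard]))  # [('a', 2), ('b', 5), ('c', 3)]
```

This is the same merge idea that lets splits hand off contiguous key ranges cheaply: because every shard is sorted, a range like a–b is a contiguous slice.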
- Data model
- Use case: Hadoop processing that needs random access; relational could not scale.
- Blog / RSS aggregator example: rows keyed by source, stamp, item id.
- Sources a, b, c insert in random order, ending up with a large B-tree. Merging 2,000 sources with 10,000 items each is then delegated to the relational query engine.
- What we want: "source A has this list of items" – column-oriented data. Items are sorted/mapped, stamp → id. Source key A becomes one large row spanning roughly three sequential 64 MB chunks.
- Schema is not fixed; creating/extending schemas is delegated to the application.
- Time comes last in the key. We want the latest copies, which may be inefficient because scans need to skip.
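A common fix for the "latest items require skipping" problem in the bullets above is to reverse the timestamp inside the row key, so an ascending scan returns newest-first. The key format and constant below are hypothetical, just to illustrate the technique:

```python
MAX_TS = 10**10  # hypothetical upper bound on the epoch timestamp

def row_key(source, timestamp, item_id):
    """Build a source/reversed-timestamp/item key for newest-first scans."""
    reversed_ts = MAX_TS - timestamp
    # Zero-pad so lexicographic order matches numeric order.
    return f"{source}/{reversed_ts:010d}/{item_id}"

# Items inserted in arbitrary order...
keys = sorted(
    row_key("sourceA", ts, item)
    for ts, item in [(100, "x"), (300, "z"), (200, "y")]
)
# ...but a plain ascending scan now yields the newest item first.
print([k.rsplit("/", 1)[1] for k in keys])  # ['z', 'y', 'x']
```

With this layout, "latest N items for source A" is a short prefix scan from `sourceA/0000000000` instead of a scan that skips over old versions.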
- FB is not using HBase yet, but is evaluating it.
- FB is a Hadoop shop with HDFS committers, and uses Hive. It would use HBase for incremental-update needs.
- HBase vs. Cassandra: if you need strong consistency, use HBase. Log-shipping replication is used for cross-colo replication; HBase provides audit trails for slave replication.
- HBase is slower than Cassandra on random access due to the HDFS layer.
- HBase and Cassandra should converge over time: Cassandra will get better at range scans, HBase will get better at random access.
- ~100 nodes per cluster, with segmenting. StumbleUpon has one cluster for the website and copies data to a separate offline cluster for MapReduce processing.
- A special catalog table lives on one node; if it has a hot row, it can become a bottleneck. ZooKeeper read replication is used to distribute reads of this table.
- The Hadoop NameNode is a single point of failure. GFS2 uses Bigtable to store metadata: GFS1 → Bigtable → GFS2 → bigger table. Open question: how to scale the NameNode and make it highly available.
- HBase vs. Hypertable: Hypertable is written in C++; HBase is backed by a strong Apache community.
- Hadoop/HBase are subject to GC performance because of their Java dependency; total throughput depends on GC. The upcoming G1 garbage collector will defragment memory.
- HBase does not need to run with Hadoop; it can run on a local filesystem on one node.
- Until the next HBase release there is a data-loss window due to an HDFS limitation with appending writes to the log file: if the writer dies before close(), data loss can occur.
- @FB: Hive and HBase integration; also looking at HBQL, and an ORM built on top of HBase.
- A Hadoop + HBase stack can provide rollup archives.
Big Data Workshop, April 23, 2010
Session - 1G
Title: New Apps Enabled by Scalable Database
Convener: Doug Judd / Andy Lee
Notes-taker: Matthew Gonzales
- Social apps are the most popular among App Engine users
- Observation – Loud in room G session 1 with construction noise in the background
- Geographic, location-based games are enabled by and popular with scalable databases
- Gov’t, medical… want a decision engine and are not as interested in storing data
- What does it mean to say “big data” what size is considered big?
- Observation – Hard to know who has what experience while discussion is going. Should have started with introductions first.
- Pluto is no longer a planet
When people registered, we asked them to share their questions about Big Data; they are listed below. Hopefully these questions will be addressed by the attendees gathered at the Big Data Workshop next Friday.
- Nevermind big data, what about big metadata?
- How can the enterprise data management industry better serve the web’s data management problems?
- Interested in developments in the area of big data: what applications have been developed, migration of large structured RDBMS to NoSQL, and whether transaction-based processing is a consideration. This is coming from the perspective of working for a large investment firm and wanting to determine the applicability of these technologies to “traditional” RDBMS environments, and what other opportunities exist for possible deployment.
- How can we best evaluate the design tradeoffs underlying different nosql technologies?
- Will DataWarehouse and Transactional converge?
- What are people buying today for large DBs?
- What are current best practices for bioinformatics and sequence analysis?
- What new data technologies are emerging, being adopted or not? And, more importantly, why?
- Interested in hearing how other people cope with large quantities of data, especially with respect to storage on Amazon Web Services (S3), doing real-time analytics and such.
We also asked topics people wanted to present about their own work related to Big Data:
- Creator of Hypertable, a high-performance, open source implementation of Google’s Bigtable. Will highlight the differences between Hypertable and the other scalable database alternatives.
- A super efficient, scalable, database with excellent map reduce properties.
- Biology: Now Under Moore’s Law — informatics is now the bottleneck for High-Throughput Sequencing (HTS)
- How we are implementing data cubes for traffic data and using it for real-time analytics
- HBase committer, so can present on that. Also work with big data at Facebook, which I can speak about in general but not specifically what I work on.
- Semantic Web technologies and Data Integration
- Redis via Java client, and its integration with other frameworks in this area
- upcoming features in Cassandra 0.7 and 0.8
- interested to see what other people are doing in the field