Saturday, January 24, 2015

Understanding HBase: First look

HBase

Apache HBase™ is a NoSQL database built on top of Hadoop and the HDFS file system: a distributed, scalable, big data store.

It's commonly described as a key-value store, a column-family-oriented database, and a versioned map of maps.
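
One way to picture the “versioned map of maps” description is as nested sorted maps: row key → column family → column qualifier → timestamp → cell value. Here is a conceptual toy in Java (plain TreeMaps, not an actual HBase type):

    import java.util.Collections;
    import java.util.TreeMap;

    // Conceptual sketch of HBase's data model: a sorted map of row keys, each
    // holding families -> qualifiers -> (timestamp -> value), newest first.
    public class MapOfMaps {
        public static void main(String[] args) {
            TreeMap<String, TreeMap<String, TreeMap<String, TreeMap<Long, String>>>> table = new TreeMap<>();
            table.computeIfAbsent("row1", k -> new TreeMap<>())
                 .computeIfAbsent("cf", k -> new TreeMap<>())
                 .computeIfAbsent("name", k -> new TreeMap<>(Collections.<Long>reverseOrder()))
                 .put(System.currentTimeMillis(), "anil");
            System.out.println(table); // {row1={cf={name={<timestamp>=anil}}}}
        }
    }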

Why HBase

  • My system has a well-normalized SQL database, but the tables have grown huge; queries require table joins and take a lot of time.
  • My system needs to perform random reads on a large SQL table. The queries take more and more time, and buffering doesn't help much because the reads are random.
  • Adding and removing entities in the SQL tables is becoming frequent; because the tables are so large, the maintenance cost keeps increasing.
  • The tables are growing really big; I can't meet the storage requirement with commodity hardware and may have to upgrade to specialized machines.
If you face any of the problems above, HBase may come to your rescue.

HBase
  • Allows fast random reads and writes.
  • Although HBase allows fast random writes, its sorted, indexed storage layout is optimized for reads.
  • Built on top of the Hadoop system, which solves the horizontal-scaling problem.
  • HBase is a NoSQL database: loosely coupled, with no joins.
  • Is good for large/sparse data sets.
  • HBase provides the Consistency and Partition tolerance (CP) of the CAP theorem.

 

HBase Examples

  • Finding the 50 largest items in a group of 2 billion records
  • Finding the non-zero items representing less than 0.1% of a huge collection
  • Messaging Platform

 

Who is using HBase

http://radar.oreilly.com/2014/04/5-fun-facts-about-hbase-that-you-didnt-know.html

HBase Architecture

[HBase architecture diagram]

Key components of HBase

  • ZooKeeper: coordinates between the clients and the HBase Master.
  • HBase Master: monitors the RegionServers.
  • RegionServer: hosts and serves the Regions.
  • Region: contains the in-memory data store (MemStore) and the HFiles.
  • Catalog tables: the -ROOT- and .META. tables.

HBase operational commands

  • get
  • put
  • scan
  • increment
  • delete
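
A minimal sketch of these commands through the HBase Java client API (HBase 1.0+ style; the table name “demo”, family “cf”, and qualifiers here are illustrative):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.*;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseCommandsDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection connection = ConnectionFactory.createConnection(conf);
                 Table table = connection.getTable(TableName.valueOf("demo"))) {

                byte[] row = Bytes.toBytes("row1");
                byte[] cf  = Bytes.toBytes("cf");

                // put: write one cell
                Put put = new Put(row);
                put.addColumn(cf, Bytes.toBytes("name"), Bytes.toBytes("anil"));
                table.put(put);

                // get: read one row
                Result result = table.get(new Get(row));
                System.out.println(Bytes.toString(result.getValue(cf, Bytes.toBytes("name"))));

                // scan: iterate over rows in row-key order
                try (ResultScanner scanner = table.getScanner(new Scan())) {
                    for (Result r : scanner) {
                        System.out.println(Bytes.toString(r.getRow()));
                    }
                }

                // increment: atomically add to a counter cell
                table.incrementColumnValue(row, cf, Bytes.toBytes("visits"), 1L);

                // delete: remove the row (or individual cells)
                table.delete(new Delete(row));
            }
        }
    }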

HBase and Hadoop

  • Fast random reads require the data to be stored in a structured (sorted) form.
  • The only way to modify a file stored on HDFS without rewriting it is to append.
  • Fast random writes into sorted files using only appends seem impossible at first.
  • The solution to this problem is the Log-Structured Merge Tree (LSM tree).
  • The HBase data structure is based on LSM trees.

 

The Log-Structured Merge Tree

The LSM tree works as follows:
  • All puts (insertions) are appended to a write-ahead log (WAL). This can be done fast on HDFS, and the log can be used to restore the database if anything goes wrong.
  • An in-memory data structure (MemStore) stores the most recent puts (fast and ordered).
  • From time to time the MemStore is flushed to disk.
This results in the following structure:

[LSM-Tree structure diagram]
  • This results in many small files on HDFS.
  • HDFS works better with a few large files than with many small ones.
  • A get or scan potentially has to look into all of the small files, so fast random reads are not possible as described so far.
  • That is why HBase constantly checks whether it is necessary to combine several small files into one larger one.
  • This process is called compaction. There are two different kinds of compactions.
  • Minor compactions merge a few small ordered files into one larger ordered file without touching the data.
  • Major compactions merge all files into one file. During this process, outdated or deleted values are removed.
  • Guarantees on the maximum number of compactions per entry can be made because of the way HBase triggers compactions.
  • Bloom filters (stored in the metadata of the files on HDFS) can be used to quickly exclude files when looking for a specific key. The toy sketch below illustrates this write path and why compaction matters.
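
Here is a toy sketch of the LSM write path in plain Java. It is an illustration only, not HBase code; deletes/tombstones and WAL truncation are omitted:

    import java.util.*;

    // Toy LSM store: puts are logged to a WAL and buffered in a sorted MemStore;
    // full MemStores are flushed as immutable sorted runs ("HFiles");
    // compaction merges the runs back into one file.
    class ToyLsmStore {
        private final List<String> wal = new ArrayList<>();                   // write-ahead log
        private final TreeMap<String, String> memStore = new TreeMap<>();     // sorted in-memory buffer
        private final List<TreeMap<String, String>> runs = new ArrayList<>(); // flushed sorted files
        private static final int FLUSH_THRESHOLD = 4;

        void put(String key, String value) {
            wal.add(key + "=" + value);            // 1. append to the WAL first (durability)
            memStore.put(key, value);              // 2. insert into the ordered MemStore
            if (memStore.size() >= FLUSH_THRESHOLD) {
                runs.add(new TreeMap<>(memStore)); // 3. flush as an immutable sorted run
                memStore.clear();
            }
        }

        // A get must check the MemStore and then every run, newest first --
        // this is why many small files hurt reads and compaction is needed.
        String get(String key) {
            if (memStore.containsKey(key)) return memStore.get(key);
            for (int i = runs.size() - 1; i >= 0; i--) {
                String v = runs.get(i).get(key);
                if (v != null) return v;
            }
            return null;
        }

        // "Major compaction": merge all runs into one; newer values win.
        void compact() {
            TreeMap<String, String> merged = new TreeMap<>();
            for (TreeMap<String, String> run : runs) merged.putAll(run); // oldest to newest
            runs.clear();
            runs.add(merged);
        }

        public static void main(String[] args) {
            ToyLsmStore store = new ToyLsmStore();
            for (int i = 0; i < 10; i++) store.put("key" + (i % 5), "v" + i);
            store.compact();
            System.out.println(store.get("key3")); // prints v8, the newest version
        }
    }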

 

HBase vs. MySQL

http://www.slideshare.net/cloudera/chicago-data-summit-apache-hbase-an-introduction

                 RDBMS            HBase
  Data layout    Row-oriented     Column-oriented
  Transactions   Multi-row ACID   Single row only
  Query lang...

HBase Schema design


By now we know, in theory, what HBase is and when to use it. Let me try to solve some of the most common industry problems using HBase.

Twitter: User Follows and Followers.

Let’s try to build a schema for handling Twitter’s “follows and followers” use case.

The schema should help us answer some basic queries:
  1. List all the users a given user follows
  2. List all the users who follow a given user
  3. Give me the personal details of a user

Table: UserFollows
Row key => userId
Column families: PersonalInfo, Follows

The table UserFollows uses the userId as the row key. It maintains two column families:
PersonalInfo: maintains all the personal details of a user.
Follows: the list of all the userIds the user follows.
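
For illustration, a single row in UserFollows might look like this (the user ids match the examples later in this post; the PersonalInfo qualifiers are made up):

    Row key: @anil
      PersonalInfo:name => "Anil"
      PersonalInfo:city => "Bangalore"
      Follows:@sunil    => 1
      Follows:@john     => 1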

This table helps us answer queries [1] and [3]. However, the same table can’t be used to answer [2], since that would require a full table scan for each query, which would be very expensive.
So how do we address this issue?
How about another table that maintains the followed-by relationship? Take a look at the table "UserFollowed" below.

Table: UserFollowed
Row key => userId
Column family => FollowedBy


This table can easily answer query [2]. So with these two tables we are able to answer all three queries.
Do you see any issue with these tables?
Data redundancy? Yes. However, note that only the row keys are indexed; for faster reads and writes it is still acceptable to end up with some data redundancy. A sketch of the resulting double write is shown below.
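
With two tables, every “follow” action becomes two writes that the application must fan out itself. A minimal sketch with the HBase Java client (table and family names as defined above; the FollowService class and its helper are illustrative):

    import java.io.IOException;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.*;
    import org.apache.hadoop.hbase.util.Bytes;

    public class FollowService {
        private final Connection connection;

        public FollowService(Connection connection) {
            this.connection = connection;
        }

        // Record that 'follower' now follows 'followed' -- in both tables.
        public void follow(String follower, String followed) throws IOException {
            // UserFollows: row = follower, column Follows:<followed>
            try (Table follows = connection.getTable(TableName.valueOf("UserFollows"))) {
                Put put = new Put(Bytes.toBytes(follower));
                put.addColumn(Bytes.toBytes("Follows"), Bytes.toBytes(followed), Bytes.toBytes("1"));
                follows.put(put);
            }
            // UserFollowed: row = followed, column FollowedBy:<follower>
            try (Table followedBy = connection.getTable(TableName.valueOf("UserFollowed"))) {
                Put put = new Put(Bytes.toBytes(followed));
                put.addColumn(Bytes.toBytes("FollowedBy"), Bytes.toBytes(follower), Bytes.toBytes("1"));
                followedBy.put(put);
            }
            // HBase offers single-row atomicity only (no cross-table transactions),
            // so the two tables can briefly disagree if the second write fails.
        }
    }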

Since it’s only the row key that is indexed, we can optimize further by including the followed user within the row key. Example:
"@anil+@sunil"
"@anil+@john"

This way it becomes faster to scan through the indexed data and find who follows whom, as the sketch below shows.
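
With the composite row key, “who does a user follow” becomes a prefix scan over row keys. A sketch using the client’s PrefixFilter (row-key format as above; the FollowScan class is illustrative):

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.*;
    import org.apache.hadoop.hbase.filter.PrefixFilter;
    import org.apache.hadoop.hbase.util.Bytes;

    public class FollowScan {
        // List everyone 'userId' follows by scanning rows prefixed with "<userId>+".
        public static List<String> listFollows(Connection connection, String userId)
                throws IOException {
            List<String> follows = new ArrayList<>();
            byte[] prefix = Bytes.toBytes(userId + "+");
            Scan scan = new Scan();
            scan.setStartRow(prefix);                  // jump straight to the prefix
            scan.setFilter(new PrefixFilter(prefix));  // stop once rows leave the prefix
            try (Table table = connection.getTable(TableName.valueOf("UserFollows"));
                 ResultScanner scanner = table.getScanner(scan)) {
                for (Result r : scanner) {
                    String rowKey = Bytes.toString(r.getRow()); // e.g. "@anil+@sunil"
                    follows.add(rowKey.substring(rowKey.indexOf('+') + 1));
                }
            }
            return follows;
        }
    }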


  •  Flipkart: Sellers Catalog

    TBD
  • Monitoring: IT Infrastructure Monitoring

    TBD

 

References

  • Apache HBase
  • YMC
  • https://en.wikipedia.org/wiki/Apache_HBase
  • http://career.guru99.com/top-13-hbase-interview-questions/
  • http://www.edureka.co/blog/overview-of-hbase-storage-architecture/

 

 
