HBase - Overview


Limitations of Hadoop


Hadoop processes data only in batches and accesses it only sequentially. This means that even the simplest job must scan the entire dataset.

A huge dataset, once processed, produces another equally huge dataset that must again be handled sequentially. A new approach is therefore needed to reach any single point of data in a single unit of time (random access).


Hadoop Random Access Databases


Databases such as HBase, Cassandra, CouchDB, Dynamo, and MongoDB store huge amounts of data and allow the data to be accessed in a random manner.


Introduction to HBase


HBase is an open-source, horizontally scalable, column-oriented distributed database built on top of the Hadoop file system.

Modeled after Google's Bigtable, HBase is a data model designed to offer rapid random access to enormous volumes of structured data, and it leverages the fault tolerance of the Hadoop Distributed File System (HDFS).

As part of the Hadoop ecosystem, HBase provides random, real-time read/write access to data stored in HDFS. Data can be stored in HDFS either directly or through HBase; data consumers then read and access it randomly using HBase, which is layered on top of HDFS and offers both read and write access.


HBase and HDFS


HDFS and HBase compare as follows (a code sketch follows the list):
  • HDFS is a distributed file system suitable for storing large files; HBase is a database built on top of HDFS.
  • HDFS does not support fast individual record lookups; HBase provides fast lookups in large tables.
  • HDFS offers high-latency batch processing; HBase offers low-latency access to single rows out of billions of records (random access).
  • HDFS allows only sequential access to data; HBase internally uses hash tables, provides random access, and stores its data in indexed HDFS files for faster lookups.
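
To make the two access patterns concrete, here is a minimal sketch using the HBase Java client API. The table name "users", the column family "info", and the qualifier "name" are assumptions for illustration, not part of the text above.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class RandomVsSequentialAccess {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("users"))) {

            // Random access: fetch a single row directly by its row key.
            Get get = new Get(Bytes.toBytes("user42"));
            Result result = table.get(get);
            byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println("Random read: " + Bytes.toString(name));

            // Sequential access: scan rows in row-key order, the way an
            // HDFS-style batch job would.
            Scan scan = new Scan();
            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result row : scanner) {
                    System.out.println("Scanned row: " + Bytes.toString(row.getRow()));
                }
            }
        }
    }
}

A Get jumps straight to one row by its key, while a Scan walks the rows in key order; the first is what HBase adds over plain HDFS.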


Storage Mechanism in HBase


HBase is column-oriented, and its tables are sorted by row. The table schema defines only column families, which are key-value pairs. A table has multiple column families, and each column family can have any number of columns. Subsequent column values are stored contiguously on disk, and every cell value in a table carries a timestamp. To sum up, within an HBase (a sketch follows the list):
  • Table is a collection of rows. 
  • Row is a collection of column families. 
  • Column family is a collection of columns. 
  • Column is a collection of key-value pairs.
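
The following sketch illustrates this model with the HBase Java client API. It writes one cell: a row key, a column family, a column qualifier, and a value; the timestamp is assigned automatically by HBase. The table name "employee" and the column names are hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class DataModelExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("employee"))) {

            // One cell in row "emp1": column family "personal", column "city".
            Put put = new Put(Bytes.toBytes("emp1"));          // row key
            put.addColumn(Bytes.toBytes("personal"),           // column family
                          Bytes.toBytes("city"),               // column qualifier
                          Bytes.toBytes("Kolkata"));           // cell value
            table.put(put);                                    // timestamp added by HBase
        }
    }
}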

Column-Oriented and Row-Oriented


Column-oriented databases store data tables as sections of columns of data rather than as rows of data. In short, they have column families.

Row-Oriented Database vs. Column-Oriented Database
  • A row-oriented database is suitable for Online Transaction Processing (OLTP); a column-oriented database is suitable for Online Analytical Processing (OLAP).
  • Row-oriented databases are designed for a small number of rows and columns; column-oriented databases are designed for huge tables.

[Figure: column families in a column-oriented database]

HBase and RDBMS

HBase and a traditional RDBMS differ as follows (a create-table sketch follows the list):
  • HBase is schema-less; it has no notion of a fixed-column schema and defines only column families. An RDBMS is governed by its schema, which describes the whole structure of its tables.
  • HBase is built for wide tables and is horizontally scalable. An RDBMS is thin, built for small tables, and hard to scale.
  • There are no transactions in HBase. An RDBMS is transactional.
  • HBase holds denormalized data. An RDBMS holds normalized data.
  • HBase is good for semi-structured as well as structured data. An RDBMS is good for structured data.
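
Because the schema consists only of column families, creating a table means naming the families and nothing else; individual columns appear on the fly when data is written. A minimal sketch with the Java Admin API follows; the table name "employee" and the family names are assumptions for illustration.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;

public class CreateTableExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Admin admin = connection.getAdmin()) {

            // The schema names only column families; no columns are declared.
            TableDescriptor desc = TableDescriptorBuilder
                    .newBuilder(TableName.valueOf("employee"))
                    .setColumnFamily(ColumnFamilyDescriptorBuilder.of("personal"))
                    .setColumnFamily(ColumnFamilyDescriptorBuilder.of("professional"))
                    .build();
            admin.createTable(desc);
        }
    }
}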


Features of HBase


  • HBase scales linearly.
  • It offers automatic failure support.
  • It provides consistent reads and writes.
  • It integrates with Hadoop, both as a source and as a destination.
  • Its Java client API is easy to use.
  • It provides data replication across clusters.

Usage of HBase

  • Apache HBase provides random, real-time access to Big Data.
  • It hosts very large tables on top of clusters of commodity hardware.
  • Apache HBase is a non-relational database modeled after Google's Bigtable. Just as Bigtable operates on top of the Google File System, Apache HBase operates on top of Hadoop and HDFS.

Applications of HBase

  • It is employed when write-heavy applications are required.
  • We use HBase whenever we need to provide fast random access to the data.
  • Companies such as Facebook, Twitter, Yahoo, and Adobe use HBase internally.


HBase - Architecture 


HBase has four main components: the HMaster, ZooKeeper, region servers, and regions.

The HBase Master (HMaster) assigns regions and handles load balancing, while Apache ZooKeeper keeps an eye on the state of the cluster. Region servers serve the data for reads and writes, and every machine in the Hadoop cluster runs one. A region server is made up of regions, an HLog (the write-ahead log), stores, and MemStores, along with several files, and all of this sits on top of the HDFS file system.

HBase Architectural Component: Regions 

HBase datasets are horizontally partitioned into "Regions" according to row-key range. Regions are assigned to cluster nodes known as "Region Servers", which serve the data for both reads and writes.
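
Since regions correspond to row-key ranges, a table can even be created pre-split into several regions. A hedged sketch with the Java Admin API is below; the table name "events", the family "d", and the split keys are illustrative assumptions.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HRegionLocation;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class PreSplitRegionsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Admin admin = connection.getAdmin()) {

            TableDescriptor desc = TableDescriptorBuilder
                    .newBuilder(TableName.valueOf("events"))
                    .setColumnFamily(ColumnFamilyDescriptorBuilder.of("d"))
                    .build();

            // Pre-split by row key: keys < "g" land in the first region,
            // "g".."n" in the second, and so on (four regions in total).
            byte[][] splitKeys = {
                Bytes.toBytes("g"), Bytes.toBytes("n"), Bytes.toBytes("t")
            };
            admin.createTable(desc, splitKeys);

            // Show which region server each region was assigned to.
            try (RegionLocator locator =
                     connection.getRegionLocator(TableName.valueOf("events"))) {
                for (HRegionLocation loc : locator.getAllRegionLocations()) {
                    System.out.println(loc.getRegion().getEncodedName()
                            + " -> " + loc.getHostname());
                }
            }
        }
    }
}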
 

HBase Architectural Component: Region Server 

The Region Server does the following jobs.
  • Communicates with the client and handles data-related operations.
  • Handles read and write requests for all the regions under it.
  • Decides the size of a region by following the region-size thresholds.
Every machine in the Hadoop cluster runs a region server, which is made up of regions, an HLog, stores, and MemStores, along with several files, all on top of the HDFS file system. The MemStore works like cache memory: anything written to HBase is stored there first. Afterwards, the MemStore is flushed, and the data is transferred and saved as blocks in HFiles.
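
A flush of this kind normally happens automatically when a MemStore reaches its size threshold, but it can also be requested explicitly through the Java Admin API, as in this sketch (the table name "employee" is an assumption):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class FlushExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Admin admin = connection.getAdmin()) {
            // Force the MemStores of every region of this table to be
            // written out as HFiles on HDFS.
            admin.flush(TableName.valueOf("employee"));
        }
    }
}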

HBase Architectural Component: Master Server 

The Master Server does the following jobs.
  • Assigns regions to the region servers, taking the help of Apache ZooKeeper for this task.
  • Handles load balancing of the regions across region servers: it unloads the busy servers and shifts the regions to less-occupied ones.
  • Maintains the state of the cluster by negotiating the load balancing.
  • It is responsible for schema changes and other metadata operations, such as the creation of tables and column families.

HBase Architectural Component: Zookeeper 

ZooKeeper does the following jobs (a client-side sketch follows the list).
  • ZooKeeper is an open-source project that provides services such as naming, distributed synchronization, and maintaining configuration information.
  • Ephemeral nodes in ZooKeeper represent the individual region servers; master servers use these nodes to discover available servers.
  • Besides availability, the nodes are also used to track server failures and network partitions.
  • Clients communicate with region servers via ZooKeeper.
  • In standalone and pseudo-distributed modes, HBase itself manages ZooKeeper.
  • The active HMaster sends heartbeat signals to ZooKeeper, indicating that it is up and running.
  • The inactive HMaster server acts as a backup; if the active HMaster fails, it comes to the rescue.
  • Region servers notify ZooKeeper when they are ready for read and write operations.
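
On the client side, ZooKeeper is the entry point to the cluster: a client only needs the address of the ZooKeeper quorum to locate region servers. A minimal sketch follows; the hostnames are placeholders, while "hbase.zookeeper.quorum" and "hbase.zookeeper.property.clientPort" are standard HBase configuration keys.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class ZooKeeperClientExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // The client contacts ZooKeeper first; ZooKeeper tells it which
        // region servers hold the regions it needs.
        conf.set("hbase.zookeeper.quorum", "zk1.example.com,zk2.example.com");
        conf.set("hbase.zookeeper.property.clientPort", "2181");
        try (Connection connection = ConnectionFactory.createConnection(conf)) {
            System.out.println("Connected via the ZooKeeper quorum");
        }
    }
}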


Happy Exploring!
