Hive: The SQL-like Data Warehouse Tool for Big Data

The management of Big Data is crucial if enterprises are to benefit from the huge volumes of data they generate each day. Hive is a tool built on top of Hadoop that can help to manage this data.

OpenSource For You

Hive is a data warehouse infrastructure tool for processing structured data in Hadoop. It resides on top of Hadoop to summarise Big Data, and makes querying and analysing easy.

A little history about Apache Hive will help you understand why it came into existence. When Facebook started gathering data and ingesting it into Hadoop, the data was coming in at the rate of tens of GB per day back in 2006. Then, in 2007, it grew to 1TB/day and within a few years increased to around 15TB/day. Initially, Python scripts were written to ingest the data into Oracle databases, but with the increasing data rate and the growing diversity in the sources and types of incoming data, this was becoming difficult. The Oracle instances were filling up fast, and it was time to develop a new kind of system that could handle large amounts of data. It was Facebook that first built Hive, so that most people who had SQL skills could use the new system with minimal changes, compared to what was required with other RDBMSs.

The main features of Hive are:

It stores the schema in a database and the processed data in HDFS. It is designed for online analytical processing (OLAP).

It provides an SQL-type language for querying, called HiveQL or HQL.

It is familiar, fast, scalable and extensible.
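To give a feel for the SQL-type language listed among the features, here is a minimal sketch of a summarising query. It assumes a hypothetical table named page_views with user_name and page columns; none of these names come from Hive or from this article. SQLite, from Python's standard library, stands in for Hive so that the snippet runs without a Hadoop cluster. The query text uses only plain SELECT, COUNT, GROUP BY and ORDER BY, which HiveQL also supports.

```python
import sqlite3

# Hypothetical summarising query. The table page_views and its columns
# are made up for illustration; only basic SQL constructs are used.
SUMMARY_QUERY = """
SELECT page, COUNT(*) AS views
FROM page_views
GROUP BY page
ORDER BY views DESC, page
"""


def summarise(rows):
    """Load (user_name, page) rows into an in-memory table and run the query.

    SQLite is only a stand-in here; on a real deployment the same query
    text would be submitted to Hive instead.
    """
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE page_views (user_name TEXT, page TEXT)")
        conn.executemany("INSERT INTO page_views VALUES (?, ?)", rows)
        return conn.execute(SUMMARY_QUERY).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    sample = [("asha", "/home"), ("ravi", "/home"), ("asha", "/docs")]
    for page, views in summarise(sample):
        print(page, views)
```

The point of the sketch is the shape of the query: anyone who can write this kind of SQL can express the same summarisation in HiveQL, while Hive takes care of running it over data held in HDFS.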

Hive architecture is shown in Figure 1.

The components of Hive are listed in Table 1.

The importance of Hive in Hadoop

Apache Hive lets you work with Hadoop efficiently. It is a complete data warehouse infrastructure built on top of the Hadoop framework. Hive is well placed to query data, and to perform powerful analysis and data summarisation on large volumes of data. An integral part of Hive is HiveQL, an SQL-like query language that is used extensively to query the stored data.

Hive has the distinct advantage of supporting high-speed data reads and writes within the data warehouse while managing large data sets that are distributed across multiple locations, all thanks to its SQL-like features. It provides a structure to the data that is already stored in the database. Users can connect to Hive using a command line tool or a JDBC driver.
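As a rough sketch of the two connection routes, the helper below assembles a JDBC URL and a command line for Beeline, one of the command line clients that ships with Hive. It assumes a HiveServer2 instance listening on localhost at port 10000 with a database named default; these values, the function names and the user name are placeholders, not settings taken from this article. The functions only build the strings, so they can be inspected without a running cluster.

```python
def jdbc_url(host="localhost", port=10000, database="default"):
    """Build a HiveServer2 JDBC URL.

    Host, port and database are assumed placeholder values.
    """
    return f"jdbc:hive2://{host}:{port}/{database}"


def beeline_command(query, user=None, **url_parts):
    """Return the argument list for running one query through Beeline.

    -u gives the JDBC URL, -n the user name and -e the query to execute.
    """
    args = ["beeline", "-u", jdbc_url(**url_parts)]
    if user is not None:
        args += ["-n", user]
    args += ["-e", query]
    return args


if __name__ == "__main__":
    # Print the command instead of running it, since no server is assumed.
    print(" ".join(beeline_command("SHOW TABLES;", user="hiveuser")))
```

On a machine where Hive is installed and HiveServer2 is running, the returned list could be passed to subprocess.run. A Java program would hand the same URL to the Hive JDBC driver.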

How to implement Hive

First, go to http://apache.claz.org/hive/stable/ and download apache-hive-1.2.1-bin.tar.gz (89MB, dated 26 June 2015). Extract it manually and rename the folder to hive.
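The extract-and-rename step can also be scripted. The sketch below uses Python's standard tarfile module and assumes that the archive has already been downloaded, that it comes from a trusted source, and that it unpacks into a single top-level folder. The function name and the paths in the usage note are illustrative.

```python
import tarfile
from pathlib import Path


def extract_and_rename(archive_path, dest_dir, new_name="hive"):
    """Extract a .tar.gz archive into dest_dir and rename its single
    top-level folder to new_name. Returns the path of the renamed folder."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    target = dest / new_name

    with tarfile.open(archive_path, "r:gz") as tar:
        # Collect the first path component of every member.
        top_levels = set()
        for member in tar.getmembers():
            parts = Path(member.name).parts
            if parts:
                top_levels.add(parts[0])
        if len(top_levels) != 1:
            raise ValueError("expected exactly one top-level folder in the archive")
        top = top_levels.pop()
        # Refuse to overwrite an existing folder with the new name.
        if top != new_name and target.exists():
            raise FileExistsError(f"{target} already exists")
        tar.extractall(dest)

    if top != new_name:
        (dest / top).rename(target)
    return target
```

For example, extract_and_rename("apache-hive-1.2.1-bin.tar.gz", "/usr/local") would be expected to leave the unpacked files in /usr/local/hive, provided the process has write access to that directory.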
