Big data analytics going 100x faster with Hive and Stinger
Apache Hive was built as a data warehouse platform for analyzing data with SQL on top of the Hadoop MapReduce framework, originally for data mining and data preparation use cases. These usage patterns remain very important, but with the widespread adoption of Hadoop, the enterprise requirement for Hadoop to become more real-time or interactive has grown in importance as well.
Hence, Hortonworks launched the Stinger Initiative, which aims to let Hive answer queries within 5-30 seconds. This covers use cases such as big data exploration, visualization, and parameterized reports.
The Stinger Initiative is a collection of development threads in the Hive community that aims to deliver a 100x performance improvement along with SQL compatibility. It is a live project with three defined phases, to be delivered over the next few months, all in the open community (Phase 1 results have already demonstrated an initial 45x improvement). The following architecture provides an overview:
Stinger worked on optimization and enabled the optimizer to pick the map join automatically. It also introduced an in-memory hash join that reads the small table into a hash table and scans through the big table to generate output.
With Stinger, Hive is better suited to the decision support queries people want to run on Hadoop. This includes:
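The in-memory hash join described above can be sketched in a few lines of Python. This is only an illustration of the build-and-probe idea, not Hive's implementation; the function name `hash_join` and the sample tables are made up for the example.

```python
from collections import defaultdict


def hash_join(small_rows, big_rows, small_key, big_key):
    """Inner-join two row collections by hashing the smaller one (illustrative)."""
    # Build phase: load the small table into an in-memory hash table.
    table = defaultdict(list)
    for row in small_rows:
        table[row[small_key]].append(row)

    # Probe phase: stream through the big table once and emit matches.
    for row in big_rows:
        for match in table.get(row[big_key], ()):
            yield {**match, **row}


# Hypothetical sample data: a small dimension table and a larger fact table.
countries = [
    {"code": "US", "country": "United States"},
    {"code": "IN", "country": "India"},
]
orders = [
    {"order_id": 1, "code": "IN"},
    {"order_id": 2, "code": "US"},
    {"order_id": 3, "code": "FR"},  # no matching country, dropped by the inner join
]

joined = list(hash_join(countries, orders, "code", "code"))
```

The big table is never held in memory as a whole: only the small table is, which is why this strategy pays off when one side of the join is a small dimension table.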
- SQL Compliance
- Support window/analytical functions (OVER clause)
– Multiple PARTITION BY and ORDER BY supported
– Windowing supported (ROWS PRECEDING/FOLLOWING)
– Analytic functions (RANK, FIRST_VALUE, LAST_VALUE, LEAD / LAG)
- Data Types:
– Add fixed-point NUMERIC and DECIMAL types, and precision sizes from 1 to 53 for FLOAT
– Add VARCHAR and CHAR types with limited field size
– Add synonyms for compatibility (BLOB for BINARY, TEXT for STRING, REAL for FLOAT etc.)
- SQL Semantics:
– Sub-queries in IN, NOT IN, HAVING.
– EXISTS and NOT EXISTS
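To make the window functions in the list above more concrete, here is a rough Python emulation of what RANK() and LAG() compute over a PARTITION BY / ORDER BY ... DESC window. It only illustrates the semantics; it is not how Hive evaluates the OVER clause, and the helper `rank_and_lag` and the sample data are invented for this sketch.

```python
from itertools import groupby
from operator import itemgetter


def rank_and_lag(rows, partition_by, order_by):
    """Emulate RANK() and LAG(order_by) OVER (PARTITION BY .. ORDER BY .. DESC)."""
    out = []
    # groupby needs its input sorted by the partition key.
    by_partition = sorted(rows, key=itemgetter(partition_by))
    for _, group in groupby(by_partition, key=itemgetter(partition_by)):
        part = sorted(group, key=itemgetter(order_by), reverse=True)
        rank = 0
        prev = None
        for position, row in enumerate(part, start=1):
            # RANK(): tied rows share a rank, and the next distinct value
            # takes its position in the partition, so ranks can have gaps.
            if prev is None or row[order_by] != prev[order_by]:
                rank = position
            # LAG(): value of the previous row in the window, or None.
            lag = prev[order_by] if prev is not None else None
            out.append({**row, "rank": rank, "lag": lag})
            prev = row
    return out


# Hypothetical sample data.
sales = [
    {"region": "east", "rep": "a", "amount": 300},
    {"region": "east", "rep": "b", "amount": 500},
    {"region": "east", "rep": "c", "amount": 300},
    {"region": "west", "rep": "d", "amount": 200},
]

ranked = rank_and_lag(sales, partition_by="region", order_by="amount")
```

Each partition is ranked independently, which is what lets a single query answer questions like "top rep per region" without a self-join.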
Vector Query Engine
Stinger outlines a new architecture for the Hive query execution engine. The vectorized engine processes records in batches rather than one row at a time, which raises the number of records Hive can process per second.
The query planner generates more intelligent DAGs (Directed Acyclic Graphs) based on properties of the data being queried, e.g. table size, statistics, and histograms.
Generally, metadata and small dimension tables are frequently used in queries. Services built into YARN or Tez buffer frequently used data in memory so that it is not always read from disk.
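Vectorized execution generally means handing an operator a batch of column values per call instead of a single row. The Python sketch below contrasts the two styles for a simple filtered sum. It is a toy model and not Hive code: the function names and the sample data are assumptions made for illustration, and the batch size of 1024 is just a plausible default.

```python
def row_at_a_time(rows, min_qty):
    """Baseline: evaluate the filter and the sum once per row."""
    total = 0
    for row in rows:
        if row["qty"] >= min_qty:
            total += row["price"]
    return total


def batches(column, size):
    """Yield consecutive slices of a column, each at most `size` values long."""
    for start in range(0, len(column), size):
        yield column[start:start + size]


def vectorized(prices, qtys, min_qty, batch_size=1024):
    """Batch style: work on column slices and pass along selected positions."""
    total = 0
    for price_batch, qty_batch in zip(batches(prices, batch_size),
                                      batches(qtys, batch_size)):
        # The filter produces the positions that qualify within this batch.
        selected = [i for i, qty in enumerate(qty_batch) if qty >= min_qty]
        # The aggregate then touches only those positions.
        total += sum(price_batch[i] for i in selected)
    return total


# Hypothetical data, stored both row-wise and column-wise.
rows = [{"price": i % 7, "qty": i % 5} for i in range(5000)]
prices = [r["price"] for r in rows]
qtys = [r["qty"] for r in rows]

same_answer = row_at_a_time(rows, 3) == vectorized(prices, qtys, 3)
```

In an engine written in a compiled language, the per-batch loops run over arrays of primitives, which is where the real speedup comes from; Python only shows the shape of the idea.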
Tez eliminates the latency and throughput constraints that result from Hive’s reliance on MapReduce. Tez optimizes Hive job execution by eliminating unnecessary tasks, synchronization barriers, and reads from and writes to HDFS. This optimizes the execution chain within Hadoop and drastically speeds up Hive’s workload processing. It does not write intermediate output to HDFS and hence lightens disk and network usage. It also enables the pipelining of jobs.
The Hive community introduced a new columnar file format, ORCFile, to provide a more modern, efficient, and high-performing way to store Hive data. Benefits of ORC files:
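The pipelining idea can be pictured with Python generators, where each stage hands rows straight to the next instead of writing a full intermediate result first. This is only an analogy for the execution model described above and uses no Tez API; the stage functions and the sample data are hypothetical.

```python
def scan(rows):
    """Source stage: emit rows one by one."""
    for row in rows:
        yield row


def keep(rows, predicate):
    """Filter stage: forward matching rows as soon as they arrive."""
    for row in rows:
        if predicate(row):
            yield row


def sum_by(rows, key, value):
    """Final stage: aggregate the stream into per-key totals."""
    totals = {}
    for row in rows:
        totals[row[key]] = totals.get(row[key], 0) + row[value]
    return totals


# Hypothetical input data.
clicks = [
    {"page": "home", "ms": 120},
    {"page": "cart", "ms": 1500},
    {"page": "home", "ms": 80},
    {"page": "cart", "ms": 300},
]

# Stages are chained lazily; no intermediate list is ever materialized.
pipeline = keep(scan(clicks), lambda row: row["ms"] < 1000)
totals = sum_by(pipeline, "page", "ms")
```

A chain of MapReduce jobs would correspond to storing the output of each stage before starting the next one; the streaming hand-off between stages is the part this sketch is meant to show.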
- Tightly aligned to Hive data model
- Decompose complex row types into primitive fields for better compression and projection
- Only read bytes from HDFS for the required columns.
- Store column level aggregates in the files.
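The last two benefits can be illustrated with a toy in-memory model: values are stored column by column in chunks (ORC calls them stripes), each with min/max statistics, so a reader can fetch a single column and skip chunks that cannot match. This is a simplified sketch, not the ORC on-disk layout; `write_stripes`, `scan_column`, and the sample rows are assumptions made for the example.

```python
def write_stripes(rows, columns, stripe_size):
    """Split rows into stripes, storing each column separately with min/max stats."""
    stripes = []
    for start in range(0, len(rows), stripe_size):
        chunk = rows[start:start + stripe_size]
        stripe = {}
        for col in columns:
            values = [row[col] for row in chunk]
            stripe[col] = {"values": values, "min": min(values), "max": max(values)}
        stripes.append(stripe)
    return stripes


def scan_column(stripes, column, lower_bound):
    """Return values of one column >= lower_bound, skipping stripes via stats."""
    result = []
    stripes_read = 0
    for stripe in stripes:
        col = stripe[column]  # only the requested column is touched
        if col["max"] < lower_bound:
            continue  # the statistics prove that no value here qualifies
        stripes_read += 1
        result.extend(v for v in col["values"] if v >= lower_bound)
    return result, stripes_read


# Hypothetical table with ids 1..8 and amounts 10..80.
rows = [{"id": i, "amount": i * 10} for i in range(1, 9)]
stripes = write_stripes(rows, columns=["id", "amount"], stripe_size=4)

values, stripes_read = scan_column(stripes, "amount", lower_bound=50)
```

With a stripe size of 4, the first stripe holds amounts 10 to 40, so the query for amounts of at least 50 reads only the second stripe and never looks at the `id` column.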
With every passing day, the discoveries around big data analytics keep getting better, and in the times to come we are going to see the hype around big data finally reach the Peak of Inflated Expectations on Gartner’s Hype Cycle. But soon enough, the enlightenment and productivity phases will follow. With the vast amount of data being generated at a rapid pace, big data analytics is the future. It is time for organizations to start preparing ahead to use their own big data through custom dashboard designs.