From 0 to 1: Hive for Processing Big Data

  • Lifetime Access
  • Certificate on Completion
  • Access on Android and iOS App
About this Course

Prerequisites: Hive requires knowledge of SQL. The course includes an SQL primer at the end; please do that first if you don't know SQL. You'll need to know Java if you want to follow the sections on custom functions.

Taught by a 4-person team including 2 Stanford-educated ex-Googlers and 2 ex-Flipkart Lead Analysts. This team has decades of practical experience working with large-scale data.

Hive is like a new friend with an old face (SQL). This course is an end-to-end, practical guide to using Hive for Big Data processing. 

Let's parse that 

A new friend with an old face: Hive helps you leverage the power of distributed computing and Hadoop for analytical processing. Its interface is like an old friend: the very SQL-like HiveQL. This course will fill in all the gaps between SQL and what you need to use Hive.

End-to-End: The course is an end-to-end guide for using Hive: whether you are an analyst who wants to process data or an engineer who needs to build custom functionality or optimize performance - everything you'll need is right here. New to SQL? No need to look elsewhere: the course has a primer on all the basic SQL constructs.

Practical: Everything is taught using real-life examples, working queries and code.

What's Covered: 

  • Analytical Processing: Joins, Subqueries, Views, Table Generating Functions, Explode, Lateral View, Windowing and more
  • Tuning Hive for better performance: Partitioning, Bucketing, Join Optimizations, Map Side Joins, Indexes
  • Extending Hive: Writing custom User Defined Functions in Java (UDF, UDAF, GenericUDF, GenericUDTF), custom functions in Python, and the MapReduce implementations behind Select, Group By and Join
  • For SQL Newbies: SQL In Great Depth

Using discussion forums
  • Please use the discussion forums on this course to engage with other students and to help each other out. Unfortunately, much as we would like to, it is not possible for us at Loonycorn to respond to individual questions from students :-(
  • We're super small and self-funded with only 2 people developing technical video content. Our mission is to make high-quality courses available at super low prices.
  • The only way to keep our prices this low is to *NOT offer additional technical support over email or in-person*. The truth is, direct support is hugely expensive and just does not scale.
  • We understand that this is not ideal and that a lot of students might benefit from this additional support. Hiring resources for additional support would make our offering much more expensive, thus defeating our original purpose.

It is a hard trade-off.

Thank you for your patience and understanding!

Who is the target audience?

  • Yep! Analysts who want to write complex analytical queries on large scale data
  • Yep! Engineers who want to know more about managing Hive as their data warehousing solution 
Basic knowledge
  • Hive requires knowledge of SQL. If you don't know SQL, please head to the SQL primer at the end of the course first
  • You'll need to know Java if you are interested in the sections on custom user defined functions
  • No other prerequisites: The course covers everything you need to install Hive and run queries!
What you will learn
  • Write complex analytical queries on data in Hive and uncover insights
  • Leverage partitioning and bucketing to optimize queries in Hive
  • Customize Hive with user-defined functions in Java and Python
  • Understand what goes on under the hood of Hive with HDFS and MapReduce
Curriculum
Number of Lectures: 86
Total Duration: 15:04:26
You, Us & This Course
  • You, Us & This Course  

    We start with an introduction. What is the course about? What will you know at the end of the course?

Introducing Hive
  • Hive: An Open-Source Data Warehouse  

    Data warehousing systems - which have become all the rage with the rise of 'Big Data' - are quite different from traditional transaction processing systems. Hive is a prototypical data warehousing system.

  • Hive and Hadoop  

    Hive is built atop Hadoop, and can even be characterized as the SQL skin atop Hadoop MapReduce.

  • Hive vs Traditional Relational DBMS  

    Hive tries really hard - and mostly succeeds - at pretending to be a relational DBMS, but really, under the hood it's quite different - understand how, and understand schema-on-read.

  • HiveQL and SQL  

    Now that we understand the differences between Hive and a traditional RDBMS, the differences between HiveQL and SQL will seem a lot less annoying and arbitrary.

Hadoop and Hive Install
  • Hadoop Install Modes  

    Before we install Hive, we need to install Hadoop. Hadoop has 3 different install modes - Standalone, Pseudo-distributed and Fully Distributed. Get an overview of when to use each.

  • Hadoop Install Step 1 : Standalone Mode  

    How to set up Hadoop in the standalone mode. Windows users need to install a Virtual Linux instance before this video.

  • Hive install  

    If you are all set with Hadoop, let's go ahead and install Hive.

  • Code-Along: Getting started  

    Let's run a few basic queries on Hive. Head on over to the SQL primer section at the end of the course if you have no previous experience in SQL.

Hadoop and HDFS Overview
  • What is Hadoop?  

    What exactly is Hadoop? Its origins and its logical components explained.

  • HDFS or the Hadoop Distributed File System  

    HDFS, based on GFS (the Google File System), is the storage layer within Hadoop. It stores files in blocks of 128MB.

Hive Basics
  • Primitive Datatypes  

    Let's cycle through primitive datatypes in Hive.

  • Collections: Arrays and Maps

    Hive has some really cool datatypes - collections that make it feel like there is a real programming language under the hood. Oh, and btw - there is!

  • Structs and Unions  

    Structs and unions are yet another bit of Hive that seem more at home in a programming language.

  • Create Table  

    Let's get into the nitty-gritty - starting with creating tables. Remember schema-on-read?
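
    For a flavor of the syntax, here is a minimal sketch that pulls together the datatypes above - the employees table, its columns and its delimiters are all hypothetical:

        -- Hypothetical table mixing primitive and collection datatypes
        CREATE TABLE employees (
          name       STRING,
          salary     FLOAT,
          skills     ARRAY<STRING>,
          deductions MAP<STRING, FLOAT>,
          address    STRUCT<street:STRING, city:STRING, zip:STRING>
        )
        ROW FORMAT DELIMITED
        FIELDS TERMINATED BY ','
        COLLECTION ITEMS TERMINATED BY '|'
        MAP KEYS TERMINATED BY ':';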

  • Insert Into Table  

    Inserting into tables has a few quirks in Hive because, after all, all writes are just data dumps that know nothing about the schema.

  • Insert into Table 2  

    More on inserts - remember that no schema checking happens during database writes!
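
    A minimal sketch of the two common write paths, reusing the hypothetical employees table:

        -- Bulk load: the file is simply moved into the table's HDFS directory,
        -- with no schema checking at write time (schema-on-read)
        LOAD DATA LOCAL INPATH '/tmp/employees.csv' INTO TABLE employees;

        -- Insert via a query from another (hypothetical) staging table
        INSERT INTO TABLE employees
        SELECT name, salary, skills, deductions, address FROM staging_employees;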

  • Alter Table  

    Alter table works in Hive - understand how.
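
    A couple of minimal examples - remember that these touch only the metastore, never the data files:

        ALTER TABLE employees ADD COLUMNS (title STRING);  -- metadata-only change
        ALTER TABLE employees RENAME TO staff;  -- for managed tables, the HDFS directory is renamed as well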

  • HDFS  

    Hive data is stored as files on HDFS, the distributed file system that is an integral part of Hadoop. Understanding the physical layout of Hive tables will make many advanced concepts - bucketing and partitioning - far clearer.

  • HDFS CLI - Interacting with HDFS  

    Learn how to interact with HDFS. This comes in handy if you want to understand what's going on under the hood of your Hive Queries.

  • Code-Along: Create Table  

    Let's create a few tables and see how to insert data. We'll see external tables as well and what happens under the hood in HDFS with each of these activities.

  • Code-Along: Hive CLI

    The Hive CLI allows you to run scripts and execute queries directly from the command line rather than from the Hive shell.

Built-in Functions
  • Three types of Hive functions  

    Hive has a whole bunch of useful functions available out-of-the-box. This is an introduction to the 3 types of functions available: standard, aggregate and table generating functions.
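
    You can explore them straight from the Hive shell:

        SHOW FUNCTIONS;                       -- list all registered functions
        DESCRIBE FUNCTION explode;            -- one-line help for a function
        DESCRIBE FUNCTION EXTENDED explode;   -- help plus usage examples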

  • The Case-When statement, the Size function, the Cast function  

    The case-when statement is very useful to populate columns by evaluating conditions. Size() and Cast() are other useful built-in functions.
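
    A minimal sketch against the hypothetical employees table:

        SELECT name,
               CASE WHEN salary < 50000  THEN 'low'
                    WHEN salary < 100000 THEN 'mid'
                    ELSE 'high' END    AS salary_band,
               size(skills)            AS num_skills,  -- elements in the array
               cast(salary AS BIGINT)  AS salary_int
        FROM employees;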

  • The Explode function  

    explode() is a very interesting table generating function which expands an array to produce a row for every element in the array.
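
    For instance, with the hypothetical employees table from earlier:

        -- One output row per element of the skills array
        SELECT explode(skills) AS skill FROM employees;

        -- LATERAL VIEW keeps other columns alongside the exploded rows
        SELECT name, skill
        FROM employees LATERAL VIEW explode(skills) skills_tbl AS skill;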

  • Code-Along: Hive Built-in Functions

    Let's see Hive's built-in functions in action.

Sub-Queries
  • Quirky Sub-Queries  

    Sub-queries in Hive are rather quirky. For instance, union is fine, but intersect is not.

  • More on subqueries: Exists and In  

    Sub-queries have a few rather arcane rules - no equality signs, and some rather specific rules on exists and in.
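
    Two sketches that stay within those rules (award_winners is a hypothetical table):

        -- IN with an uncorrelated subquery
        SELECT name FROM employees
        WHERE name IN (SELECT name FROM award_winners);

        -- EXISTS with a correlated subquery (equi-predicates only)
        SELECT e.name FROM employees e
        WHERE EXISTS (SELECT 1 FROM award_winners w WHERE w.name = e.name);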

  • Inserting via subqueries  

    It is possible to insert data into a table using subqueries - just don't try to specify any schema information!
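
    A minimal sketch (high_earners is a hypothetical, pre-created table):

        -- Note: no column list on the target - Hive matches columns by position
        INSERT OVERWRITE TABLE high_earners
        SELECT name, salary FROM employees WHERE salary > 100000;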

  • Code-Along: Use Subqueries to Work with Collection Datatypes

    Let's use subqueries to work with Hive's collection datatypes.

  • Views  

    Views are an awesome bit of functionality in Hive - use them. Oh, btw, views are non-materialized, if that means anything to you. If not - never mind!
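
    For example:

        -- Non-materialized: the query text is stored, not the data
        CREATE VIEW senior_staff AS
        SELECT name, salary FROM employees WHERE salary > 100000;

        SELECT * FROM senior_staff;  -- expanded into the underlying query at run time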

Partitioning
  • Indices  

    Indexes are just a lot less important in Hive than they are in a traditional RDBMS. Understand why, and also how they can be used.
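
    For reference, the DDL looked roughly like this (indexes were removed entirely in Hive 3.0):

        CREATE INDEX salary_idx ON TABLE employees (salary)
        AS 'COMPACT' WITH DEFERRED REBUILD;  -- 'COMPACT' names the index handler

        ALTER INDEX salary_idx ON employees REBUILD;  -- actually build the index data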

  • Partitioning Introduced  

    Partitioning in Hive is conceptually similar to indexing in a traditional DBMS - a way to quickly look up rows with specific values in a particular column.

  • The Rationale for Partitioning  

    Let's understand the why of partitioning.

  • How Tables are Partitioned  

    Partitioning needs to be specified at the time of table creation - understand the syntax.
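
    A minimal sketch (the orders table is hypothetical) - note that the partition column lives outside the main column list:

        -- Each country value becomes an HDFS directory: .../orders/country=US/
        CREATE TABLE orders (
          order_id BIGINT,
          amount   DOUBLE
        )
        PARTITIONED BY (country STRING);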

  • Using Partitioned Tables  

    Once a table has been partitioned appropriately, using it is not a lot of work.

  • Dynamic Partitioning: Inserting data into partitioned tables  

    Inserting data into partitioned tables can be a bit tedious - understand how dynamic partitioning can help!
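
    A sketch of a dynamic-partition insert, assuming the hypothetical orders table above:

        SET hive.exec.dynamic.partition = true;
        SET hive.exec.dynamic.partition.mode = nonstrict;  -- allow all partitions to be dynamic

        -- The partition column comes last in the SELECT
        INSERT OVERWRITE TABLE orders PARTITION (country)
        SELECT order_id, amount, country FROM staging_orders;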

  • Code-Along: Partitioning

    Let's see partitioning in action!

Bucketing
  • Introducing Bucketing  

    Bucketing is conceptually quite close to partitioning - and indeed to Indexing in traditional RDBMS - but with a key difference.

  • The Advantages of Bucketing  

    Bucketing has an important advantage over partitioning - unlike an explosion of partitions, it won't take down the metastore.

  • How Tables are Bucketed  

    Bucketing needs to be specified at the time of table creation - understand how.
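
    A minimal sketch (the users table is hypothetical):

        -- Rows are hashed on user_id into 32 files per table (or per partition)
        CREATE TABLE users (
          user_id BIGINT,
          name    STRING
        )
        CLUSTERED BY (user_id) INTO 32 BUCKETS;

        SET hive.enforce.bucketing = true;  -- older Hive versions need this so inserts honor the buckets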

  • Using Bucketed Tables  

    Once a table has been bucketed, using it is not that difficult.

  • Sampling  

    Sampling is a very handy technique in a data warehouse, and bucketing helps power this functionality.
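
    For instance, against the bucketed users table sketched above:

        -- Reads just 1 of the 32 buckets: roughly a 1/32 sample, without a full scan
        SELECT * FROM users TABLESAMPLE(BUCKET 1 OUT OF 32 ON user_id);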

Windowing
  • Windowing Introduced  

    Windowing functions start to get at the real number-crunching power of Hive. In effect, they help tack on a new column to a query result - and that column contains the results of aggregate functions on a window of rows.

  • Windowing - A Simple Example: Cumulative Sum  

    Let's use windowing to set up a running total, aka a cumulative sum, for revenues in a sales table.

  • Windowing - A More Involved Example: Partitioning  

    Let's now make that running sum reset each day - combining the power of windowing and the power of partitioning.
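
    Both flavors in one sketch (the sales table and its sale_date and sale_time columns are hypothetical):

        -- Running total across all rows
        SELECT sale_date, revenue,
               SUM(revenue) OVER (ORDER BY sale_date, sale_time) AS running_total
        FROM sales;

        -- The same total, resetting at the start of each day
        SELECT sale_date, revenue,
               SUM(revenue) OVER (PARTITION BY sale_date ORDER BY sale_time) AS daily_running_total
        FROM sales;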

  • Windowing - Special Aggregation Functions  

    Row_number, rank, lead and lag - Hive places really nifty windowing functions at your disposal.
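
    A quick taste, using the hypothetical employees table:

        SELECT name, salary,
               row_number() OVER (ORDER BY salary DESC) AS row_num,
               rank()       OVER (ORDER BY salary DESC) AS salary_rank,
               lag(salary)  OVER (ORDER BY salary DESC) AS prev_salary  -- previous row's value
        FROM employees;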

Understanding MapReduce
  • The basic philosophy underlying MapReduce  

    In the world of MapReduce, every problem can be thought of in terms of key-value pairs. Map transforms key-value pairs in a meaningful way; they are then sorted and merged, and Reduce combines key-value pairs in a meaningful way.

  • MapReduce - Visualized and Explained  

    If you're learning MapReduce for the very first time - it's best to visualize what exactly it does before you get down into the little details.

  • MapReduce - Digging a little deeper at every step  

    What really goes on with a single record as it flows through the map and then reduce phase?

MapReduce logic for queries: Behind the scenes
  • MapReduce Overview: Basic Select-From-Where  

    In order to write optimal Hive queries, you need to know how a query runs behind the scenes. If you understand the MapReduce job, then you'll know exactly how to tweak the queries to get the best performance. This lecture covers the MapReduce that runs for simple SELECT-FROM-WHERE queries. No code, just pure MapReduce logic.

  • MapReduce Overview: Group-By and Having  

    The data flow and logic behind the MapReduce for queries which include the GROUP-BY and HAVING clauses.

  • MapReduce Overview: Joins  

    Joins are super complicated and join optimization is hard to get right unless you understand what's happening under the hood. This lecture explains the logic behind the Join MapReduce operation.

Join Optimizations in Hive
  • Improving Join performance with tables of different sizes  

    How can you speed up your joins in Hive? If you have tables of different sizes, there are a whole bunch of little things that you can do.

  • The Where clause in Joins  

    The WHERE clause is evaluated after the join - what are the implications of this on performance?
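
    One common workaround, sketched with hypothetical tables, is to filter inside a subquery so fewer rows ever reach the join:

        SELECT o.order_id, c.name
        FROM (SELECT * FROM orders WHERE amount > 100) o
        JOIN countries c ON (o.country = c.code);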

  • The Left Semi Join  

    The Left Semi Join is an efficient way to implement WHERE-IN and EXISTS queries - it is much faster than the equivalent regular join.
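
    For instance:

        -- Equivalent to: WHERE e.name IN (SELECT name FROM award_winners);
        -- only columns from the left table may appear in the SELECT
        SELECT e.name, e.salary
        FROM employees e LEFT SEMI JOIN award_winners w ON (e.name = w.name);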

  • Map Side Joins: The Inner Join  

    There are certain kinds of joins which can be implemented as map-only joins - the Reduce phase doesn't need to run at all in the underlying MapReduce operations. These are highly performant and efficient joins.

    Not all joins are map side joins, though: whether one qualifies depends on the type of join and the size of the tables participating in it.
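
    Two ways to trigger one, sketched with hypothetical tables:

        -- Explicitly hint the small table into memory
        SELECT /*+ MAPJOIN(c) */ o.order_id, c.name
        FROM orders o JOIN countries c ON (o.country = c.code);

        -- Or let Hive convert joins automatically based on table sizes
        SET hive.auto.convert.join = true;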

  • Map Side Joins: The Left, Right and Full Outer Joins  

    Under what conditions can the Left Outer, Right Outer and Full Outer joins be map side joins?

  • Map Side Joins: The Bucketed Map Join and the Sorted Merge Join  

    Even more optimal than map side joins are bucketed map joins. Bucketing the join tables speeds up the Join tremendously and sorting each bucket is even better!
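
    The relevant switches - both tables must be bucketed (and, for the sort-merge variant, sorted) on the join key:

        SET hive.optimize.bucketmapjoin = true;
        SET hive.optimize.bucketmapjoin.sortedmerge = true;  -- only if each bucket is sorted on the join key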

Custom Functions in Python
  • Custom functions in Python  

    Custom functions in Hive can be implemented in Python as well as Java.

  • Code-Along: Custom Function in Python

    Let's see how to implement and run a custom Python Script.
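
    The general shape, with my_script.py standing in for whatever script you write:

        ADD FILE /path/to/my_script.py;  -- ship the script to every node

        -- Stream tab-separated rows through the script's stdin/stdout
        SELECT TRANSFORM (name, salary)
        USING 'python my_script.py'
        AS (name STRING, salary_band STRING)
        FROM employees;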

Custom functions in Java
  • Introducing UDFs - you're not limited by what Hive offers  

    Hive allows you to define your own custom functions which you can reference from Hive queries. This requires you to write code in a programming language like Python or Java.
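
    Once the Java code is compiled into a jar, wiring it up takes just two statements (the jar path and class name here are hypothetical):

        ADD JAR /path/to/my_udfs.jar;
        CREATE TEMPORARY FUNCTION replacetext AS 'com.example.hive.ReplaceText';

        SELECT replacetext(name, 'Mr. ', '') FROM employees;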

  • The Simple UDF: The standard function for primitive types  

    The simple User Defined Function (UDF) implements a standard function (one row of input == one row of output) which only works with primitive data types. An overview of how this can be done with Hive-provided Java libraries.

    Requires familiarity with Java before you can follow all that's happening.

  • The Simple UDF: Java implementation for replacetext()  

    See actual Java code to implement the replacetext() function. Requires familiarity with Java before you can follow all that's happening.

  • Generic UDFs, the Object Inspector and DeferredObjects  

    The generic UDF implements standard functions (one row of input == one row of output) for both primitive and collection data types. 

    This can get pretty complicated as it involves a whole bunch of new concepts such as lazy evaluation of data, the Object Inspector to get the right data types and DeferredObjects to represent data in raw byte form. This lecture looks at all of these in some detail.

  • The Generic UDF: Java implementation for containsstring()  

    See actual Java code to implement the containsstring() function as a generic UDF. Requires familiarity with Java before you can follow all that's happening.

  • The UDAF: Custom aggregate functions can get pretty complex  

    Writing custom aggregate functions requires a deep understanding of how the MapReduce for aggregates works. The classes that you implement have methods which plug into the map and the reduce process.

  • The UDAF: Java implementation for max()  

    max() is a simple aggregate function which should help you get your feet wet with UDAFs.

  • The UDAF: Java implementation for Standard Deviation  

    Calculating standard deviations is a harder problem using UDAFs. This lecture has complete code in Java for how you can do this.

  • The Generic UDTF: Custom table generating functions  

    Table generating functions generate multiple rows from a single row of input. This lecture walks through how you can write a generic user defined table generating function (UDTF).

  • The Generic UDTF: Java implementation for namesplit()  

    Java code to implement the namesplit() table generating function. You'll meet ObjectInspectors once again in this one - it's pretty mind-bending.

SQL Primer - Select Statements
  • Select Statements  

    Select statements in SQL and HiveQL are remarkably alike. Let's go through them assuming no prior knowledge.

  • Select Statements 2  

    More examples of select statements - practice makes perfect!
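
    In that spirit, two of the most minimal examples possible (the employees table is hypothetical):

        SELECT name, salary FROM employees;            -- pick columns
        SELECT * FROM employees WHERE salary > 50000;  -- filter rows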

  • Operator Functions  

    Operators in HiveQL are not very different from those in SQL.

SQL Primer - Group By, Order By and Having
  • Aggregation Operators Introduced  

    SUM, MAX, MIN, COUNT and AVG are aggregate operators - by definition, they operate over a group of rows rather than a single row.

  • The Group By Clause  

    We discussed how aggregation operators need a range of rows to function on. What can that range be? It could be the entire table, but even more likely it's some group of the rows in a table, defined by the GROUP BY operator.

  • More Group By Examples  

    Let's keep going with the GROUP BY, and understand how it divvies up the data in a database.

  • Order By  

    We can order the results of a query by one or more columns using ORDER BY. Remember that relations are technically bags (i.e. multisets) which do not possess order - but this is a convenience that DBMSs make available.

  • Having  

    Having is an operator that filters out groups based on a condition. It's like the WHERE clause, but it operates on groups rather than individual rows.
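
    All three clauses together in one sketch (department is a hypothetical column):

        SELECT department, AVG(salary) AS avg_salary
        FROM employees
        GROUP BY department             -- one group per department
        HAVING AVG(salary) > 60000      -- filter whole groups
        ORDER BY avg_salary DESC;       -- sort the surviving groups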

SQL Primer - Joins
  • Introduction to SQL Joins  

    The full power of databases emerges when we link tables - and joins are the way to accomplish this.

  • Cross Joins aka Cartesian Joins  

    Cross Joins are conceptually simple, which is great, because they are the underpinning of Inner Joins.

  • Inner Joins  

    Inner Joins are your best friend. Understand them for what they are: cross joins with a filter condition.
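
    Exactly that, in miniature (the departments table and dept_id column are hypothetical):

        -- Only employees with a matching department row survive
        SELECT e.name, d.location
        FROM employees e JOIN departments d ON (e.dept_id = d.dept_id);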

  • Left Outer Joins  

    Outer Joins are really useful if used right. They are a little tricky though - understand how they work, and why you should not be surprised to see NULLs in the result of an outer join.
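
    The same join as above, made outer - note the NULLs:

        -- Every employee appears; location is NULL where no department matches
        SELECT e.name, d.location
        FROM employees e LEFT OUTER JOIN departments d ON (e.dept_id = d.dept_id);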

  • Right, Full Outer Joins, Natural Joins, Self Joins

    Once we've understood Inner and Outer Joins, Natural Joins are easy-peasy.

Appendix
  • [For Linux/Mac OS Shell Newbies] Path and other Environment Variables  

    If you are unfamiliar with software that requires working with a shell/command-line environment, this video will be helpful for you. It explains how to update the PATH environment variable, which is needed to set up most Linux/Mac shell-based software.

  • Setting up a Virtual Linux Instance - For Windows Users  

    Hadoop is built for Linux/Unix systems. If you are on Windows, you can set up a Linux virtual machine on your computer and use that for the install.
