Course: Learn By Example: Hadoop, MapReduce for Big Data problems

  • Lifetime Access
  • Certificate on Completion
  • Access on Android and iOS App
About this Course

Taught by a 4-person team including 2 Stanford-educated ex-Googlers and 2 ex-Flipkart Lead Analysts. This team has decades of practical experience in working with Java and with billions of rows of data. 

This course is a zoom-in, zoom-out, hands-on workout involving Hadoop, MapReduce and the art of thinking parallel. 

Let’s parse that.

Zoom-in, Zoom-Out: This course is both broad and deep. It covers the individual components of Hadoop in great detail, and also gives you a higher level picture of how they interact with each other. 

Hands-on workout involving Hadoop, MapReduce: This course will get you hands-on with Hadoop very early on. You'll learn how to set up your own cluster using both VMs and the Cloud. All the major features of MapReduce are covered, including advanced topics like Total Sort and Secondary Sort. 

The art of thinking parallel: MapReduce completely changed the way people thought about processing Big Data. Breaking down any problem into parallelizable units is an art. The examples in this course will train you to "think parallel". 

What's Covered: Lots of cool stuff...

Using MapReduce to: 

Recommend friends in a Social Networking site: Generate Top 10 friend recommendations using a Collaborative filtering algorithm. 

Build an Inverted Index for Search Engines: Use MapReduce to parallelize the humongous task of building an inverted index for a search engine. 

Generate Bigrams from text: Generate bigrams and compute their frequency distribution in a corpus of text. 

Build your Hadoop cluster: 

Install Hadoop in Standalone, Pseudo-Distributed and Fully Distributed modes 

Set up a Hadoop cluster using Linux VMs.

Set up a cloud Hadoop cluster on AWS with Cloudera Manager.

Understand HDFS, MapReduce and YARN and their interaction 

Customize your MapReduce Jobs: 

Chain multiple MR jobs together

Write your own Customized Partitioner

Total Sort: Globally sort a large amount of data by sampling input files

Secondary sorting 

Unit tests with MRUnit

Integrate with Python using the Hadoop Streaming API

.. and of course all the basics: 

MapReduce: Mapper, Reducer, Sort/Merge, Partitioning, Shuffle and Sort

HDFS & YARN: Namenode, Datanode, Resource manager, Node manager, the anatomy of a MapReduce application, YARN Scheduling, Configuring HDFS and YARN to performance tune your cluster. 

Using discussion forums

Please use the discussion forums on this course to engage with other students and to help each other out. Unfortunately, much as we would like to, it is not possible for us at Loonycorn to respond to individual questions from students:-(

We're super small and self-funded with only 2 people developing technical video content. Our mission is to make high-quality courses available at super low prices.

The only way to keep our prices this low is to *NOT offer additional technical support over email or in-person*. The truth is, direct support is hugely expensive and just does not scale.

We understand that this is not ideal and that a lot of students might benefit from this additional support. Hiring resources for additional support would make our offering much more expensive, thus defeating our original purpose.

It is a hard trade-off.

Thank you for your patience and understanding!

Who is the target audience?

Yep! Analysts who want to leverage the power of HDFS where traditional databases don't cut it anymore

Yep! Engineers who want to develop complex distributed computing applications to process lots of data

Yep! Data Scientists who want to add MapReduce to their bag of tricks for processing data

Basic knowledge
  • You'll need an IDE where you can write Java code or open the source code that's shared. IntelliJ and Eclipse are both great options.
  • You'll need some background in Object-Oriented Programming, preferably in Java. All the source code is in Java and we dive right in without going into objects, classes, etc.
  • A bit of exposure to Linux/Unix shells would be helpful, but it won't be a blocker
What you will learn
  • Develop advanced MapReduce applications to process Big Data
  • Master the art of "thinking parallel" - how to break up a task into Map/Reduce transformations
  • Self-sufficiently set up your own mini-Hadoop cluster, whether it's a single node, a physical cluster or in the cloud.
  • Use Hadoop + MapReduce to solve a wide variety of problems: from NLP to Inverted Indices to Recommendations
  • Understand HDFS, MapReduce and YARN and how they interact with each other
  • Understand the basics of performance tuning and managing your own cluster
Curriculum
Number of lectures: 73
Total duration: 13:46:00
Introduction
  • You, this course and Us  

    We start off with an introduction on what this course is all about.

Why is Big Data a Big Deal
  • The Big Data Paradigm  

    Big data may be a cliched term, but what does it really mean? Where does this data come from and why is it big?

  • Serial vs Distributed Computing  

    Distributed computing makes processing very fast - but why? Let's take a simple example and see why distributed computing is so powerful.

  • What is Hadoop?  

    What exactly is Hadoop? Its origins and its logical components explained.

  • HDFS or the Hadoop Distributed File System  

    HDFS, based on GFS (the Google File System), is the storage layer within Hadoop. It stores files in blocks of 128 MB by default.

  • MapReduce Introduced  

    MapReduce is the framework which allows developers to write massively parallel programs without worrying about the underlying details of distributed computing. The developer simply implements the map() and reduce() functions in order to crunch large input sets of data.

  • YARN or Yet Another Resource Negotiator  

    YARN is responsible for managing resources in the Hadoop cluster. It was introduced in Hadoop 2.0.
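
To make the block-size figure above concrete, here is a small back-of-the-envelope sketch in Python (the course's own code is in Java). The function names are made up for illustration, and the replication factor of 3 is an assumption about the cluster's configuration, not something this page states.

```python
import math

BLOCK_SIZE_MB = 128  # block size quoted in the HDFS lecture description


def hdfs_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Number of blocks a file of the given size is split into."""
    return math.ceil(file_size_mb / block_size_mb)


def raw_storage_mb(file_size_mb, replication=3):
    """Raw disk used across the cluster once every block is replicated.

    replication=3 is an assumed setting; the final, partially filled block
    only occupies as much disk as the data it holds.
    """
    return file_size_mb * replication
```

Under these assumptions, a 1000 MB file is split into 8 blocks (seven full ones and a partial one) and takes up about 3000 MB of raw disk across the cluster.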

Installing Hadoop in a Local Environment
  • Hadoop Install Modes  

    Hadoop has 3 different install modes - Standalone, Pseudo-Distributed and Fully Distributed. Get an overview of when to use each.

  • Hadoop Standalone mode Install  

    How to set up Hadoop in the standalone mode. Windows users need to install a Virtual Linux instance before this video.

  • Hadoop Pseudo-Distributed mode Install  

    Set up Hadoop in the Pseudo-Distributed mode. All Hadoop services will be up and running!

The MapReduce "Hello World"
  • The basic philosophy underlying MapReduce  

    In the world of MapReduce, every problem can be thought of in terms of key-value pairs. Map transforms each input key-value pair in a meaningful way, the results are sorted and merged, and reduce then combines the key-value pairs that share a key in a meaningful way.

  • MapReduce - Visualized And Explained  

    If you're learning MapReduce for the very first time - it's best to visualize what exactly it does before you get down into the little details.

  • MapReduce - Digging a little deeper at every step  

    What really goes on with a single record as it flows through the map and then reduce phase?

  • "Hello World" in MapReduce  

    Counting the number of times a word occurs in input text is the Hello World of MapReduce. This was the very first example given in Jeff Dean and Sanjay Ghemawat's original paper on MapReduce.

  • The Mapper  

    Nothing is real unless it is in code. Setting up our very first Mapper.

  • The Reducer  

    Nothing is real unless it is in code. Setting up our very first Reducer.

  • The Job  

    Nothing is real unless it is in code. Setting up our very first MapReduce Job.
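
For a feel of the flow before writing any Hadoop code, here is a minimal single-machine sketch of word count in Python. It only imitates the map, sort/merge and reduce steps described in this section. The lectures build the real Mapper, Reducer and Job in Java, and the function names below are illustrative, not Hadoop APIs.

```python
from collections import defaultdict


def mapper(line):
    """Map: emit a (word, 1) pair for every word in one line of input."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Sort/merge: collect all values per key, returned in key order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reducer(word, counts):
    """Reduce: combine all the counts seen for one word."""
    return word, sum(counts)


def word_count(lines):
    pairs = (pair for line in lines for pair in mapper(line))
    return dict(reducer(word, counts) for word, counts in shuffle(pairs))
```

Calling `word_count(["the cat sat", "the cat ran"])` counts "the" and "cat" twice and "sat" and "ran" once. On a cluster the same three steps run on many machines at once, which is the whole point.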

Run a MapReduce Job
  • Get comfortable with HDFS  

    Learn how to use HDFS's command line interface and add data to HDFS to run your jobs on.

  • Run your first MapReduce Job  

    Run your very first MapReduce Job. We'll also explore the Web interface for YARN and HDFS and see how to track your jobs.

Juicing your MapReduce - Combiners, Shuffle and Sort and The Streaming API
  • Parallelize the reduce phase - use the Combiner  

    The reduce phase can be optimized by combining the output of the map phase at the map node itself. This is an optimization of the reduce phase to allow it to work on data that has been "partially reduced".

  • Not all Reducers are Combiners  

    Using a Combiner should not change the output of the MapReduce, which means not every Reducer can work as a combine function.

  • How many mappers and reducers does your MapReduce have?  

    The number of mapper processes depends on the number of input splits of your data. It's not really in your control. What you, as a developer, do control is the number of reducers.

  • Parallelizing reduce using Shuffle And Sort  

    In order to have more than one Reducer work on your map data, you need partitions. Visualize how partitions and shuffle and sort work.

  • MapReduce is not limited to the Java language - Introducing the Streaming API  

    The Hadoop Streaming API uses the standard input and output to communicate with mapper and reducer functions in any language. Understand how Hadoop interacts with mappers and reducers in other languages.

  • Python for MapReduce  

    It's not real till it's in code. Implement the word count MapReduce example in Python using the Streaming API.
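
As a rough idea of what Streaming-style word count can look like, here is a hedged Python sketch. Hadoop Streaming hands each mapper and reducer its records as lines of text on standard input, with key and value separated by a tab, and the reducer sees its lines sorted by key. The function names and the `run_locally` helper are assumptions made for illustration. In a real job the map and reduce logic would typically live in two separate scripts that read `sys.stdin` and print to standard output.

```python
from itertools import groupby


def map_lines(lines):
    """Mapper logic: one 'word<TAB>1' output line per word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reduce_lines(sorted_lines):
    """Reducer logic: input lines arrive sorted by key, so equal words are adjacent."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


def run_locally(lines):
    """Stand-in for the framework: map, sort by key, then reduce."""
    return list(reduce_lines(sorted(map_lines(lines))))
```

`run_locally` is only there so the two functions can be tried without a cluster. The `sorted` call plays the role of Shuffle and Sort.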

HDFS and Yarn
  • HDFS - Protecting against data loss using replication  

    Let's understand HDFS and its data replication strategy in some detail.

  • HDFS - Name nodes and why they're critical  

    Name nodes provide an index of what file is stored where in the data nodes. If the name node is lost, the mapping of where the files are is lost. That means even though the data is present in the data nodes, we'll have no idea how to access it!

  • HDFS - Checkpointing to backup name node information  

    Hadoop backs up name node information using two strategies: backing up the snapshot of the file system along with the edits made to it, and setting up a secondary name node.

  • Yarn - Basic components  

    The Resource Manager assigns resources to processes based on policies and constraints of the cluster, while the Node Manager manages memory and other resources for a single node. These two form the basic components of Yarn.

  • Yarn - Submitting a job to Yarn  

    What happens under the hood when you submit a job to Yarn? Resource Manager, Container, the Application Master and the Node Manager all work together to run your MapReduce job.

  • Yarn - Plug in scheduling policies  

    The Resource Manager acts as a pure scheduler and allows plugging in different policies to schedule jobs. Understand how the FIFO scheduler, the Capacity scheduler and the Fair scheduler work.

  • Yarn - Configure the scheduler  

    The user has a lot of leeway in configuring how the scheduler works. Let's study some of the options we can specify in the various config files.

MapReduce Customizations For Finer Grained Control
  • Setting up your MapReduce to accept command line arguments  

    The Main class in your MapReduce needs some special set up before it can accept command line arguments.

  • The Tool, ToolRunner and GenericOptionsParser  

    The library classes and interfaces which allow parsing command line arguments. Learn what they are and how to use them.

  • Configuring properties of the Job object  

    The Job object allows you to plug in your own classes to control inputs, outputs and many intermediate steps in the MapReduce.

  • Customizing the Partitioner, Sort Comparator, and Group Comparator  

    Between the Map phase and the Reduce phase lie a whole number of intermediate steps performed by the Hadoop framework. Partitioning, Sorting and Grouping are 3 specific operations and each of these can be customized to fit your problem statement.
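
One way to picture the three hooks is the local Python sketch below. It assumes a composite key of the form `(natural_key, order_field)`. All the names are hypothetical and nothing here is Hadoop API. The real Partitioner and comparators are Java classes plugged into the Job.

```python
from itertools import groupby


def partition(key, num_reducers):
    """Partitioner: route on the natural key only, so related records meet."""
    natural, _ = key
    # Simple deterministic hash for the sketch; any stable hash would do.
    return sum(natural.encode()) % num_reducers


def sort_key(key):
    """Sort comparator: order by natural key, then by the extra field."""
    return key


def group_key(key):
    """Group comparator: one reduce call per natural key."""
    return key[0]


def run(records, num_reducers=2):
    partitions = [[] for _ in range(num_reducers)]
    for key, value in records:
        partitions[partition(key, num_reducers)].append((key, value))
    output = {}
    for part in partitions:
        part.sort(key=lambda record: sort_key(record[0]))
        for natural, group in groupby(part, key=lambda record: group_key(record[0])):
            output[natural] = [value for _, value in group]
    return output
```

Because partitioning and grouping look only at the natural key while sorting looks at the whole key, each reduce call receives its values already ordered by the extra field.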

The Inverted Index, Custom Data Types for Keys, Bigram Counts and Unit Tests!
  • The heart of search engines - The Inverted Index  

    The Inverted Index, which provides a mapping from every word to the pages on which that word occurs, is at the heart of every search engine. This is one of the original use cases for MapReduce.

  • Generating the inverted index using MapReduce  

    It's not real unless it's in code. Generate the inverted index using an MR job.

  • Custom data types for keys - The Writable Interface  

    Understand why we need the Writable and the WritableComparable interface and why the keys in the Mapper output implement these interfaces.

  • Represent a Bigram using a WritableComparable  

    A Bigram is a pair of adjacent words. Use a special data type to represent a Bigram. It needs to be a WritableComparable so that it can be serialized across the network and sorted and merged by Hadoop.

  • MapReduce to count the Bigrams in input text  

    Use the Bigram data type in your MapReduce to produce a count of all Bigrams in the input text file.

  • Setting up your Hadoop project  

    Follow these instructions to set up your Hadoop project. 

  • Test your MapReduce job using MRUnit  

    No code is complete without unit tests. The MRUnit framework uses JUnit to test MapReduce jobs. Write test cases for the Bigram count code.
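
The two data structures in this section can be sketched on a single machine as follows. This is plain Python for illustration only. The course builds them as Java MapReduce jobs with a custom WritableComparable, and the function names and the whitespace tokenization here are assumptions.

```python
from collections import Counter, defaultdict


def inverted_index(docs):
    """docs maps a document id to its text; returns word -> sorted document ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():      # map: emit (word, doc_id)
        for word in text.lower().split():
            index[word].add(doc_id)        # grouping by word stands in for shuffle + reduce
    return {word: sorted(ids) for word, ids in index.items()}


def bigram_counts(text):
    """Count every pair of adjacent words in the text."""
    words = text.lower().split()
    return Counter(zip(words, words[1:]))
```

For example, `bigram_counts("to be or not to be")` counts the bigram `("to", "be")` twice and every other bigram once.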

Input and Output Formats and Customized Partitioning
  • Introducing the File Input Format  

    The Input Format specifies the kind of input data that feeds into the MapReduce. The FileInputFormat is the base class for all inputs which are files.

  • Text And Sequence File Formats  

    The most common kind of files are text files and binary files and Hadoop has built in library classes to represent both of these.

  • Data partitioning using a custom partitioner  

    What if you want to partition on something other than key hashes? Custom partitioners allow you to partition on whatever metric you choose. You just need to write a bit of code.

  • Make the custom partitioner real in code  
  • Total Order Partitioning  

    Total Order Partitioning is a mind-bending concept in Hadoop. It allows you to sort data locally within each partition such that the combined output is in globally sorted order. Sounds confusing? It is a hard concept to wrap one's head around but the results are pretty amazing!

  • Input Sampling, Distribution, Partitioning and configuring these  

    Input sampling samples the input data to produce a key-to-partition mapping. The total order partitioner uses this mapping to partition the data in such a manner that locally sorting the data results in a globally sorted result.

  • Secondary Sort  

    The Hadoop Sort/Merge operation sorts the output keys of the mapper. Here is a neat trick to sort the values for each key as well.
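
The sampling idea behind a total sort can be tried out locally with the Python sketch below. It is a simplified illustration under stated assumptions (in-memory keys, a fixed random seed, hypothetical function names), not how Hadoop's own sampler and partitioner are implemented.

```python
import bisect
import random


def sample_split_points(keys, num_partitions, sample_size=100, seed=0):
    """Pick num_partitions - 1 boundary keys from a random sample of the input."""
    rng = random.Random(seed)
    sample = sorted(rng.sample(keys, min(sample_size, len(keys))))
    if not sample:
        return []
    step = len(sample) / num_partitions
    return [sample[int(i * step)] for i in range(1, num_partitions)]


def total_order_sort(keys, num_partitions=4):
    """Range-partition keys by the sampled boundaries, then sort each partition locally."""
    splits = sample_split_points(keys, num_partitions)
    partitions = [[] for _ in range(num_partitions)]
    for key in keys:
        # Every key below splits[0] lands in partition 0, and so on upward.
        partitions[bisect.bisect_right(splits, key)].append(key)
    return [sorted(partition) for partition in partitions]
```

Every key in one partition is smaller than every key in the next, so reading the locally sorted partitions one after another gives a globally sorted result. A good sample mainly helps keep the partitions about the same size.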

Recommendation Systems using Collaborative Filtering
  • Introduction to Collaborative Filtering  

    At the heart of recommendation systems is a beautifully simple idea called collaborative filtering. If 2 users have a lot in common then the chances are that what one user likes the other will as well. You can recommend the users' likes to each other!

  • Friend recommendations using chained MR jobs  

    Recommend potential friends to all users of a social network. This involves using 2 MapReduce jobs and chaining them in such a way that the output of one MapReduce feeds into the second MapReduce.

  • Get common friends for every pair of users - the first MapReduce  

    The first MapReduce finds the number of common friends for every pair of users. This requires special treatment for users who are already friends.

  • Top 10 friend recommendation for every user - the second MapReduce  

    The second MapReduce takes in the common friends for every pair of users and generates the top 10 friend recommendations for every user of the social network.

    Note there are 2 MR jobs chained together.
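
Here is a compact local sketch of the two chained steps in Python. It assumes friendships are symmetric and held in memory as a dictionary of sets, and the function names are made up for illustration. The course's version is two Java MapReduce jobs.

```python
from collections import Counter, defaultdict
from itertools import combinations


def common_friend_counts(friends):
    """Step 1: count common friends for every pair of users who are not yet friends."""
    counts = Counter()
    for user, their_friends in friends.items():
        # Any two friends of `user` have `user` as a common friend.
        for a, b in combinations(sorted(their_friends), 2):
            counts[(a, b)] += 1
    # Special treatment: drop pairs that are already friends.
    return {(a, b): n for (a, b), n in counts.items() if b not in friends.get(a, set())}


def top_recommendations(pair_counts, n=10):
    """Step 2: for each user, rank candidates by number of common friends."""
    candidates = defaultdict(list)
    for (a, b), count in pair_counts.items():
        candidates[a].append((count, b))
        candidates[b].append((count, a))
    return {
        user: [name for _, name in sorted(found, key=lambda c: (-c[0], c[1]))[:n]]
        for user, found in candidates.items()
    }
```

The output of the first function is exactly the input of the second, which mirrors how the first job's output files feed the second job.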

Hadoop as a Database
  • Structured data in Hadoop  

    How is Hadoop different from a database? Can we leverage the power of Hadoop for structured data?

  • Running an SQL Select with MapReduce  

    Let's see how to implement SQL Select and Where constructs using MapReduce.

  • Running an SQL Group By with MapReduce  

    Select and Where constructs are implemented in the Mapper. Group By and Having constructs are implemented in the Reducer.

  • A MapReduce Join - The Map Side  

    Joins can be surprisingly tricky to implement with MapReduce - Let's see what the Mapper looks like for a Join. 

  • A MapReduce Join - The Reduce Side  

    What should the Reducer do in a Join? 

  • A MapReduce Join - Sorting and Partitioning  

    For the Join to work properly, you'll need to customize how the Sorting and Partitioning occurs in the MapReduce. 

  • A MapReduce Join - Putting it all together  

    We continue with MapReduce joins. Let's put everything together in the Job. 
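
One common shape for a MapReduce join is sketched below in Python: the mapper tags each row with the table it came from, and the reducer pairs up rows that share a join key. This is an assumed, simplified inner join over in-memory dictionaries with hypothetical names. It may differ from the approach taken in the lectures.

```python
from collections import defaultdict


def join_mapper(table_name, row, key_field):
    """Map: emit the join key, with the row tagged by its source table."""
    return row[key_field], (table_name, row)


def join_reducer(tagged_rows):
    """Reduce: combine every left row with every right row for one key."""
    left_rows = [row for tag, row in tagged_rows if tag == "left"]
    right_rows = [row for tag, row in tagged_rows if tag == "right"]
    for left in left_rows:
        for right in right_rows:
            yield {**left, **right}


def reduce_side_join(left_rows, right_rows, key_field):
    groups = defaultdict(list)
    for table_name, rows in (("left", left_rows), ("right", right_rows)):
        for row in rows:
            key, tagged = join_mapper(table_name, row, key_field)
            groups[key].append(tagged)
    joined = []
    for key in sorted(groups):  # stands in for the framework's sort by key
        joined.extend(join_reducer(groups[key]))
    return joined
```

Keys that appear in only one table produce no output, which is the behaviour of an SQL inner join.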

K-Means Clustering
  • What is K-Means Clustering?  

    K-Means Clustering is a popular machine learning algorithm. We start with an intro to Clustering and how the K-Means algorithm works. 

  • A MapReduce job for K-Means Clustering  

    We continue with understanding the K-Means algorithm. We'll break down the algorithm into a MapReduce task. 

  • K-Means Clustering - Measuring the distance between points  

    We'll start describing the code to implement K-Means with MapReduce. First some setup to represent data as points and measure the distance between them.

  • K-Means Clustering - Custom Writables for Input/Output  

    We need to set up a couple of Custom Writables that can be used by MapReduce for the input and output. 

  • K-Means Clustering - Configuring the Job  

    We're finally on to the MapReduce bits. We start by configuring the job and doing some setup that will be needed for the Mapper/Reducer.

  • K-Means Clustering - The Mapper and Reducer  

    The Mapper and Reducer for K-Means run once for each iteration and update the cluster centers. 

  • K-Means Clustering : The Iterative MapReduce Job  

    Finally, we need to set it up so that the Jobs run iteratively and stop when convergence occurs. 
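
The iterative structure can be sketched in a few lines of Python. This is a single-machine illustration, assuming points are tuples of numbers and that convergence means no center moved by more than a small tolerance. The names and the stopping rule are assumptions, and the course's implementation is in Java. It needs Python 3.8 or later for `math.dist`.

```python
import math
from collections import defaultdict


def nearest(point, centers):
    """Index of the center closest to the point (Euclidean distance)."""
    return min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))


def kmeans_iteration(points, centers):
    """One pass: the map step assigns points, the reduce step averages each cluster."""
    clusters = defaultdict(list)
    for point in points:                        # map: emit (center index, point)
        clusters[nearest(point, centers)].append(point)
    updated = list(centers)
    for index, members in clusters.items():     # reduce: new center = mean of members
        updated[index] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return updated


def kmeans(points, centers, tolerance=1e-6, max_iterations=100):
    """Repeat the pass until no center moves more than the tolerance."""
    for _ in range(max_iterations):
        updated = kmeans_iteration(points, centers)
        moved = max(math.dist(old, new) for old, new in zip(centers, updated))
        centers = updated
        if moved <= tolerance:
            break
    return centers
```

Each call to `kmeans_iteration` corresponds to one run of the Mapper and Reducer, and the loop in `kmeans` plays the part of the driver that resubmits the job until the centers settle.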

Setting up a Hadoop Cluster
  • Manually configuring a Hadoop cluster (Linux VMs)  

    Manually configure a Hadoop cluster. You'll use Linux Virtual Machines to do this. Please go through the Virtual Linux Instance setup video (in the Appendix) before this one. 

  • Getting started with Amazon Web Services  

    You can use a cloud service to set up a Hadoop Cluster. This video gets you started with AWS and the EC2 (Elastic Compute Cloud) CLI tools.

  • Start a Hadoop Cluster with Cloudera Manager on AWS  

    Install Cloudera Manager on AWS and use it to launch a Hadoop Cluster.

Appendix
  • Set up a Virtual Linux Instance (For Windows users)  

    Hadoop is basically for Linux/Unix systems. If you are on Windows, you can set up a Linux Virtual Machine on your computer and use that for the install. 

  • [For Linux/Mac OS Shell Newbies] Path and other Environment Variables  

    If you are unfamiliar with software that requires working in a shell/command line environment, this video will be helpful for you. It explains how to update the PATH environment variable, which is needed to set up most Linux/Mac shell-based software. 
