Big Data Computing with Hadoop
taught by Marc Vaisman
In this online course, “Big Data Computing with Hadoop,” analytics professionals will be introduced to Hadoop and Spark and given an exemplar workflow for using Hadoop. They will also be introduced to writing Spark and MapReduce jobs, and to leveraging Hadoop Streaming so that work can be done in an analytics programming language such as Python. In this course you will learn:
- What Hadoop is and how to leverage it to perform analytics
- The software components of the Hadoop Ecosystem
- How to manage data on a distributed file system
- How to use Spark and MapReduce to perform computations with Hadoop
- How to utilize Hadoop Streaming to run jobs written in languages such as Python
Background - "Big Data"
The term “Big Data” has come into vogue to refer not just to data volume, but also to an exciting new set of applications and techniques that are powering modern applications and whose novelty seems to be changing the way the world computes. In most cases, the "end game" is the application of well-known statistical and machine-learning techniques. However, modern distributed computation techniques allow the analysis of data sets far larger than those that could typically be analyzed in the past.
The need for distributed computing arises from a combination of the rapidly increasing data flows generated by organizations and the Internet, and the fact that the huge size of these data sets greatly widens the scope for prediction and analysis. A key milestone was the release of Apache Hadoop in 2005. Hadoop is an open source project based on two seminal papers produced by Google: The Google File System (2003) and MapReduce: Simplified Data Processing on Large Clusters (2004). These two papers describe the two key components that make up Hadoop: a distributed file system and MapReduce functional computations. Today, when someone says “Big Data,” they are probably referring to computation using Hadoop.
Here's an excellent introduction to Spark, the newest component of the Hadoop ecosystem.
WEEK 1: A Distributed Computing Environment
The first week is all about getting to know Hadoop and getting set up to develop MapReduce jobs in a development environment. This task by itself is not particularly easy, but it is crucial to getting started with Hadoop. (A quick setup sanity check is sketched after the list below.)
- Introduce Hadoop, its motivations and core concepts
- Discover HDFS and MapReduce and their roles
- The anatomy of Hadoop: NameNodes, JobTrackers, and DataNodes
- Learn about the other applications in the Hadoop Ecosystem
- Get a development environment set up
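Once the development environment is in place, it helps to confirm that Hadoop is reachable from the VM before writing any jobs. The snippet below is a minimal sanity-check sketch, not course material; it assumes the hadoop binary is on the VM's PATH and a Python 3.7+ interpreter.

    # Sanity check (an assumption, not course material): confirm the hadoop
    # binary is on the PATH and print its version line.
    import subprocess

    result = subprocess.run(["hadoop", "version"],
                            capture_output=True, text=True, check=True)
    print(result.stdout.splitlines()[0])  # e.g. a line like "Hadoop 2.x.y"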
WEEK 2: Working with Hadoop
In week 2, we’ll explore how to use the Hadoop Filesystem to load and manage data. We’ll also learn the data flow of Hadoop jobs and execute some simple, pre-built jobs. (A short example of basic HDFS operations follows the list below.)
- Introduce the Hadoop Filesystem
- Learn how to read and write data to HDFS
- Learn data flow in Hadoop Jobs
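To make the HDFS material concrete, here is a minimal sketch of loading and inspecting data by shelling out to the standard hadoop fs subcommands from Python. The directory and file names are hypothetical placeholders, and the helper assumes Python 3.7+ for subprocess.run's text mode.

    # A minimal HDFS sketch: create a directory, copy a local file in,
    # then list the directory and peek at the data. Paths are hypothetical.
    import subprocess

    def hdfs(*args):
        """Run a `hadoop fs` subcommand and return its stdout as text."""
        out = subprocess.run(["hadoop", "fs"] + list(args),
                             capture_output=True, text=True, check=True)
        return out.stdout

    hdfs("-mkdir", "-p", "/user/student/input")             # make a directory
    hdfs("-put", "shakespeare.txt", "/user/student/input")  # local file -> HDFS
    print(hdfs("-ls", "/user/student/input"))               # list contents
    print(hdfs("-cat", "/user/student/input/shakespeare.txt")[:200])  # peek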
WEEK 3: Computing with MapReduce
We’ll kick week three off with a discussion of MapReduce programming, and write our first MapReduce jobs to execute on our Hadoop cluster. This is where the rubber meets the road, and we’ll use Hadoop Streaming and the language of your choice to develop simple analytics. (A minimal word-count sketch follows the list below.)
- Functional programming with Mappers and Reducers
- A sample MapReduce Algorithm
- Mappers and Reducers in Detail
- Running MapReduce jobs
- Hadoop Streaming
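As a preview, here is a minimal sketch of the canonical word-count job written for Hadoop Streaming: two small Python scripts that read from stdin and write tab-separated key/value pairs to stdout. The names mapper.py and reducer.py are illustrative, not prescribed by the course.

    #!/usr/bin/env python
    # mapper.py -- a minimal Hadoop Streaming mapper for word count.
    # Emits one tab-separated (word, 1) pair per word read from stdin.
    import sys

    for line in sys.stdin:
        for word in line.strip().split():
            print("%s\t%d" % (word.lower(), 1))

    #!/usr/bin/env python
    # reducer.py -- the matching reducer. Hadoop sorts mapper output by key,
    # so all counts for a given word arrive as a contiguous run of lines.
    import sys

    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.strip().split("\t", 1)
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                print("%s\t%d" % (current_word, current_count))
            current_word, current_count = word, int(count)
    if current_word is not None:
        print("%s\t%d" % (current_word, current_count))

You can test the pair without a cluster by simulating the shuffle with a local sort: cat input.txt | python mapper.py | sort | python reducer.py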
WEEK 4: Towards Last Mile Computation
In the last section we’ll discuss how to use Hadoop to transform large data sets into a more manageable computational size. We’ll talk about workflows toward "last mile computation": filtering, searching, and aggregating, as well as writing some more MapReduce/Spark jobs. (A short Spark sketch of this pattern follows the list below.)
- Combiners, partitioners and job chaining
- Design patterns with map and reduce (relevant for both MapReduce and Spark)
- Filtering, Aggregating and Searching
- Data Organization and Workflow Management
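To illustrate the filtering-and-aggregating pattern above in Spark's Python API (the same design applies to MapReduce), here is a hedged sketch; the input path, log format, and app name are hypothetical, not course-specified.

    # Filter-then-aggregate in PySpark: keep only matching records, key them,
    # and sum per key. The input path and log format are hypothetical.
    from pyspark import SparkContext

    sc = SparkContext(appName="FilterAggregate")
    counts = (sc.textFile("hdfs:///user/student/input/access.log")
                .filter(lambda line: " 404 " in line)     # filtering
                .map(lambda line: (line.split()[0], 1))   # key by first field
                .reduceByKey(lambda a, b: a + b))         # aggregating
    for key, n in counts.take(10):
        print(key, n)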
MATERIALS AND HOMEWORK
In addition to assigned readings, this course also has example code, supplemental readings available online, and coding exercises that can be done in either MapReduce or Spark.
Data scientists and statisticians with programming experience who need to deal with large data sets and want to learn about Hadoop's distributed computing capabilities should take Big Data Computing with Hadoop. This course is particularly suited to data scientists who need to access and analyze large amounts of unstructured or semi-structured data that do not fit well into traditional relational databases.
- Command line experience on Linux: managing system processes, finding files, and setting permissions.
- Familiarity with Python or another programming language, in order to leverage Hadoop Streaming to perform computations.
- See the "Software" section below.
This course takes place online at the Institute for 4 weeks. During each course week, you participate at times of your own choosing; there are no set times when you must be online. Course participants will be given access to a private discussion board. In discussions led by the instructor, you can post questions, seek clarification, and interact with your fellow students and the instructor.
At the beginning of each week, you receive the relevant material, in addition to answers to exercises from the previous session. During the week, you are expected to go over the course materials, work through exercises, and submit answers. Discussion among participants is encouraged. The instructor will provide answers and comments, and at the end of the week, you will receive individual feedback on your homework answers.
About 15 hours per week, at times of your choosing.
Students come to the Institute for a variety of reasons. As you begin the course, you will be asked to specify your category:
- No credit - You may be interested only in learning the material presented, and not be concerned with grades or a record of completion.
- Certificate - You may be enrolled in PASS (Programs in Analytics and Statistical Studies) that requires demonstration of proficiency in the subject, in which case your work will be assessed for a grade.
- CEUs and/or proof of completion - You may require a "Record of Course Completion," along with professional development credit in the form of Continuing Education Units (CEUs). For those successfully completing the course, CEUs and a record of course completion will be issued by The Institute upon request.
- Other options - Statistics.com Specializations, INFORMS CAP recognition, and academic (college) credit are available for some Statistics.com courses
Please read the aforementioned papers produced by Google: The Google File System (2003) and MapReduce: Simplified Data Processing on Large Clusters (2004).
Recommended text: Hadoop: The Definitive Guide, 3rd ed., by Tom White (O'Reilly Media). Optional readings will be assigned from this reference.
Required readings will be provided as PDF documents in the course.
Before the course starts we recommend that you:
1. Install virtualization software so you can run the VM in the course. We recommend VirtualBox, which is free; VMware or Parallels are also possible. Our technology supervisor, Dr. Stan Blank, will monitor a discussion board in our Learning Management System 4 days prior to the course start to provide assistance.
2. Download the pre-configured Virtual Machine (VM) that will be used in the course. Note that you will receive a preview error message. This is OK. Click on the download button below the message.
3. Using the downloaded VM, which includes Linux, brush up on your command-line Linux. If you really need to re-learn Linux, or learn it in the first place, you should allow several weeks to do this on your own before the course starts. For help, see The Command Line Crash Course.
4. Make sure you have Python available and a text editor such as Sublime Text to write your code. Once the course opens, we will be working with Python to execute jobs via Hadoop Streaming. There are several frameworks available to assist in writing Hadoop jobs in Python, which will be discussed during the course.
If you know Java or other languages ...
- Those with experience in Java may use the native Java API to implement MapReduce jobs, but the class will focus on Hadoop Streaming. To access more advanced functionality you’ll need a tool to develop and compile Java. The best known are Eclipse and NetBeans, along with the popular professional IDE IntelliJ IDEA. For this course, however, this is completely optional.
- For the programming work, any programming language that accepts data from stdin and writes to stdout (R, Ruby, Perl, etc.) can be used, but all examples and pseudo-code will be in Python. (A sketch of launching a streaming job appears below.)
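For reference, launching the Week 3 mapper and reducer as a streaming job comes down to invoking the hadoop-streaming jar. The sketch below drives it from Python; the jar path is an assumption that varies between Hadoop distributions, and the HDFS paths are hypothetical.

    # A hedged sketch of launching a Hadoop Streaming job from Python.
    # The jar path is an assumption; locate the one in your distribution.
    import subprocess

    subprocess.run([
        "hadoop", "jar", "/usr/lib/hadoop-mapreduce/hadoop-streaming.jar",
        "-files", "mapper.py,reducer.py",   # ship the scripts to the nodes
        "-mapper", "mapper.py",
        "-reducer", "reducer.py",
        "-input", "/user/student/input",
        "-output", "/user/student/output",  # output dir must not already exist
    ], check=True)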
If you prefer to set up your own VM ...
- If you prefer not to download the pre-configured VM, instructions in the course will describe how to set up an Ubuntu x64 virtual machine.
To be scheduled.
Course Fee: $549
Do you meet course prerequisites? What about book & software? (Click here to learn more)
Group rates: Click here to get information on group rates.
First time student or academic? Click here for an introductory offer on select courses. Academic affiliation? You may be eligible for a discount at checkout.
Add a $50 service fee if you require a prior invoice, need to submit a purchase order or voucher, pay by wire transfer or EFT, or need to refund and reprocess a prior payment. Please use this printed registration form for these and other special orders.
Courses may fill up at any time and registrations are processed in the order in which they are received. Your registration will be confirmed for the first available course date, unless you specify otherwise.
The Institute for Statistics Education is certified to operate by the State Council of Higher Education in Virginia (SCHEV).