Apache Flume was developed as an Apache Incubator project to let you flow data from a source into your Hadoop environment. In Flume, the entities you work with are called sources, decorators, and sinks. A source can be any data source, and Flume ships with many predefined source adapters. A sink is the target of a specific operation; in Flume, as in other paradigms that use this term, the sink of one operation can be the source for the next downstream operation. A decorator is an operation on the stream that transforms it in some way, for example by compressing or uncompressing data, or by adding or removing pieces of information.

Flume supports a number of different configurations and topologies, so you can choose the right setup for your application. It is a distributed system that runs across multiple machines and can collect large volumes of data from many applications and systems. It includes mechanisms for load balancing and failover, and it can be extended and customized in many ways. In short, Flume is a scalable, reliable, configurable, and extensible system for managing the movement of large volumes of data.
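To make the source, decorator, and sink roles more concrete, here is a minimal conceptual sketch in plain Python. It is not Flume's API or configuration syntax: every function name in it (`list_source`, `tag_decorator`, `gzip_decorator`, `gunzip_decorator`, `collect_sink`, `run_pipeline`) is hypothetical and exists only for illustration. It shows one source feeding a chain of decorators that add information and compress or uncompress the data, with the output of each stage acting as the input of the next.

```python
import gzip
from typing import Iterable, Iterator, List

# Illustrative only: in this sketch an "event" is just a byte string.
Event = bytes


def list_source(lines: Iterable[str]) -> Iterator[Event]:
    """Source: emits one event per input line."""
    for line in lines:
        yield line.encode("utf-8")


def tag_decorator(events: Iterable[Event], host: str) -> Iterator[Event]:
    """Decorator: adds a piece of information (a host tag) to each event."""
    prefix = f"host={host} ".encode("utf-8")
    for event in events:
        yield prefix + event


def gzip_decorator(events: Iterable[Event]) -> Iterator[Event]:
    """Decorator: compresses each event."""
    for event in events:
        yield gzip.compress(event)


def gunzip_decorator(events: Iterable[Event]) -> Iterator[Event]:
    """Decorator: uncompresses each event."""
    for event in events:
        yield gzip.decompress(event)


def collect_sink(events: Iterable[Event]) -> List[Event]:
    """Sink: the target of the flow; here it simply gathers events in a list."""
    return list(events)


def run_pipeline() -> List[Event]:
    """Wire the stages together: each stage consumes the previous stage's output."""
    source = list_source(["login ok", "disk full"])
    tagged = tag_decorator(source, host="web01")
    compressed = gzip_decorator(tagged)
    # A downstream stage can undo what an upstream stage did.
    restored = gunzip_decorator(compressed)
    return collect_sink(restored)


if __name__ == "__main__":
    for event in run_pipeline():
        print(event.decode("utf-8"))
```

Because each stage only consumes an iterable of events and produces another, stages can be reordered or swapped freely, which is the property that lets the sink of one operation serve as the source of the next.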
Recommended reading list:
Hadoop: The Definitive Guide: Storage and Analysis at Internet Scale
Get ready to unlock the power of your data. With the fourth edition of this comprehensive guide, you’ll learn how to build and maintain reliable, scalable, distributed systems with Apache Hadoop. This book is ideal for programmers looking to analyze datasets of any size, and for administrators who want to set up and run Hadoop clusters.
Using Hadoop 2 exclusively, author Tom White presents new chapters on YARN and several Hadoop-related projects such as Parquet, Flume, Crunch, and Spark. You’ll learn about recent changes to Hadoop, and explore new case studies on Hadoop’s role in healthcare systems and genomics data processing.
Learn fundamental components such as MapReduce, HDFS, and YARN
Explore MapReduce in depth, including steps for developing applications with it
Set up and maintain a Hadoop cluster running HDFS and MapReduce on YARN
Learn two data formats: Avro for data serialization and Parquet for nested data
Use data ingestion tools such as Flume (for streaming data) and Sqoop (for bulk data transfer)
Understand how high-level data processing tools like Pig, Hive, Crunch, and Spark work with Hadoop
Learn the HBase distributed database and the ZooKeeper distributed configuration service
Hadoop Application Architectures: Designing Real-World Big Data Applications
Get expert guidance on architecting end-to-end data management solutions with Apache Hadoop. While many sources explain how to use various components in the Hadoop ecosystem, this practical book takes you through architectural considerations necessary to tie those components together into a complete tailored application, based on your particular use case.
To reinforce those lessons, the second section provides detailed examples of architectures used in some of the most commonly found Hadoop applications. Whether designing a new Hadoop application or planning to integrate Hadoop into your existing data infrastructure, Hadoop Application Architectures will skillfully guide you through the process.
This book covers:
Factors to consider when using Hadoop to store and model data
Best practices for moving data in and out of the system
Data processing frameworks, including MapReduce, Spark, and Hive
Common Hadoop processing patterns, such as removing duplicate records and using windowing analytics
Giraph, GraphX, and other tools for large graph processing on Hadoop
Using workflow orchestration and scheduling tools such as Apache Oozie
Near-real-time stream processing with Apache Storm, Apache Spark Streaming, and Apache Flume
Architecture examples for clickstream analysis, fraud detection, and data warehousing
Data Analytics with Hadoop: An Introduction for Data Scientists
Ready to use statistical and machine-learning techniques across large data sets? This practical guide shows you why the Hadoop ecosystem is perfect for the job. Instead of deployment, operations, or software development usually associated with distributed computing, you’ll focus on particular analyses you can build, the data warehousing techniques that Hadoop provides, and higher order data workflows this framework can produce.
Data scientists and analysts will learn how to perform a wide range of techniques, from writing MapReduce and Spark applications with Python to using advanced modeling and data management with Spark MLlib, Hive, and HBase. You’ll also learn about the analytical processes and data systems available to build and empower data products that can handle—and actually require—huge amounts of data.
Understand core concepts behind Hadoop and cluster computing
Use design patterns and parallel analytical algorithms to create distributed data analysis jobs
Learn about data management, mining, and warehousing in a distributed context using Apache Hive and HBase
Use Sqoop and Apache Flume to ingest data from relational databases
Program complex Hadoop and Spark applications with Apache Pig and Spark DataFrames
Perform machine learning techniques such as classification, clustering, and collaborative filtering with Spark’s MLlib
Hadoop: The Definitive Guide
Ready to unlock the power of your data? With this comprehensive guide, you’ll learn how to build and maintain reliable, scalable, distributed systems with Apache Hadoop. This book is ideal for programmers looking to analyze datasets of any size, and for administrators who want to set up and run Hadoop clusters.
You’ll find illuminating case studies that demonstrate how Hadoop is used to solve specific problems. This third edition covers recent changes to Hadoop, including material on the new MapReduce API, as well as MapReduce 2 and its more flexible execution model (YARN).
Store large datasets with the Hadoop Distributed File System (HDFS)
Run distributed computations with MapReduce
Use Hadoop’s data and I/O building blocks for compression, data integrity, serialization (including Avro), and persistence
Discover common pitfalls and advanced features for writing real-world MapReduce programs
Design, build, and administer a dedicated Hadoop cluster—or run Hadoop in the cloud
Load data from relational databases into HDFS, using Sqoop
Perform large-scale data processing with the Pig query language
Analyze datasets with Hive, Hadoop’s data warehousing system
Take advantage of HBase for structured and semi-structured data, and ZooKeeper for building distributed systems
Hadoop 2 Quick-Start Guide: Learn the Essentials of Big Data Computing in the Apache Hadoop 2 Ecosystem (Addison-Wesley Data & Analytics)
With Hadoop 2.x and YARN, Hadoop moves beyond MapReduce to become practical for virtually any type of data processing. Hadoop 2.x and the Data Lake concept represent a radical shift away from conventional approaches to data usage and storage. Hadoop 2.x installations offer unmatched scalability and breakthrough extensibility that supports new and existing Big Data analytics processing methods and models.
Hadoop® 2 Quick-Start Guide is the first easy, accessible guide to Apache Hadoop 2.x, YARN, and the modern Hadoop ecosystem. Building on his unsurpassed experience teaching Hadoop and Big Data, author Douglas Eadline covers all the basics you need to know to install and use Hadoop 2 on personal computers or servers, and to navigate the powerful technologies that complement it.
Eadline concisely introduces and explains every key Hadoop 2 concept, tool, and service, illustrating each with a simple “beginning-to-end” example and identifying trustworthy, up-to-date resources for learning more.
This guide is ideal if you want to learn about Hadoop 2 without getting mired in technical details. Douglas Eadline will bring you up to speed quickly, whether you’re a user, admin, devops specialist, programmer, architect, analyst, or data scientist.
Understanding what Hadoop 2 and YARN do, and how they improve on Hadoop 1 with MapReduce
Understanding Hadoop-based Data Lakes versus RDBMS Data Warehouses
Installing Hadoop 2 and core services on Linux machines, virtualized sandboxes, or clusters
Exploring the Hadoop Distributed File System (HDFS)
Understanding the essentials of MapReduce and YARN application programming
Simplifying programming and data movement with Apache Pig, Hive, Sqoop, Flume, Oozie, and HBase
Observing application progress, controlling jobs, and managing workflows
Managing Hadoop efficiently with Apache Ambari, including recipes for HDFS to NFSv3 gateway, HDFS snapshots, and YARN configuration
Learning basic Hadoop 2 troubleshooting, and installing Apache Hue and Apache Spark