BD|CESGA

What?

A quick overview of some of the ready-to-use services available.

HDFS

Java-based file system that provides scalable and reliable data storage, designed to span large clusters of commodity servers.

YARN

Allows multiple data processing engines such as interactive SQL, real-time streaming, data science and batch processing to handle data stored in a single platform, unlocking an entirely new approach to analytics.

MapReduce

Software framework for easily writing applications which process vast amounts of data (multi-terabyte data-sets) in-parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.
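The map, shuffle and reduce phases can be illustrated with a minimal word count in plain Python. This is only a single-process sketch of the programming model, not the Hadoop API: the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) are hypothetical, and a real job would run these phases in parallel across the cluster.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Emit a (word, 1) pair for every word in one line of input."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values emitted for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence on an in-memory list of lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    sample = ["big data is big", "data is everywhere"]
    print(word_count(sample))
```

Because each line is mapped independently and each key is reduced independently, both phases can be spread over many nodes, which is what lets the model scale.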

Spark

Fast and general engine for big data processing, with built-in modules for streaming, SQL, machine learning and graph processing.

Hive

Data warehouse software that facilitates reading, writing, and managing large datasets residing in distributed storage using SQL.

Sqoop

A tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.

Jupyter

A web application that allows you to create and share documents that contain live code, equations, visualizations and explanatory text.

Zeppelin (coming soon)

Web-based notebook that enables interactive data analytics. You can make beautiful data-driven, interactive and collaborative documents with SQL, Scala and more.

HUE

Web interface for analyzing data with Apache Hadoop. Hue applications let you browse HDFS, manage a Hive metastore, and run Hive and Cloudera Impala queries, HBase and Sqoop commands, Pig scripts, MapReduce jobs, and Oozie workflows.

HBase

Hadoop database, a distributed, scalable, big data store.

Oozie

Workflow scheduler system to manage Apache Hadoop jobs.
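A workflow here is a set of jobs with dependencies between them, which the scheduler runs in a valid order. The sketch below shows that idea in plain Python using the standard-library `graphlib` module (Python 3.9+). It is not Oozie's interface (Oozie workflows are defined in XML), and the job names `ingest`, `transform` and `report` are hypothetical.

```python
from graphlib import TopologicalSorter

# Each job maps to the set of jobs that must finish before it can start.
# These job names are made up for illustration.
workflow = {
    "ingest": set(),
    "transform": {"ingest"},
    "report": {"transform"},
}


def execution_order(dependencies):
    """Return the jobs in an order that respects every dependency."""
    return list(TopologicalSorter(dependencies).static_order())


run_order = execution_order(workflow)

if __name__ == "__main__":
    for job in run_order:
        print("running", job)
```

A cyclic workflow has no valid order; `TopologicalSorter` raises `graphlib.CycleError` in that case.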

Pig

A platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs.

Storm

A distributed real-time computation system for processing large volumes of high-velocity data.

Kafka

A unified, high-throughput, low-latency platform for handling real-time data feeds.

Flume

A distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data.

Tez

An extensible framework for building high performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop.

ZooKeeper

A centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.

Mahout

Environment for quickly creating scalable, performant machine learning applications.

Slider

An application to deploy existing distributed applications on an Apache Hadoop YARN cluster, monitor them, and make them larger or smaller as desired, even while the application is running.

Falcon

A framework to simplify data pipeline processing and management on Hadoop clusters.

Atlas

Scalable and extensible set of core foundational governance services that enable enterprises to meet their compliance requirements within Hadoop effectively and efficiently, and that allow integration with the whole enterprise data ecosystem.

Mesos

A cluster manager that simplifies running applications on a scalable cluster of servers, and the heart of the Mesosphere system.

Marathon

A container orchestration platform for Mesos and DC/OS.

Consul

A tool for discovering and configuring services in our Big Data infrastructure.

Cassandra

A distributed database for managing large amounts of structured data across many commodity servers, while providing highly available service and no single point of failure.

MongoDB

A document-oriented database that is very easy to use.

PostgreSQL

A popular SQL database. Because not everything has to be NoSQL.

GlusterFS

A distributed-replicated file system.

SLURM (coming soon)

A job scheduler for Linux and Unix-like kernels, used by many of the world's supercomputers and computer clusters.

CDH

An Apache Hadoop distribution by Cloudera.

Providing quick access to ready-to-use Big Data solutions.

Because Big Data doesn't have to be complicated.

Why?

Easy

Fast

Data Transfer

Free
