The term Big Data is everywhere in 2012 — at conferences, in the media, in vendor pitches. But what does it actually mean for a Czech company that processes tens to hundreds of gigabytes of data per day? Apache Hadoop may be the answer.
What Big Data Is and When You Actually Need It
Gartner defines Big Data using three Vs: Volume, Velocity, and Variety. If your data meets at least two of these criteria and your existing relational database is no longer keeping up, it’s time to look at alternatives.
Typical scenarios where Hadoop pays off:
- Log analysis — web servers, application servers, and network devices generate gigabytes of logs daily. SQL queries over a table of that size take hours.
- ETL for a data warehouse — transforming and cleansing data before loading it into an Oracle or SQL Server data warehouse
- Customer behavior analysis — clickstream data from e-shops, telecom CDR records
- Archiving and search — old documents, emails, scans — unstructured data where full-text search isn’t enough
If you’re processing less than 10 GB of data and your queries run in a reasonable time, you probably don’t need Hadoop. A relational database with good indexes and materialized views is more efficient for smaller volumes.
Hadoop Cluster Architecture
Hadoop consists of two key components:
HDFS (Hadoop Distributed File System) — a distributed file system that replicates data across multiple nodes (the default replication factor is 3). Data is split into blocks of 64 MB (in Hadoop 2.x typically 128 MB) and distributed across the cluster.
MapReduce — a computation framework that processes data in parallel on all nodes where the data resides. Instead of moving data to the computation, the computation is moved to the data — that is the key principle.
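To make the programming model concrete, here is a minimal MapReduce job in Java (the org.apache.hadoop.mapreduce API) that counts requests per HTTP status code in tab-delimited access logs. The class names, field positions, and paths are illustrative assumptions on our part, not part of any standard layout; treat it as a sketch rather than production code.

// Minimal MapReduce sketch: counts requests per HTTP status code
// in tab-delimited access logs. Field positions and paths are assumptions.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class StatusCount {

    // Mapper runs on the nodes that hold the input blocks: emits (status, 1) per line.
    public static class StatusMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text status = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split("\t");
            if (fields.length > 4) {          // assumed layout: status code is the 5th field
                status.set(fields[4]);
                context.write(status, ONE);
            }
        }
    }

    // Reducer sums the partial counts for each status code.
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "status-count");   // constructor works on Hadoop 1.x and 2.x (CDH 4)
        job.setJarByClass(StatusCount.class);
        job.setMapperClass(StatusMapper.class);
        job.setCombinerClass(SumReducer.class);     // pre-aggregate on each node before the shuffle
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. /data/logs/access
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory must not exist yet
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Each mapper runs next to a block of the input file, so only the small (status, count) pairs travel across the network, which is the "move the computation to the data" principle in practice.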
Minimum production cluster for enterprise use:
- NameNode — 1 server, 64 GB RAM, RAID 1, the HDFS control node
- Secondary NameNode — 1 server; periodically merges the fsimage with the edit log (checkpointing). Despite the name it is not a hot standby, so the NameNode metadata still needs backing up.
- DataNode / TaskTracker — minimum 4 servers, each with 32 GB RAM, 4–12 drives without RAID (HDFS handles replication itself), 8+ CPU cores
- Edge node — 1 server for client access, Hive, Pig, data import/export
Distributions: Cloudera vs. Apache vs. Hortonworks
You can run plain Apache Hadoop on its own, but for enterprise deployments we recommend one of the vendor distributions:
Cloudera CDH 4 — the most widely adopted enterprise distribution. Includes Hadoop, Hive, HBase, Pig, Oozie, and Cloudera Manager for cluster management. Commercial support and certification. We recommend this option for most of our clients.
Hortonworks HDP 1.x — a fully open-source distribution. No proprietary management tool (uses Apache Ambari). Suitable for companies with their own Hadoop expertise.
MapR — replaces HDFS with its own high-performance file system. An interesting option for low latency, but introduces vendor lock-in.
Hive: SQL Over Hadoop
For analytical queries over data in HDFS, Apache Hive is the ideal choice. Hive allows you to write queries in an SQL-like language (HiveQL) that are internally converted to MapReduce jobs.
-- Example: web access log analysis, top 100 URLs by successful requests
CREATE EXTERNAL TABLE access_log (
ip STRING,
request_time STRING,
method STRING,
url STRING,
status INT,
bytes BIGINT
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/logs/access/';
SELECT url, COUNT(*) as hits, SUM(bytes) as total_bytes
FROM access_log
WHERE status = 200
GROUP BY url
ORDER BY hits DESC
LIMIT 100;
This query processes terabytes of logs in parallel across the entire cluster. On a relational database it would take hours — on a Hadoop cluster with 8 nodes, it takes minutes.
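When results need to flow into existing Java applications, Hive can also be queried over standard JDBC from the edge node. The sketch below assumes the original HiveServer that ships with CDH 4 listening on port 10000; the hostname and credentials are placeholders, and the table is the one defined above.

// Sketch: querying Hive over JDBC from the edge node.
// Assumes the original HiveServer (CDH 4) on edge-node:10000; adjust to your setup.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class TopUrls {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
        Connection conn = DriverManager.getConnection(
                "jdbc:hive://edge-node:10000/default", "", "");
        Statement stmt = conn.createStatement();
        ResultSet rs = stmt.executeQuery(
                "SELECT url, COUNT(*) AS hits FROM access_log "
                + "WHERE status = 200 GROUP BY url ORDER BY hits DESC LIMIT 10");
        while (rs.next()) {
            System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
        }
        rs.close();
        stmt.close();
        conn.close();
    }
}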
Integration with the Existing Ecosystem
Hadoop is not a replacement for Oracle or SQL Server. It’s a complement. A typical workflow:
- Sqoop — import data from a relational database into HDFS
- MapReduce / Hive — transformation and aggregation
- Sqoop export — results back into the relational database for BI tools
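Most teams drive the Sqoop import from a shell script or an Oozie workflow, but it can also be embedded in a Java scheduler through the Sqoop class's runTool entry point; the flags mirror the CLI. The connection string, credentials, table, and target directory below are purely illustrative assumptions, so adjust them to your source database.

// Sketch: embedding a Sqoop 1.4 import in Java instead of calling the CLI.
// Connection string, credentials, table and target directory are illustrative.
import org.apache.sqoop.Sqoop;

public class NightlyImport {
    public static void main(String[] args) {
        String[] sqoopArgs = {
            "import",
            "--connect", "jdbc:oracle:thin:@db-host:1521:ORCL",
            "--username", "etl_reader",
            "--password", "change-me",      // for production, prefer a protected password file
            "--table", "ORDERS",
            "--target-dir", "/data/staging/orders",
            "--num-mappers", "4"            // 4 parallel map tasks reading from the source DB
        };
        int ret = Sqoop.runTool(sqoopArgs); // same exit code the sqoop CLI would return
        System.exit(ret);
    }
}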
For real-time data collection (logs, events) we use Apache Flume, which streams data directly into HDFS. For messaging between systems, Apache Kafka is appropriate (a relatively new project from LinkedIn, but already stable).
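For applications that want to push events into Flume directly, Flume NG provides a small client SDK. The sketch below assumes an agent with an Avro source listening on flume-host:41414; the host, port, and payload are our own assumptions.

// Sketch: an application pushing events to a Flume agent over Avro RPC.
// Assumes a Flume 1.x agent with an Avro source on flume-host:41414; values are illustrative.
import java.nio.charset.Charset;

import org.apache.flume.Event;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.api.RpcClient;
import org.apache.flume.api.RpcClientFactory;
import org.apache.flume.event.EventBuilder;

public class FlumeSender {
    public static void main(String[] args) throws EventDeliveryException {
        RpcClient client = RpcClientFactory.getDefaultInstance("flume-host", 41414);
        try {
            Event event = EventBuilder.withBody(
                    "order-created id=12345", Charset.forName("UTF-8"));
            client.append(event);   // the agent's sink then writes the event to HDFS
        } finally {
            client.close();
        }
    }
}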
Operational Aspects
A Hadoop cluster requires a different operational approach than a traditional application server:
- Monitoring — Ganglia or Nagios with Hadoop plugins. Monitor HDFS capacity, live DataNode count, the MapReduce queue, and failed jobs.
- Backup — HDFS has built-in replication, but NameNode metadata is a single point of failure. Back up the fsimage and the edit log.
- Capacity planning — data in HDFS grows. Plan for 30–50 percent free space. Adding DataNodes is straightforward; run the balancer tool afterwards to redistribute existing blocks across the new nodes.
- Security — out of the box, Hadoop does not really authenticate users; it trusts the username the client reports. For enterprise deployments, enable Kerberos integration.
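Alongside Ganglia or Nagios it can be handy to poll HDFS directly from a small utility, for example as an extra check script. The sketch below uses the public FileSystem API and expects the cluster configuration files on the classpath; the 70 percent threshold, class name, and output format are our own assumptions.

// Sketch: a simple HDFS capacity check, run from a node that has
// core-site.xml and hdfs-site.xml on its classpath. Threshold and output are illustrative.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FsStatus;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.DatanodeInfo;

public class HdfsCapacityCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Overall cluster capacity and usage as seen by the NameNode.
        FsStatus status = fs.getStatus();
        double usedPct = 100.0 * status.getUsed() / status.getCapacity();
        System.out.printf("HDFS used: %.1f %% of %d GB%n",
                usedPct, status.getCapacity() / (1024L * 1024 * 1024));

        // Warn when free space drops below the 30-50 percent planning margin.
        if (usedPct > 70.0) {
            System.out.println("WARNING: less than 30 % of HDFS capacity remaining");
        }

        // Number of DataNodes the NameNode currently reports.
        if (fs instanceof DistributedFileSystem) {
            DatanodeInfo[] nodes = ((DistributedFileSystem) fs).getDataNodeStats();
            System.out.println("DataNodes reported by the NameNode: " + nodes.length);
        }
        fs.close();
    }
}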
Costs and ROI
Hadoop runs on commodity hardware — that is its main economic advantage. A cluster with 8 DataNodes on standard 2U servers costs roughly 1–2 million CZK including disks. Comparable performance on a commercial MPP database (Teradata, Netezza) costs several times more.
ROI typically materializes in these areas:
- Faster ETL processes — from hours to minutes
- Analyses that were previously impossible (full-text across millions of documents)
- Long-term data archiving at a fraction of SAN storage cost
- Offloading load from the production database
Summary
Hadoop is not a silver bullet, but for the right use cases it delivers dramatic improvements. Start with one concrete problem — log analysis or ETL offload — and scale the cluster according to real needs. Cloudera CDH 4 is a solid choice for Czech enterprise environments with available support. The key is having at least one person on the team who understands Hadoop — whether internal or external.
Need help with implementation?
Our experts can help with design, implementation, and operations. From architecture to production.
Contact us