“Big Data” — a phrase that in 2014 resonates at every conference, in every boardroom, and in every sales pitch. But between the marketing buzzword and a real enterprise deployment, there’s a chasm. We crossed it. Here’s what we learned.
Why Hadoop at All?¶
Our client — a large Czech insurance company — had a problem it knew well: terabytes of logs from web applications, ten years of transaction data, call-centre records, and data from branch offices. All of it sat in an Oracle Data Warehouse that cost more per month than the entire IT team. Analytical queries ran for hours. And new analysts wanted to work with data in ways a relational database simply wasn’t built for.
Apache Hadoop offered an attractive alternative: a distributed file system (HDFS) running on commodity hardware, a MapReduce computation framework for parallel processing, and a growing ecosystem of tools on top. Cost per terabyte of storage: a fraction of an Oracle license.
Architecture — What It Looks Like in Practice¶
We deployed Cloudera CDH 5 on a 12-server cluster. Nothing exotic — standard rack servers with 64 GB RAM, 12 SATA disks, and a gigabit network. Total raw storage: approximately 500 TB. After HDFS triple replication, about 160 TB of usable space.
# Basic cluster architecture
NameNode (2x HA) → metadata, HDFS namespace
DataNode (10x) → block storage
ResourceManager → YARN, MapReduce job scheduling
Hive (HiveServer2 + Metastore) → SQL-like access to data via HiveQL
Oozie → workflow orchestration
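The raw-versus-usable capacity numbers above are easy to verify straight from the cluster; these are standard HDFS commands, nothing project-specific:

# Configured vs. used capacity, per-DataNode breakdown
hdfs dfsadmin -report
# Overall block health, including counts of under-replicated blocks
hdfs fsck /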
The key decision was to use Hive as the primary analytics tool. Analysts knew SQL — they didn’t want and didn’t need to write Java MapReduce jobs. HiveQL gave them a familiar interface over unfamiliar infrastructure.
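In practice that means exposing files already sitting in HDFS as a table, which analysts then query like any other SQL table. A minimal sketch, with a made-up table name, columns, and path:

# Hypothetical example: raw web logs already in HDFS exposed as a partitioned Hive table
hive -e "
CREATE EXTERNAL TABLE IF NOT EXISTS weblogs (
  ts       STRING,
  user_id  STRING,
  url      STRING,
  status   INT
)
PARTITIONED BY (dt STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/data/raw/weblogs';
ALTER TABLE weblogs ADD IF NOT EXISTS PARTITION (dt='2014-06-01');
"
# The ALTER registers one daily partition after the nightly ingest lands in HDFS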
ETL Pipeline — Getting Data In¶
The biggest challenge wasn’t getting Hadoop running. It was getting data from dozens of different sources into HDFS in a sensible format. Our ETL pipeline looked like this:
- Apache Sqoop — importing data from Oracle into HDFS in Avro format
- Flume — real-time ingestion of web logs
- Custom Java jobs — data transformation and cleansing
- Oozie coordinator — daily and hourly workflows
A Sqoop full import of a table with 200 million rows took 45 minutes. An incremental import of the last 24 hours took less than a minute. Compared to the previous ETL in Informatica, it was an order-of-magnitude improvement.
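For illustration, a full and an incremental import of this kind might look roughly like the following; the connection string, table, and column names are made up, and the real jobs carried a few more options:

# Hypothetical full import of one table from Oracle into HDFS as Avro
sqoop import \
  --connect jdbc:oracle:thin:@//dbhost:1521/DWH \
  --username etl_user --password-file /user/etl/.oracle_pw \
  --table CLAIMS \
  --as-avrodatafile \
  --target-dir /data/raw/claims \
  --num-mappers 8

# Incremental variant: only rows modified since the last run
sqoop import \
  --connect jdbc:oracle:thin:@//dbhost:1521/DWH \
  --username etl_user --password-file /user/etl/.oracle_pw \
  --table CLAIMS \
  --as-avrodatafile \
  --target-dir /data/raw/claims \
  --incremental lastmodified \
  --check-column UPDATED_AT \
  --last-value "2014-06-01 00:00:00" \
  --append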
What Works Great¶
Ad-hoc analytics. An analyst writes a HiveQL query, submits it, and in a few minutes has results from the entire data history. No waiting for IT, no requesting new dimensions in the data warehouse.
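A sketch of such a query, with a hypothetical claims table standing in for the real schema:

# Ad-hoc analysis over the full history: paid claims per product line and year
hive -e "
SELECT product_line,
       YEAR(claim_date)  AS yr,
       COUNT(*)          AS claim_cnt,
       SUM(paid_amount)  AS total_paid
FROM   claims
WHERE  status = 'PAID'
GROUP  BY product_line, YEAR(claim_date)
ORDER  BY yr, product_line;
"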
Scalability. Need more space? Add a DataNode. Hadoop handles rebalancing on its own. No downtime, no migrations. A luxury we weren’t used to with Oracle.
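The routine after adding a node is short; a sketch using the standard HDFS balancer (thresholds are whatever fits your cluster):

# Hypothetical follow-up after commissioning a new DataNode (Cloudera Manager handles the commissioning itself)
sudo -u hdfs hdfs balancer -threshold 10   # rebalance until nodes sit within 10 % of average utilisation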
Fault tolerance. One server out of twelve went down. The cluster continued as if nothing happened. Data was on two other replicas. Automatic re-replication ran in the background. With Oracle RAC, this would have been a night-time incident.
What Hurts¶
Latency. Hadoop is not for interactive queries. A MapReduce job has a startup overhead of 30–60 seconds. Even a simple SELECT COUNT(*) takes a minute. Not suitable for a real-time dashboard. We’re watching Apache Spark and Impala — both promise order-of-magnitude improvements, but they’re still too fresh for production.
Operational complexity. A Hadoop cluster is not “set it and forget it”. HDFS balancing, YARN tuning, Hive optimization, security (Kerberos!) — all of this requires an experienced ops team. And there aren’t many of those in the Czech Republic yet.
Java ecosystem. Everything is Java. Classpath hell, JAR versioning, Hadoop version vs. Hive version vs. Sqoop version — dependency management is a nightmare. Maven doesn’t help when every component pulls a different version of Guava.
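One concrete way to see the problem, assuming a Maven build (the include filter below is just the Guava coordinates):

# List every version of Guava the build resolves, including those omitted due to conflicts
mvn dependency:tree -Dverbose -Dincludes=com.google.guava:guava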
Lessons for Enterprise¶
After six months of operation, we took away several key lessons:
- Hadoop doesn’t replace a data warehouse — it complements it. OLTP stays in Oracle, cold data and exploration go to Hadoop.
- Investing in data governance from the start pays off. Without a catalog and lineage, you’ll get lost in petabytes of data.
- Cloudera Manager (or Ambari) is a necessity. Manual cluster management is a path to hell.
- Training the analytics team is half the battle. Technology without users is expensive furniture.
- Don’t underestimate the network infrastructure. MapReduce shuffle generates enormous network traffic.
Numbers That Convinced Management¶
TCO for the first year of the Hadoop cluster (hardware + Cloudera license + internal ops): approximately one-third of the annual Oracle DWH costs. With triple the data capacity. The analytics team processed more ad-hoc analyses in six months than in the previous two years combined. ROI calculated itself.
Big Data Is More Than a Buzzword¶
Hadoop works in enterprise — but it requires realistic expectations, investment in people, and a pragmatic approach to architecture. Don’t replace Oracle on principle. Use Hadoop where Oracle falls short. And above all: start with a concrete business problem, not the technology.