Apache Spark — Distributed Batch Processing of Big Data

04. 06. 2022 Updated: 28. 03. 2026 1 min read intermediate
This article was published in 2022. Some information may be outdated.

Apache Spark is the most widely used engine for big data processing. It keeps intermediate data in memory across a cluster and exposes the PySpark and Spark SQL APIs for both ETL and analytics.

Spark — Basics

Distributed processing — a dataset is split into partitions spread across cluster nodes, and transformations run on each partition in parallel.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, sum, year, count

spark = SparkSession.builder.appName("Analytics").getOrCreate()

orders = spark.read.parquet("s3://lake/orders/")
customers = spark.read.parquet("s3://lake/customers/")

revenue = (
    orders.filter(col("status") == "completed")
    .join(customers, "customer_id")
    .groupBy(year("order_date").alias("year"), "segment")
    .agg(sum("total_czk").alias("revenue"), count("*").alias("orders"))
)

(
    revenue.write.format("delta")
    .mode("overwrite")
    .save("s3://lake/marts/revenue/")
)

Optimization

  • Partitioning — partition data by the columns most often used in filters, so irrelevant files can be skipped
  • Caching — cache datasets that are reused across multiple actions
  • Broadcast join — ship small lookup tables to every executor, avoiding a shuffle
  • AQE — Adaptive Query Execution re-optimizes plans at runtime, enabled by default since Spark 3.2

Summary

Spark is the standard for distributed processing. PySpark and Spark SQL cover both ETL and analytics.

apache spark, batch processing, big data, pyspark

CORE SYSTEMS team

We build core systems and AI agents that keep operations running. 15 years of experience with enterprise IT.