Apache Spark — Distributed Batch Processing of Big Data

04. 06. 2022 Updated: 28. 03. 2026 1 min read intermediate
This article was published in 2022. Some information may be outdated.

Apache Spark is the most widely used engine for big data processing. It keeps intermediate data in memory across a cluster and exposes the PySpark and Spark SQL APIs for both ETL and analytics.

Spark — Basics

Distributed processing — a dataset is split into partitions spread across cluster nodes, and transformations run on each partition in parallel.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, sum, year, count

spark = SparkSession.builder.appName("Analytics").getOrCreate()

orders = spark.read.parquet("s3://lake/orders/")
customers = spark.read.parquet("s3://lake/customers/")

revenue = (
    orders.filter(col("status") == "completed")
    .join(customers, "customer_id")
    .groupBy(year("order_date").alias("year"), "segment")
    .agg(sum("total_czk").alias("revenue"), count("*").alias("orders"))
)

(
    revenue.write.format("delta")
    .mode("overwrite")
    .save("s3://lake/marts/revenue/")
)

Optimization

  • Partitioning — partition data by the columns most often used in filters, so irrelevant files can be skipped
  • Caching — cache datasets that are reused across multiple actions
  • Broadcast join — ship small lookup tables to every executor, avoiding a shuffle
  • AQE — Adaptive Query Execution re-optimizes plans at runtime, enabled by default since Spark 3.2

Summary

Spark is the standard for distributed processing. PySpark and Spark SQL cover both ETL and analytics.

apache spark, batch processing, big data, pyspark

CORE SYSTEMS team

We build core systems and AI agents that keep operations running. 15 years of experience with enterprise IT.