Know-How
Batch Processing
Processing millions of records in one go = out of memory. The fixes: chunking, streaming, and parallelism.
Chunking
# Python — process rows 1000 at a time
def process_in_chunks(query, chunk_size=1000):
    offset = 0
    while True:
        chunk = db.execute(query.limit(chunk_size).offset(offset)).fetchall()
        if not chunk:  # no rows left
            break
        for row in chunk:
            process(row)
        db.commit()  # commit per chunk keeps transactions small
        offset += chunk_size
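One caveat: OFFSET gets slower as it grows, because the database still scans and discards every skipped row. With a monotonically increasing primary key, keyset pagination avoids that. A minimal sketch, assuming a SQLAlchemy session `db`, a `process` function, and an integer `id` column (the column name is an assumption):

# Keyset pagination — sketch, assumes an integer primary key `id`
from sqlalchemy import text

def process_by_keyset(db, chunk_size=1000):
    last_id = 0
    while True:
        rows = db.execute(
            text("SELECT * FROM big_table WHERE id > :last ORDER BY id LIMIT :n"),
            {"last": last_id, "n": chunk_size},
        ).fetchall()
        if not rows:
            break
        for row in rows:
            process(row)
        db.commit()
        last_id = rows[-1].id  # resume after the last processed row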
Server-side cursor (PostgreSQL)
# SQLAlchemy — server-side cursor
from sqlalchemy import text

with engine.connect().execution_options(stream_results=True) as conn:
    result = conn.execute(text("SELECT * FROM big_table"))
    for chunk in result.partitions(1000):  # 1000 rows per round trip
        for row in chunk:
            process(row)
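The same streaming works without SQLAlchemy: psycopg2 creates a server-side cursor whenever the cursor is given a name. A sketch, assuming a plain psycopg2 connection (`dsn` is a placeholder connection string):

# psycopg2 — a named cursor is server-side; rows stream in batches
import psycopg2

conn = psycopg2.connect(dsn)  # dsn is a placeholder
with conn.cursor(name="batch_cursor") as cur:
    cur.itersize = 1000  # rows fetched per network round trip
    cur.execute("SELECT * FROM big_table")
    for row in cur:
        process(row)
conn.close()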
Parallelism
from concurrent.futures import ProcessPoolExecutor

with ProcessPoolExecutor(max_workers=4) as executor:
    futures = [executor.submit(process_chunk, chunk) for chunk in chunks]
    results = [f.result() for f in futures]  # result() re-raises worker exceptions
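The snippet assumes `process_chunk` and `chunks` already exist. A self-contained sketch under those assumptions, where the `sum` stands in for real CPU-bound per-record work and the chunk size is arbitrary:

from concurrent.futures import ProcessPoolExecutor

def process_chunk(chunk):
    # placeholder CPU-bound work; real code would transform each record
    return sum(chunk)

def split(items, size):
    # yield consecutive slices of `size` items
    for i in range(0, len(items), size):
        yield items[i:i + size]

if __name__ == "__main__":  # required with the spawn start method (Windows/macOS)
    ids = list(range(1_000_000))
    chunks = list(split(ids, 10_000))
    with ProcessPoolExecutor(max_workers=4) as executor:
        results = list(executor.map(process_chunk, chunks))
    print(sum(results))

Arguments cross process boundaries by pickling, so pass plain data (ids, tuples) to workers rather than ORM objects or open connections.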
Key takeaway
Chunking for memory efficiency, server-side cursors for streaming, ProcessPoolExecutor for CPU-bound work.