Know-How
Batch Processing
Processing millions of records in one go = out of memory. The fixes: chunking, streaming, and parallelism.
Chunking
# Python — process rows 1000 at a time
def process_in_chunks(query, chunk_size=1000):
    offset = 0
    while True:
        chunk = db.execute(query.limit(chunk_size).offset(offset)).fetchall()
        if not chunk:  # no rows left
            break
        for row in chunk:
            process(row)
        db.commit()  # commit per chunk keeps transactions small
        offset += chunk_size
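One caveat: OFFSET gets slower as it grows, because the database still scans and discards every skipped row. With a monotonically increasing primary key, keyset pagination avoids that. A minimal sketch, assuming a SQLAlchemy session `db`, a `process` function, and an integer `id` column (the column name is an assumption):

# Keyset pagination — sketch, assumes an integer primary key `id`
from sqlalchemy import text

def process_by_keyset(db, chunk_size=1000):
    last_id = 0
    while True:
        rows = db.execute(
            text("SELECT * FROM big_table WHERE id > :last ORDER BY id LIMIT :n"),
            {"last": last_id, "n": chunk_size},
        ).fetchall()
        if not rows:
            break
        for row in rows:
            process(row)
        db.commit()
        last_id = rows[-1].id  # resume after the last processed row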
Server-side cursor (PostgreSQL)
# SQLAlchemy — server-side cursor
from sqlalchemy import text

with engine.connect().execution_options(stream_results=True) as conn:
    result = conn.execute(text("SELECT * FROM big_table"))
    for chunk in result.partitions(1000):  # 1000 rows per round trip
        for row in chunk:
            process(row)
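The same streaming works without SQLAlchemy: psycopg2 creates a server-side cursor whenever the cursor is given a name. A sketch, assuming a plain psycopg2 connection (`dsn` is a placeholder connection string):

# psycopg2 — a named cursor is server-side; rows stream in batches
import psycopg2

conn = psycopg2.connect(dsn)  # dsn is a placeholder
with conn.cursor(name="batch_cursor") as cur:
    cur.itersize = 1000  # rows fetched per network round trip
    cur.execute("SELECT * FROM big_table")
    for row in cur:
        process(row)
conn.close()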
Parallelism
from concurrent.futures import ProcessPoolExecutor

with ProcessPoolExecutor(max_workers=4) as executor:
    futures = [executor.submit(process_chunk, chunk) for chunk in chunks]
    results = [f.result() for f in futures]  # result() re-raises worker exceptions
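The snippet assumes `process_chunk` and `chunks` already exist. A self-contained sketch under those assumptions, where the `sum` stands in for real CPU-bound per-record work and the chunk size is arbitrary:

from concurrent.futures import ProcessPoolExecutor

def process_chunk(chunk):
    # placeholder CPU-bound work; real code would transform each record
    return sum(chunk)

def split(items, size):
    # yield consecutive slices of `size` items
    for i in range(0, len(items), size):
        yield items[i:i + size]

if __name__ == "__main__":  # required with the spawn start method (Windows/macOS)
    ids = list(range(1_000_000))
    chunks = list(split(ids, 10_000))
    with ProcessPoolExecutor(max_workers=4) as executor:
        results = list(executor.map(process_chunk, chunks))
    print(sum(results))

Arguments cross process boundaries by pickling, so pass plain data (ids, tuples) to workers rather than ORM objects or open connections.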
Key takeaway
Chunking for memory efficiency, server-side cursors for streaming, ProcessPoolExecutor for CPU-bound work.