pymc.variational.shuffle_buffer

pymc.variational.shuffle_buffer(chunk_source, *, buffer_size, batch_size, seed=None)

Wrap a chunk source into a shuffled, fixed-size batch source.

Accumulates rows from chunk_source into a buffer of at least buffer_size rows, shuffles it, and yields slices of batch_size rows. Rows that do not fill a final batch are carried over into the next buffer rather than dropped. Once the source is exhausted, the single trailing partial batch (fewer than batch_size rows) is dropped. This approximates i.i.d. minibatches from an unordered or pre-shuffled stream.
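The fill, shuffle, slice, and carry-over cycle described above can be sketched in plain NumPy. This is an illustration of the documented behaviour only, not the library's implementation: the name shuffle_buffer_sketch is hypothetical, the chunk source is assumed to be a plain iterable of 2-D arrays, and the random generator is passed in explicitly instead of being derived from seed.

```python
import numpy as np

def shuffle_buffer_sketch(chunks, *, buffer_size, batch_size, rng):
    """Hypothetical sketch: yield shuffled, full batches from an iterable of 2-D arrays."""
    threshold = max(buffer_size, batch_size)  # buffer_size is only a lower bound
    pending = []       # carried-over rows first, then whole chunks
    n_pending = 0
    source = iter(chunks)
    exhausted = False
    while not exhausted:
        # Fill: take whole chunks until the threshold is met or the source ends.
        while n_pending < threshold:
            chunk = next(source, None)
            if chunk is None:
                exhausted = True
                break
            pending.append(chunk)
            n_pending += len(chunk)
        if n_pending == 0:
            break
        # Concatenate and shuffle; this transiently holds a second copy of the rows.
        buf = np.concatenate(pending)
        buf = buf[rng.permutation(len(buf))]
        n_full = len(buf) // batch_size
        for i in range(n_full):
            yield buf[i * batch_size:(i + 1) * batch_size]
        # Leftover rows are carried into the next fill, or dropped at the very end.
        rest = buf[n_full * batch_size:]
        pending = [rest] if (len(rest) and not exhausted) else []
        n_pending = len(rest) if not exhausted else 0

# Five chunks of 7 rows each (35 rows in total), one column.
chunks = [np.arange(i * 7, (i + 1) * 7).reshape(-1, 1) for i in range(5)]
batches = list(
    shuffle_buffer_sketch(chunks, buffer_size=10, batch_size=4,
                          rng=np.random.default_rng(0))
)
```

With these inputs the sketch yields 8 full batches of 4 rows; the 3 rows left when the source runs out are the trailing partial batch and are dropped.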

DataLoader calls this for you when shuffle=True; use it directly when you want explicit control over buffer_size independently of the loader.

On its own it does not fix a strongly time-ordered or row-ordered stream, because a bounded buffer only block-shuffles such data. For that, pre-shuffle on disk or interleave shards into chunk_source.

buffer_size is a lower bound. Each fill accumulates at least max(buffer_size, batch_size) rows before shuffling, so a buffer_size smaller than batch_size still yields full batches; the final fill stops at whatever the source has left. The chunk that crosses the threshold is kept whole, so the buffer holds fewer than max(buffer_size, batch_size) plus one chunk’s rows. Concatenating a fill into one shuffleable array transiently allocates a second copy of those rows, so peak allocation is about twice that bound.
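To make the memory bound concrete, the arithmetic can be worked through for one configuration. All numbers below (buffer size, batch size, largest chunk, row layout) are assumed values chosen for illustration, and the estimate covers only the row data, not any other overhead.

```python
import numpy as np

# Assumed, illustrative settings.
buffer_size = 10_000
batch_size = 256
max_chunk_rows = 4_096      # largest chunk the source is assumed to emit
n_columns = 32              # assumed row width
row_dtype = np.float64

threshold = max(buffer_size, batch_size)
# Before the last chunk of a fill the buffer holds at most threshold - 1 rows,
# so after adding a whole chunk it holds fewer than threshold + max_chunk_rows.
max_buffer_rows = threshold + max_chunk_rows - 1

row_bytes = n_columns * np.dtype(row_dtype).itemsize
buffer_bytes = max_buffer_rows * row_bytes
# Concatenating the fill makes a transient second copy of the rows.
peak_bytes = 2 * buffer_bytes

print(f"rows <= {max_buffer_rows}, peak ~ {peak_bytes / 1e6:.1f} MB")
```

Under these assumptions the buffer holds at most 14,095 rows, and the transient peak is roughly 7.2 MB.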

Each epoch (each call of the returned factory) draws a fresh permutation from a sub-stream of seed, so the shuffle order differs across epochs while staying reproducible for a given seed.
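One way to get per-epoch permutations that differ from each other yet are reproducible from a single seed is NumPy's SeedSequence.spawn. The sketch below assumes that mechanism; how shuffle_buffer actually derives its sub-streams is not stated here, and make_epoch_rng_factory is a hypothetical helper.

```python
import numpy as np

def make_epoch_rng_factory(seed):
    """Hypothetical helper: each call returns a Generator on a fresh sub-stream of seed."""
    root = np.random.SeedSequence(seed)

    def next_epoch_rng():
        # Successive spawn() calls on the same root return distinct children.
        (child,) = root.spawn(1)
        return np.random.default_rng(child)

    return next_epoch_rng

# Two factories built from the same seed replay the same sequence of epochs.
factory_a = make_epoch_rng_factory(123)
factory_b = make_epoch_rng_factory(123)

perm_a0 = factory_a().permutation(100)  # epoch 0
perm_a1 = factory_a().permutation(100)  # epoch 1
perm_b0 = factory_b().permutation(100)  # epoch 0, second run
perm_b1 = factory_b().permutation(100)  # epoch 1, second run
```

Here perm_a0 matches perm_b0 and perm_a1 matches perm_b1, while the epoch 0 and epoch 1 permutations differ from each other.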