pymc.variational.shuffle_buffer
- pymc.variational.shuffle_buffer(chunk_source, *, buffer_size, batch_size, seed=None)
Wrap a chunk source into a shuffled, fixed-size batch source.
Accumulates rows from chunk_source into a buffer of at least buffer_size rows, shuffles it, and yields batch_size slices; rows that do not fill a final batch are carried over into the next buffer (never dropped) until the source is exhausted, at which point a single trailing partial batch (< batch_size rows) is dropped. This approximates i.i.d. minibatches from an unordered or pre-shuffled stream.

DataLoader calls this for you when shuffle=True; use it directly when you want explicit control over buffer_size independently of the loader.

It does not by itself fix a strongly time/row-ordered stream (a bounded buffer only block-shuffles such data); pre-shuffle on disk, or interleave shards into chunk_source, for that.

buffer_size is a lower bound: each fill accumulates at least max(buffer_size, batch_size) rows before shuffling (so a buffer_size smaller than batch_size still yields full batches; the final fill stops at whatever the source has left), and the chunk that crosses the threshold is kept whole, so the buffer holds fewer than max(buffer_size, batch_size) plus one chunk's rows. Concatenating a fill into one shuffleable array transiently allocates a second copy of those rows, so peak allocation is about twice that bound.

Each epoch (each call of the returned factory) draws a fresh permutation from a sub-stream of seed, so the shuffle order differs across epochs while staying reproducible for a given seed.
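
The fill, shuffle, and carry-over behaviour described above can be sketched in pure Python. This is an illustrative re-implementation under stated assumptions, not the library's code: it assumes chunk_source is a zero-argument callable returning an iterable of row chunks (this page does not spell out its type), it uses plain lists and random.Random in place of arrays and the library's seeding, and the names shuffle_buffer_sketch and make_chunks are hypothetical.

```python
import random


def shuffle_buffer_sketch(chunk_source, *, buffer_size, batch_size, seed=None):
    """Illustrative stand-in for the behaviour described above (hypothetical)."""
    # Each fill accumulates at least this many rows before shuffling.
    threshold = max(buffer_size, batch_size)
    epoch = 0  # number of times the returned factory has been called

    def batches(chunks, rng):
        chunks = iter(chunks)
        carry = []  # rows left over from the previous fill
        exhausted = False
        while not exhausted:
            buffer = carry
            # Keep whole chunks, so the buffer may overshoot the threshold
            # by less than one chunk's rows.
            while len(buffer) < threshold:
                chunk = next(chunks, None)
                if chunk is None:
                    exhausted = True  # final fill: use whatever is left
                    break
                buffer.extend(chunk)
            rng.shuffle(buffer)
            n_full = len(buffer) // batch_size
            for i in range(n_full):
                yield buffer[i * batch_size:(i + 1) * batch_size]
            # Carried into the next fill; dropped only once the source is exhausted.
            carry = buffer[n_full * batch_size:]

    def factory():
        nonlocal epoch
        # Assumption: a per-epoch RNG keyed on (seed, epoch) stands in for
        # the "sub-stream of seed" mentioned in the text.
        rng = random.Random(None if seed is None else f"{seed}:{epoch}")
        epoch += 1
        return batches(chunk_source(), rng)

    return factory


def make_chunks():
    # Hypothetical source: 10 rows delivered in chunks of at most 3.
    return [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9]]


factory = shuffle_buffer_sketch(make_chunks, buffer_size=4, batch_size=4, seed=0)
for batch in factory():
    print(batch)  # 2 full batches of 4 rows; the 2 leftover rows are dropped
```

With these inputs the first fill holds 6 rows (two whole chunks), yields one batch, and carries 2 rows into the next fill, so no row is lost until the source runs out. Calling factory() again starts a new epoch with a different RNG key, while rebuilding the factory with the same seed replays the same sequence of epochs.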