pymc.variational.parquet_source

pymc.variational.parquet_source(directory, *, columns=None, pattern='*.parquet')

An IterableDataset over a directory of Parquet files.

Yields one (rows, n_columns) array per row group. Each file contains one or more row groups, so peak read memory is one row group rather than one file.

The column order is frozen at construction: columns if given, otherwise the first file’s schema order. Every shard is read in that order, so a shard with a permuted schema cannot silently reorder features mid-stream.

The dataset carries an n_rows attribute read from Parquet metadata (no data scan), so DataLoader(parquet_source(dir), ..., total_size="auto") resolves the dataset size without reading any column data. Pass shuffle=True to the DataLoader (or wrap the source in shuffle_buffer()) to get shuffled batches.