pymc.variational.parquet_source
- pymc.variational.parquet_source(directory, *, columns=None, pattern='*.parquet')
An `IterableDataset` over a directory of Parquet files. Yields one `(rows, n_columns)` array per row group (one or more per file), so peak read memory is one row group, not one file. The column order is frozen at construction (`columns` if given, else the first file's schema order) and every shard is read in that order, so a shard with a permuted schema cannot silently reorder features mid-stream. Carries an `n_rows` attribute read from Parquet metadata (no data scan) so that `DataLoader(parquet_source(dir), ..., total_size="auto")` resolves the dataset size for free. Pass `shuffle=True` to the `DataLoader` (or wrap in `shuffle_buffer()`) to get shuffled batches.