December 19th, 2022

Figuring out how to write partitioned parquet files

Trying to figure out how to write partitioned parquet files using either pandas or PyArrow directly (which pandas uses in the background). The goal is to have a single partitioned dataset for our whole feature set, so that we can access the features by their partition columns, for example for particular `gm_id` and `product_id` pairs. This should reduce the amount of I/O the workers need to do during hyperparameter tuning, since each worker can read only the partitions it needs rather than the whole feature set.