# Parquet
ParquetPool
ParquetPool (_cnt:int=0, recipe:configparser.ConfigParser, query:tspace.data.core.PoolQuery, meta:tspace.data.core.ObservationMeta, pl_path:Optional[pathlib.Path]=None, logger:Optional[logging.Logger]=None, dict_logger:Optional[dict]=None, ddf:Optional[dask.dataframe.core.DataFrame]=None)
*The pool class for storing and retrieving records in Apache Arrow Parquet files.
It uses the Pandas backend for Parquet, the PyArrow Parquet interface for metadata storage, and Dask DataFrame for data processing. Meta information is stored in the Parquet metadata (in the footer of the Parquet file).
Sampling random observation quadruples needs some care to ensure randomness. Here we apply the Dask DataFrame sample method and use Dask Delayed to parallelize data processing such as sampling.
Attributes:
pl_path: `Path` to the parquet file folder
meta: meta information of the pool
query: [`PoolQuery`](https://Binjian.github.io/tspace/01.data.core.html#poolquery) object to the pool
cnt: number of records in the pool
ddf: dask DataFrame object*
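The following is a minimal construction sketch based only on the signature shown above; the import path for `ParquetPool`, the recipe file name, and the placeholder arguments are illustrative assumptions, not values prescribed by the library.

```python
from configparser import ConfigParser
from pathlib import Path

from tspace.data.core import ObservationMeta, PoolQuery
# Assumed import path for ParquetPool (not stated on this page); adjust to your install.
from tspace.data.dask import ParquetPool

recipe = ConfigParser()
recipe.read("pool.ini")            # hypothetical recipe file describing the pool

query = PoolQuery(...)             # fill in the query fields your deployment uses
meta = ObservationMeta(...)        # meta information describing the stored records

pool = ParquetPool(
    recipe=recipe,
    query=query,
    meta=meta,
    pl_path=Path("data/parquet"),  # folder holding the parquet files
)
```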
DaskPool.find
DaskPool.find (query:tspace.data.core.PoolQuery)
*Find records by PoolQuery.

Args:
    query: a [`PoolQuery`](https://Binjian.github.io/tspace/01.data.core.html#poolquery) object

Return:
    a DataFrame with all records matching the query specification*
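A usage sketch, assuming `pool` and `query` were built as in the construction example above:

```python
# find() returns a DataFrame with the records matching the query specification.
matched = pool.find(query)
print(len(matched), "matching records")
```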
ParquetPool.get_query
ParquetPool.get_query (query:Optional[tspace.data.core.PoolQuery]=None)
*Get records from the Dask DataFrame Parquet storage by query.

Args:
    query: [`PoolQuery`](https://Binjian.github.io/tspace/01.data.core.html#poolquery) object to the pool

Return:
    A Dask DataFrame with all records in the query time range*
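Because the result is a Dask DataFrame, it is evaluated lazily; the standard Dask `.compute()` call materializes it. A sketch under the same assumptions as above:

```python
# get_query() returns a lazy Dask DataFrame covering the query time range.
lazy_records = pool.get_query(query)
records = lazy_records.compute()   # materialize into an in-memory DataFrame
```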
ParquetPool.sample
ParquetPool.sample (size:int=4, query:tspace.data.core.PoolQuery)
*Sample a batch of records from the Arrow Parquet pool with fractional sampling.

Args:
    size: number of records in the batch
    query: [`PoolQuery`](https://Binjian.github.io/tspace/01.data.core.html#poolquery) object to the pool

Return:
    A Pandas DataFrame with the sampled records in the query range*
| | **Type** | **Default** | **Details** |
|---|---|---|---|
| size | int | 4 | |
| query | PoolQuery | | |
| **Returns** | **pd.DataFrame** | | |
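A usage sketch with the documented default batch size, under the same assumptions as the earlier examples:

```python
# sample() draws a random batch of records matching the query (fractional sampling).
batch = pool.sample(size=4, query=query)
print(batch.shape)  # expect 4 rows, assuming the pool holds at least 4 matching records
```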
ParquetPool.store
ParquetPool.store (episode:pandas.core.frame.DataFrame)
Deposit an episode, with all records of every time step, into the Arrow Parquet storage.
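A storage sketch; the episode DataFrame here is a hypothetical placeholder whose schema must match the pool's ObservationMeta:

```python
import pandas as pd

# Hypothetical episode: one row per time step, columns as described by the pool's meta.
episode = pd.DataFrame(...)  # build or collect the episode records here
pool.store(episode)          # deposit the episode into the parquet storage
```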
ParquetPool.close
ParquetPool.close ()
Close the pool.
ParquetPool.load
ParquetPool.load ()
Load RECORD arrays from the Parquet files in the folder specified by the recipe.
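A closing sketch, loading the stored record arrays and then releasing the pool, under the same assumptions as the earlier examples:

```python
# load() reads the RECORD arrays from the parquet folder named in the recipe;
# close() releases the pool once work is finished.
pool.load()
# ... work with the loaded records ...
pool.close()
```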