Parquet

ParquetPool class

source

ParquetPool

 ParquetPool (recipe:configparser.ConfigParser,
              query:tspace.data.core.PoolQuery,
              meta:tspace.data.core.ObservationMeta,
              pl_path:Optional[pathlib.Path]=None,
              logger:Optional[logging.Logger]=None,
              dict_logger:Optional[dict]=None,
              ddf:Optional[dask.dataframe.core.DataFrame]=None,
              _cnt:int=0)

*The pool class for storing and retrieving records in Apache Arrow parquet files.

It uses the Pandas backend for Parquet, the PyArrow Parquet interface for metadata storage, and Dask DataFrame for data processing. Meta information is stored in the Parquet metadata (in the footer of the parquet file).

Sampling random observation quadruples needs some care to ensure randomness. Here we apply the Dask DataFrame sample method and use Dask Delayed to parallelize data processing such as sampling.

Attributes:

pl_path: `Path` to the parquet file folder
meta: meta information of the pool
query: [`PoolQuery`](https://Binjian.github.io/tspace/01.data.core.html#poolquery) object to the pool
cnt: number of records in the pool
ddf: dask DataFrame object*
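The constructor wires together a recipe, a query, and observation metadata. Below is a minimal construction sketch; the import path, the recipe section and key names, and the elided `PoolQuery`/`ObservationMeta` values are assumptions for illustration, not the library's documented defaults:

```python
from configparser import ConfigParser
from pathlib import Path

from tspace.data.core import PoolQuery, ObservationMeta

# The module path of ParquetPool is an assumption; adjust to your installation.
from tspace.storage.pool.parquet import ParquetPool

# Hypothetical recipe: only a data-folder key is sketched here.
recipe = ConfigParser()
recipe["DEFAULT"] = {"data_folder": "data/pool"}

query: PoolQuery = ...        # a pre-built PoolQuery (construction elided)
meta: ObservationMeta = ...   # a pre-built ObservationMeta (construction elided)

pool = ParquetPool(
    recipe=recipe,
    query=query,
    meta=meta,
    pl_path=Path("data/pool"),  # optional explicit path to the parquet folder
)
```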

source

DaskPool.find

 DaskPool.find (query:tspace.data.core.PoolQuery)

*Find records by PoolQuery.

Args:

query: a [`PoolQuery`](https://Binjian.github.io/tspace/01.data.core.html#poolquery) object

Return: a DataFrame with all records matching the query specification*
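A hedged usage sketch, reusing the `pool` and `query` objects from the construction example above:

```python
# Fetch all records matching the query; the result is a pandas DataFrame.
df = pool.find(query)
print(len(df), "records matched the query")
```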


source

ParquetPool.get_query

 ParquetPool.get_query (query:Optional[tspace.data.core.PoolQuery]=None)

*Get records from the Dask DataFrame parquet storage matching a query.

Args:

query: PoolQuery object to the pool

Return:

A Dask DataFrame with all records in the query time range*
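Because the result is a Dask DataFrame, it is lazy: no parquet data is read until it is computed. A sketch of materializing it, assuming the `pool` instance from above:

```python
# get_query returns a lazy Dask DataFrame over the matching time range.
ddf = pool.get_query(query)
if ddf is not None:
    df = ddf.compute()  # trigger the actual parquet read; yields a pandas DataFrame
```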

source

ParquetPool.sample

 ParquetPool.sample (query:tspace.data.core.PoolQuery, size:int=4)

*Sample a batch of records from the Arrow parquet pool with fractional sampling.

Args:

query: PoolQuery object to the pool

size: number of records in the batch

Return: a Pandas DataFrame with all records in the query range*

|  | Type | Default | Details |
|---|------|---------|---------|
| query | PoolQuery |  | PoolQuery object to the pool |
| size | int | 4 | number of records in the batch |
| Returns | pd.DataFrame |  |  |
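Dask's `DataFrame.sample` accepts only a fraction (`frac`), not an absolute row count, which is why the docstring speaks of fractional sampling. The standalone sketch below shows the underlying idea; the final trim-to-size step is an assumption about how a fixed batch size would be obtained:

```python
import dask.dataframe as dd
import pandas as pd

# Standalone illustration of fractional sampling with Dask.
pdf = pd.DataFrame({"x": range(100)})
ddf = dd.from_pandas(pdf, npartitions=4)

size = 4
frac = size / len(pdf)                    # fraction equivalent to the batch size
batch = ddf.sample(frac=frac).compute()   # row count is only approximately `size`
batch = batch.head(size)                  # trim to an exact batch (assumption)

# On the pool itself (keyword arguments sidestep parameter order):
# batch = pool.sample(size=4, query=query)
```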

source

ParquetPool.store

 ParquetPool.store (episode:pandas.core.frame.DataFrame)

*Deposit an episode with all records in every time step into Apache Arrow parquet.*
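A usage sketch; the episode's exact column layout is defined by `ObservationMeta` and elided here:

```python
import pandas as pd

episode: pd.DataFrame = ...  # one episode, one row per time step (layout elided)
pool.store(episode)          # append the episode to the parquet folder
pool.close()                 # release resources when the pool is no longer needed
```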


source

ParquetPool.close

 ParquetPool.close ()

*Close the pool.*


source

ParquetPool.load

 ParquetPool.load ()

*Load RECORD arrays from parquet files in the folder specified by the recipe.*
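A usage sketch; that `load` repopulates `ddf` and `cnt` is an assumption based on the attributes listed at the top of this page:

```python
# Reload the pool from disk, e.g. after restarting a training run.
pool.load()
print(pool.cnt)  # number of records available after loading
```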