Dask
DaskPool class
DaskPool
DaskPool (_cnt:int=0, recipe:configparser.ConfigParser, query:tspace.data.core.PoolQuery, meta:tspace.data.core.ObservationMeta, pl_path:Optional[pathlib.Path]=None, logger:Optional[logging.Logger]=None, dict_logger:Optional[dict]=None)
*Base class that provides the shared interface and attributes for ParquetPool and AvroPool.
Features:
- uses Dask DataFrames for lazy data processing
- uses dask.delayed to parallelize data processing such as sampling
Attributes:
- recipe: a config file for the pool
- pl_path: the pool path, a Path object pointing to the Parquet file for RECORD or to the Avro file for EPISODE
- query: a PoolQuery object
- meta: the meta information for the data collection
- logger: a logger object
- dict_logger: a dictionary logger object*
DaskPool.get_query
DaskPool.get_query (query:Optional[tspace.data.core.PoolQuery]=None)
*Get records by PoolQuery
Args:
query: a [`PoolQuery`](https://Binjian.github.io/tspace/01.data.core.html#poolquery) object
return:
a DataFrame with all records in the query time range*
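The time-range selection described above can be sketched in plain Python. This is only an illustration: `ToyQuery` and its `start`/`end` fields are hypothetical stand-ins for the real `PoolQuery`, the `timestamp` key is assumed, and a list of dicts stands in for the DataFrame.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class ToyQuery:
    """Hypothetical stand-in for PoolQuery: carries only a time range."""
    start: datetime
    end: datetime


def get_query(records: list[dict], query: Optional[ToyQuery] = None) -> list[dict]:
    """Return the records whose timestamp falls inside the query range."""
    if query is None:
        # Assumption: no query means no filtering.
        return list(records)
    return [r for r in records if query.start <= r["timestamp"] <= query.end]


records = [
    {"timestamp": datetime(2023, 1, 1, 8), "speed": 10.0},
    {"timestamp": datetime(2023, 1, 1, 12), "speed": 12.5},
    {"timestamp": datetime(2023, 1, 2, 9), "speed": 9.0},
]
day_one = ToyQuery(start=datetime(2023, 1, 1), end=datetime(2023, 1, 1, 23, 59))
selected = get_query(records, day_one)
```

Here `selected` holds the two records stamped on the first day; the third record falls outside the range and is dropped.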
DaskPool.sample
DaskPool.sample (size:int, query:Optional[tspace.data.core.PoolQuery]=None)
*Sample a batch of data from the pool
An abstract method to be implemented by the derived classes ParquetPool and AvroPool.
Args:
size: the number of records to be sampled
query: a [`PoolQuery`](https://Binjian.github.io/tspace/01.data.core.html#poolquery) object
Return: a Pandas DataFrame*
Name | Type | Default | Details |
---|---|---|---|
size | int | | required size of samples |
query | Optional[PoolQuery] | None | PoolQuery object, query specification |
Returns | pd.DataFrame | | |
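The abstract-method arrangement described for `sample` can be sketched with the standard `abc` module. The class names `ToyPool` and `ToyListPool` are hypothetical, and an in-memory list stands in for the Parquet or Avro backend; this is not the library's implementation.

```python
import random
from abc import ABC, abstractmethod


class ToyPool(ABC):
    """Hypothetical base class: declares the interface only."""

    @abstractmethod
    def sample(self, size: int) -> list[dict]:
        """Return `size` records drawn from the pool."""


class ToyListPool(ToyPool):
    """Hypothetical concrete pool backed by an in-memory list."""

    def __init__(self, records: list[dict], seed: int = 0) -> None:
        self.records = records
        self._rng = random.Random(seed)  # seeded for reproducible draws

    def sample(self, size: int) -> list[dict]:
        # Draw without replacement; raises ValueError if size > len(records).
        return self._rng.sample(self.records, size)


pool = ToyListPool([{"id": i} for i in range(10)])
batch = pool.sample(3)
```

Instantiating `ToyPool` directly raises `TypeError`, which is what forces each derived class to supply its own `sample`.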
DaskPool._count
DaskPool._count (query:Optional[tspace.data.core.PoolQuery]=None)
*Count the number of records in the db.
Args:
query: a [`PoolQuery`](https://Binjian.github.io/tspace/01.data.core.html#poolquery) object
Return:
the number of records in the db*
DaskPool.__post_init__
DaskPool.__post_init__ ()
Parse the recipe and set the pool path.
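As a rough illustration of this pattern, a dataclass can derive its path from a `ConfigParser` recipe in `__post_init__`. The class name `ToyRecipePool` and the recipe keys `data_folder` and `pool_name` are assumptions made for this sketch; the actual recipe layout is not documented here.

```python
import configparser
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass
class ToyRecipePool:
    recipe: configparser.ConfigParser
    pl_path: Optional[Path] = None

    def __post_init__(self) -> None:
        # Runs right after the dataclass-generated __init__.
        if self.pl_path is None:
            # Hypothetical keys; the real recipe layout may differ.
            folder = self.recipe["DEFAULT"]["data_folder"]
            name = self.recipe["DEFAULT"]["pool_name"]
            self.pl_path = Path(folder) / name


recipe = configparser.ConfigParser()
recipe.read_string(
    "[DEFAULT]\ndata_folder = /tmp/data\npool_name = records.parquet\n"
)
pool = ToyRecipePool(recipe=recipe)
explicit = ToyRecipePool(recipe=recipe, pl_path=Path("/other/pool.avro"))
```

In this sketch an explicitly passed `pl_path` is left untouched, so the recipe only supplies a default.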
DaskPool.find
DaskPool.find (query:tspace.data.core.PoolQuery)
*Find records by PoolQuery
Args:
query: a [`PoolQuery`](https://Binjian.github.io/tspace/01.data.core.html#poolquery) object
return: a DataFrame with all records matching query specification*