Dask

DaskPool class


DaskPool

 DaskPool (_cnt:int=0, recipe:configparser.ConfigParser,
           query:tspace.data.core.PoolQuery,
           meta:tspace.data.core.ObservationMeta,
           pl_path:Optional[pathlib.Path]=None,
           logger:Optional[logging.Logger]=None,
           dict_logger:Optional[dict]=None)

*The base pool class that defines the shared interface and attributes for ParquetPool and AvroPool.

It has the following features:

- uses Dask DataFrames for lazy data processing
- uses Dask delayed to parallelize data processing such as sampling

Attributes:

- recipe: a ConfigParser object holding the configuration for the pool
- pl_path: the pool path, a Path object pointing to the Parquet file for RECORD or to the Avro file for EPISODE
- query: a PoolQuery object
- meta: the meta information for the data collection
- logger: a logger object
- dict_logger: a dictionary logger object*


DaskPool.get_query

 DaskPool.get_query (query:Optional[tspace.data.core.PoolQuery]=None)

*Get records by PoolQuery

Args:

query: a [`PoolQuery`](https://Binjian.github.io/tspace/01.data.core.html#poolquery) object

Return:

a DataFrame with all records in the query time range*
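
As a rough illustration of selecting records in a query time range, the sketch below filters a pandas DataFrame by timestamp. `TimeRangeQuery` is a hypothetical stand-in for `PoolQuery`, whose real fields may differ, and the fallback to a default query when `query` is `None` is an assumption suggested by the `Optional` signature, not confirmed behavior.

```python
from dataclasses import dataclass
from typing import Optional

import pandas as pd


@dataclass
class TimeRangeQuery:
    """Hypothetical stand-in for PoolQuery; the real fields may differ."""

    start: pd.Timestamp
    end: pd.Timestamp


def get_query(
    df: pd.DataFrame,
    default: TimeRangeQuery,
    query: Optional[TimeRangeQuery] = None,
) -> pd.DataFrame:
    # Assumption: when no query is passed, fall back to a default one.
    q = query if query is not None else default
    # Keep rows whose timestamp lies in the closed interval [start, end].
    mask = (df["timestamp"] >= q.start) & (df["timestamp"] <= q.end)
    return df[mask]


records = pd.DataFrame(
    {
        "timestamp": pd.date_range("2024-01-01", periods=5, freq="D"),
        "value": range(5),
    }
)
default = TimeRangeQuery(pd.Timestamp("2024-01-01"), pd.Timestamp("2024-01-05"))
narrow = TimeRangeQuery(pd.Timestamp("2024-01-02"), pd.Timestamp("2024-01-03"))

selected = get_query(records, default, narrow)
everything = get_query(records, default)
```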


DaskPool.sample

 DaskPool.sample (size:int,
                  query:Optional[tspace.data.core.PoolQuery]=None)

*Sample a batch of data from the pool

An abstract method to be implemented by the derived classes ParquetPool and AvroPool.

Args:

size: the number of records to be sampled
query: a [`PoolQuery`](https://Binjian.github.io/tspace/01.data.core.html#poolquery) object

Return: a Pandas DataFrame*

| | Type | Default | Details |
|---|---|---|---|
| size | int | | required size of samples |
| query | Optional[PoolQuery] | None | PoolQuery object, query specification |
| Returns | pd.DataFrame | | |
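
The abstract-method pattern described for `sample` can be sketched with the standard `abc` module. Whether DaskPool itself uses `abc` is not stated here, so treat this as one possible shape: `BasePool` and `ListPool` are hypothetical names, the query is simplified to a plain dict, and records are returned as a list rather than a pandas DataFrame to keep the example dependency-free.

```python
import random
from abc import ABC, abstractmethod
from typing import List, Optional


class BasePool(ABC):
    """Hypothetical base class standing in for DaskPool."""

    @abstractmethod
    def sample(self, size: int, query: Optional[dict] = None) -> List[dict]:
        """Return `size` records, optionally restricted by `query`."""


class ListPool(BasePool):
    """Hypothetical concrete pool backed by an in-memory list."""

    def __init__(self, records: List[dict], seed: int = 0) -> None:
        self.records = records
        self.rng = random.Random(seed)

    def sample(self, size: int, query: Optional[dict] = None) -> List[dict]:
        # Keep only records whose fields match every key in the query.
        candidates = [
            r
            for r in self.records
            if query is None or all(r.get(k) == v for k, v in query.items())
        ]
        # Sample without replacement from the matching records.
        return self.rng.sample(candidates, size)


pool = ListPool([{"vehicle": "A", "step": i} for i in range(10)])
batch = pool.sample(3, query={"vehicle": "A"})
```

Instantiating `BasePool` directly raises `TypeError`, so every derived pool is forced to provide its own `sample`.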


DaskPool._count

 DaskPool._count (query:Optional[tspace.data.core.PoolQuery]=None)

*Count the number of records in the db.

Args:

query: a [`PoolQuery`](https://Binjian.github.io/tspace/01.data.core.html#poolquery) object

Return:

the number of records in the db*


DaskPool.__post_init__

 DaskPool.__post_init__ ()

Parse the recipe and set the pool path.
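
A minimal sketch of this step, assuming a dataclass whose `__post_init__` derives the path from the recipe only when no path was supplied. The class `RecipePool` and the recipe keys `data_folder` and `pool_file` are hypothetical; the actual keys read by DaskPool are not documented here.

```python
import configparser
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass
class RecipePool:
    """Hypothetical stand-in for a pool; not the tspace DaskPool."""

    recipe: configparser.ConfigParser
    pl_path: Optional[Path] = None

    def __post_init__(self) -> None:
        # Derive the path from the recipe only if the caller gave none.
        if self.pl_path is None:
            folder = self.recipe["DEFAULT"]["data_folder"]  # hypothetical key
            name = self.recipe["DEFAULT"]["pool_file"]  # hypothetical key
            self.pl_path = Path(folder) / name


recipe = configparser.ConfigParser()
recipe.read_string(
    """
[DEFAULT]
data_folder = /tmp/pool
pool_file = records.parquet
"""
)

pool = RecipePool(recipe=recipe)
explicit = RecipePool(recipe=recipe, pl_path=Path("other.avro"))
```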



DaskPool.find

 DaskPool.find (query:tspace.data.core.PoolQuery)

*Find records matching a PoolQuery

Args:

query: a [`PoolQuery`](https://Binjian.github.io/tspace/01.data.core.html#poolquery) object

Return: a DataFrame with all records matching the query specification*