Welcome to padawan’s documentation!¶
padawan is a tool for out-of-core processing of partitioned tabular datasets which are too large to hold completely in memory. It uses polars for representing and manipulating tabular data in memory and the parquet format for storing partitions on disk.
polars is a library for ‘SQL-type’ in-memory data manipulation. It provides roughly the same functionality as pandas but has a cleaner API and superiour performance – especially on multi-core computers since it consistently utilises all available cores. While polars has some capabilities for out-of-core processing these are (currently) somewhat limited. For example, polars cannot handle situations where the result of a computation is too large to fit in memory. This is where padawan can help.
The central object in padawan is the padawan.Dataset. It has the
semantics of a list of polars dataframes (polars.LazyFrame objects, to
be exact) which represent the partitions. Each dataset specifies a set of
index columns and keeps track of the upper and lower bounds of the index
columns for each partition. This means that certain operations like slicing or
joins on the index columns can be carried out efficiently and without visiting
the full set of partitions. Furthermore, the supported operations are carried
out in a lazy fashion and partitions are only pulled into memory when needed.
Contrary to packages like pyspark or dask.dataframe padawan does not
attempt to implement its own, complete dataframe API. It focuses on
functionality for managing the partitioning (collate, repartition etc.) and on
operations which can be done efficiently on partitioned data with known
partition boundaries (slicing, joins). All other forms of data manipulation are
left to the polars API which can be accessed directly by mapping a custom
function over the partitions via padawan.Dataset.map().
Since polars makes efficient use of all available CPUs in most situations,
parallelisation can usually be left to polars. However, for cases where a
substantial part of the computation is done by the Python interpreter (and is
therefore subject to the limitations of the GIL) padawan also offers a
convenient mechanism for parallelising computations via the multiprocessing
module. Furthermore, it uses cloudpickle to allow the parallelisation of
lambda functions. Note that, unlike pyspark or dask.dataframe padawan
is only intended for computations on a single node and does not offer
functionality for distributing computations on a cluster.
Contents: