Welcome to padawan’s documentation!

padawan is a tool for out-of-core processing of partitioned tabular datasets which are too large to hold completely in memory. It uses polars for representing and manipulating tabular data in memory and the parquet format for storing partitions on disk.

polars is a library for ‘SQL-type’ in-memory data manipulation. It provides roughly the same functionality as pandas but has a cleaner API and superiour performance – especially on multi-core computers since it consistently utilises all available cores. While polars has some capabilities for out-of-core processing these are (currently) somewhat limited. For example, polars cannot handle situations where the result of a computation is too large to fit in memory. This is where padawan can help.

The central object in padawan is the padawan.Dataset. It has the semantics of a list of polars dataframes (polars.LazyFrame objects, to be exact) which represent the partitions. Each dataset specifies a set of index columns and keeps track of the upper and lower bounds of the index columns for each partition. This means that certain operations like slicing or joins on the index columns can be carried out efficiently and without visiting the full set of partitions. Furthermore, the supported operations are carried out in a lazy fashion and partitions are only pulled into memory when needed.

Contrary to packages like pyspark or dask.dataframe padawan does not attempt to implement its own, complete dataframe API. It focuses on functionality for managing the partitioning (collate, repartition etc.) and on operations which can be done efficiently on partitioned data with known partition boundaries (slicing, joins). All other forms of data manipulation are left to the polars API which can be accessed directly by mapping a custom function over the partitions via padawan.Dataset.map().

Since polars makes efficient use of all available CPUs in most situations, parallelisation can usually be left to polars. However, for cases where a substantial part of the computation is done by the Python interpreter (and is therefore subject to the limitations of the GIL) padawan also offers a convenient mechanism for parallelising computations via the multiprocessing module. Furthermore, it uses cloudpickle to allow the parallelisation of lambda functions. Note that, unlike pyspark or dask.dataframe padawan is only intended for computations on a single node and does not offer functionality for distributing computations on a cluster.

Indices and tables