.. padawan documentation master file, created by sphinx-quickstart on Fri Jan 20 19:33:54 2023. You can adapt this file completely to your liking, but it should at least contain the root `toctree` directive. Welcome to padawan's documentation! =================================== `padawan`_ is a tool for out-of-core processing of partitioned tabular datasets which are too large to hold completely in memory. It uses `polars`_ for representing and manipulating tabular data in memory and the `parquet`_ format for storing partitions on disk. `polars`_ is a library for 'SQL-type' in-memory data manipulation. It provides roughly the same functionality as `pandas`_ but has a cleaner API and superiour performance -- especially on multi-core computers since it consistently utilises all available cores. While `polars`_ has some capabilities for out-of-core processing these are (currently) somewhat limited. For example, `polars`_ cannot handle situations where the result of a computation is too large to fit in memory. This is where `padawan`_ can help. The central object in `padawan`_ is the :py:class:`padawan.Dataset`. It has the semantics of a list of `polars`_ dataframes (``polars.LazyFrame`` objects, to be exact) which represent the partitions. Each dataset specifies a set of *index columns* and keeps track of the upper and lower bounds of the index columns for each partition. This means that certain operations like slicing or joins on the index columns can be carried out efficiently and without visiting the full set of partitions. Furthermore, the supported operations are carried out in a lazy fashion and partitions are only pulled into memory when needed. Contrary to packages like `pyspark`_ or `dask.dataframe`_ `padawan`_ does not attempt to implement its own, complete dataframe API. It focuses on functionality for managing the partitioning (collate, repartition etc.) and on operations which can be done efficiently on partitioned data with known partition boundaries (slicing, joins). All other forms of data manipulation are left to the `polars`_ API which can be accessed directly by mapping a custom function over the partitions via :py:meth:`padawan.Dataset.map`. Since `polars`_ makes efficient use of all available CPUs in most situations, parallelisation can usually be left to `polars`_. However, for cases where a substantial part of the computation is done by the Python interpreter (and is therefore subject to the limitations of the GIL) `padawan`_ also offers a convenient mechanism for parallelising computations via the ``multiprocessing`` module. Furthermore, it uses `cloudpickle`_ to allow the parallelisation of lambda functions. Note that, unlike `pyspark`_ or `dask.dataframe`_ `padawan`_ is only intended for computations on a single node and does not offer functionality for distributing computations on a cluster. .. _padawan: https://github.com/mwiebusch78/padawan .. _polars: https://pola-rs.github.io/polars/py-polars/html/index.html .. _parquet: https://parquet.apache.org/ .. _pandas: https://pandas.pydata.org/ .. _dask.dataframe: https://docs.dask.org/en/stable/dataframe.html .. _pyspark: https://spark.apache.org/docs/latest/api/python/ .. _cloudpickle: https://pypi.org/project/cloudpickle/ .. toctree:: :maxdepth: 2 :caption: Contents: api Indices and tables ================== * :ref:`genindex` * :ref:`modindex` * :ref:`search`