Tables#

The Table Class#

class quivr.Table(table, **kwargs)#

Table is the primary data structure in quivr.

Tables are used to represent tabular data with a fixed schema. The schema is defined by subclassing Table, providing Column objects as class attributes. The Table class will then generate a pyarrow schema from those columns.

Table instances can then be created from data, either by passing in a pyarrow Table, or by passing in data in a variety of other formats. The data will be validated against the schema, and converted to a pyarrow Table.

Table instances are immutable, but can be sliced, filtered, sorted, or otherwise manipulated, resulting in new Table instances. In particular, see the Table.set_column() method, which returns a copy of the Table with a single column replaced.

Constructors#

Warning

Table instances should only be instantiated using the constructors. The Table.__init__ method only accepts a “raw” pyarrow.Table, does not validate its arguments (not even to check that the table has appropriate columns), and does not allow setting Table Attributes.

See also: Serialization/Deserialization.

classmethod Table.from_kwargs(validate=True, **kwargs)#

Create a Table instance from keyword arguments.

Each keyword argument corresponds to a column in the Table.

The keys should correspond to column names or attribute names.

For columns, the values should be arrays, lists, or pyarrow Arrays.

For attributes, the values should be the appropriate type for that attribute.

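Parameters:
  • validate (bool) – Whether to validate the data against the schema and run any column validators.

  • **kwargs – Column data and attribute values, keyed by column name or attribute name.

Return type: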

Self

classmethod Table.from_pyarrow(table, validate=True, permit_nulls=False, **kwargs)#

Create a new table from a pyarrow Table.

This is a convenience method which can be used to create a Table from a pyarrow Table. It can also accept keyword-style arguments to set attributes on the table.

When serializing to pyarrow, the table’s schema metadata encodes the value of attributes. This method will use that metadata to set the attributes on the table if it is available. If any attributes are provided as keyword arguments, they override any values in the metadata.

Parameters:
  • table (Table) – The pyarrow Table to create the table from.

  • validate (bool) – Whether to validate the table against the schema, and run any column validators.

  • permit_nulls (bool) – Whether to permit null values in the table. If True, nulls will be permitted, even in non-nullable fields. This is used when a Table is used as a nullable subtable.

  • **kwargs (AttributeValueType) – Keyword arguments to set attributes on the table.

Return type:

Self

Returns:

A new Table instance.

classmethod Table.from_dataframe(df, validate=True, **kwargs)#

Load a DataFrame into the Table.

If the DataFrame is missing any of the Table’s columns, an error is raised. If the DataFrame has extra columns, they are ignored.

This function cannot load “flattened” dataframes. This only matters for nested Tables which contain other Table definitions as columns. For that use case, either load an unflattened DataFrame, or use from_flat_dataframe.

Parameters:
  • df (DataFrame) – A pandas DataFrame containing the data to load.

  • validate (bool) – Whether to validate the data after loading.

  • **kwargs (AttributeValueType) – Additional keyword arguments for any Table attributes.

Return type:

Self

classmethod Table.from_flat_dataframe(df, validate=True, **kwargs)#

Load a flattened DataFrame into the Table.

Caution

Known bug: Doesn’t correctly interpret fixed-length lists.

Parameters:
  • df (DataFrame) – A pandas DataFrame containing the data to load.

  • validate (bool) – Whether to validate the data after loading.

  • **kwargs (AttributeValueType) – Additional keyword arguments for any Table attributes.

Return type:

Self

classmethod Table.empty(**kwargs)#

Create an empty instance of the table.

Parameters:

**kwargs – Additional keyword arguments to set the Table’s attributes.

Return type:

Self

Table.with_table(table)#
Return type:

Self

Serialization/Deserialization#

These methods handle serializing the Table to and from other formats, most often for writing to or reading from disk.

classmethod Table.from_parquet(path, memory_map=False, pq_buffer_size=0, filters=None, column_name_map=None, validate=True, **kwargs)#

Read a table from a Parquet file.

Parameters:
  • path (str) – The path to the Parquet file.

  • memory_map (bool) – If True, memory-map the file, otherwise read it into memory.

  • pq_buffer_size (int) – If positive, perform read buffering when deserializing individual column chunks. Otherwise, IO calls are unbuffered.

  • filters (Optional[Expression]) – An optional filter predicate to apply to the data. Rows which do not match the predicate will be removed from scanned data. For more information, see the PyArrow documentation on pyarrow.parquet.read_table and its filters parameter.

  • column_name_map (Optional[dict[str, str]]) – An optional dictionary mapping column names in the Parquet file to column names in the resulting Table. This is useful if the Parquet file contains column names that are not valid Python identifiers, or if you want to rename columns for any other reason.

  • validate (bool) – Whether to run column validation on the resulting Table.

  • **kwargs – Additional keyword arguments to pass to Self’s __init__ method.

Return type:

Self

classmethod Table.from_feather(path, validate=True, **kwargs)#

Read a table from a Feather file.

Parameters:
  • path (str) – The path to the Feather file.

  • validate (bool) – Whether to run column validators on the table after loading it.

  • **kwargs – Additional keyword arguments to pass to Self’s __init__ method.

Return type:

Self

classmethod Table.from_csv(input_file, validate=True, **kwargs)#

Read a table from a CSV file.

Parameters:
  • input_file (Union[str, PathLike, IOBase]) – The path to the CSV file, or a file-like object.

  • validate (bool) – Whether to validate the data after loading.

  • **kwargs – Additional keyword arguments to set the Table’s attributes.

Return type:

Self

Table.to_parquet(path, **kwargs)#

Write the table to a Parquet file.

Parameters:
  • path (str) – The path to write the Parquet file to.

  • kwargs – Additional arguments to pass to pyarrow.parquet.write_table.

Table.to_feather(path, **kwargs)#

Write the table to a Feather file.

Parameters:
  • path (str) – The path to write the Feather file to.

  • kwargs – Additional arguments to pass to pyarrow.feather.write_feather.

Table.to_csv(path, attribute_columns=True)#

Write the table to a CSV file. Any nested structure is flattened.

Parameters:
  • path (str) – The path to write the CSV file to.

  • attribute_columns (bool) – If True, store any Attributes defined for the table (or its subtable columns) as columns in the CSV file. If False, do not store any Attribute data in the CSV file.

Validation#

Table.validate()#

Validate the table against the schema, raising an exception if invalid.

Table.is_valid()#

Validate the table against the schema, returning True if it is valid and False otherwise.

Return type:

bool

Interoperability#

classmethod Table.as_column(nullable=True, metadata=None)#

Embed the Table as a column in another Table.

This method is the primary way to achieve composition of Tables with quivr.

Parameters:
  • nullable (bool) – Whether the column can contain nulls. Note that this refers to whether an entire row - all columns - can be null, not whether a single value in one column of this table can be null. That is controlled entirely by this Table class’s columns.

  • metadata (Optional[Dict[Union[bytes, str], Union[bytes, str]]]) – Metadata to attach to the column.

Return type:

SubTableColumn[Self]

Table.to_structarray()#

Returns self as a StructArray.

This only works if self is not fragmented. If table.fragmented() is True, call table = quivr.defragment(table) first.

Raises:

TableFragmentedError – if the table is fragmented.

Return type:

StructArray

Table.flattened_table()#

Completely flatten the Table’s underlying Arrow table, taking into account any nested structure, and return the data table itself.

Return type:

Table

Table.to_dataframe(flatten=True)#

Returns self as a pandas DataFrame.

If flatten is true, then any nested hierarchy is flattened: if the Table’s schema contains a struct named “foo” with fields “a”, “b”, and “c”, then the resulting DataFrame will include columns “foo.a”, “foo.b”, “foo.c”. This is done fully for any deeply nested structure, for example “foo.bar.baz.c”.

If flatten is false, then that struct will be in a single “foo” column, and the values of the column will be dictionaries representing the struct values.

Parameters:

flatten (bool) – Whether to flatten the table’s structure.

Return type:

DataFrame

Table.column(column_name)#

Returns the column with the given name as a raw pyarrow ChunkedArray.

Parameters:

column_name (str) – The name of the column to return.

Return type:

ChunkedArray

Filtering, Selection, Sorting#

Table.select(column_name, value)#

Select from the table by exact match, returning a new Table which only contains rows for which the value in column_name equals value.

Parameters:
  • column_name (str) – The name of the column to select on.

  • value (Any) – The value to match.

Return type:

Self

Table.sort_by(by)#

Sorts the Table by the given column name (or multiple columns). This operation requires a copy, and returns a new Table using the copied data.

by should be a column name to sort by, or a list of (column, order) tuples, where order can be “ascending” or “descending”.

Parameters:

by (Union[str, list[tuple[str, str]]]) – The column name or list of (column, order) tuples to sort by.

Return type:

Self

Table.take(row_indices)#

Return a new Table with only the rows at the given indices.

Parameters:

row_indices (Union[list[int], IntegerArray]) – The indices of the rows to return.

Return type:

Self

Table.apply_mask(mask)#

Return a new table with rows filtered to match a boolean mask.

The mask must have the same length as the table. At each index, if the mask’s value is True, the row will be included in the new table; if False, it will be excluded.

If the mask is a pyarrow BooleanArray, it must not have any null values.

Parameters:

mask (Union[BooleanArray, ndarray[bool, Any], list[bool]]) – A boolean mask to apply to the table.

Return type:

Self

Table.where(expr)#

Return a new table with rows filtered to match an expression.

The expression must be a pyarrow Expression that evaluates to a boolean array.

Parameters:

expr (Expression) – A pyarrow Expression to apply to the table.

Examples:
>>> import quivr as qv
>>> import pyarrow.compute as pc
>>> class MyTable(qv.Table):
...     x = qv.Int64Column()
...     y = qv.Int64Column()
>>> t = MyTable.from_kwargs(x=[1, 2, 3], y=[4, 5, 6])
>>> filtered = t.where(pc.field("x") > 1)
>>> print(filtered.x.to_pylist())
[2, 3]
>>> print(filtered.y.to_pylist())
[5, 6]
Return type:

Self

Table.__getitem__(idx)#

Returns a new Table containing the given row or rows.

Parameters:

idx (Union[int, slice]) – The row index or slice to return.

Return type:

Self

Table.__iter__()#

Iterates over the rows of the Table, returning a new Table containing each row.

Return type:

Iterator[Self]

Data Internals#

Table.chunk_counts()#

Returns the number of discrete memory chunks that make up each of the Table’s underlying arrays. The keys of the resulting dictionary are the column names, and the values are the number of chunks for that column’s data.

Return type:

dict[str, int]

Table.fragmented()#

Returns true if the Table has any fragmented arrays. If this is the case, performance might be improved by calling defragment on it.

Return type:

bool

Table.attributes()#

Return a dictionary of the table’s attributes.

Return type:

dict[str, Any]

Miscellaneous#

Table.__repr__()#

Return repr(self).

Return type:

str

Table.__len__()#

Returns the number of rows in the Table.

Return type:

int

Table.__eq__(other)#

Returns true if the two Tables are equal. They are considered equal if they have the same data in their tables, and identical attributes and attribute values.

Parameters:

other (Any) – The other Table to compare to.

Return type:

bool

Table.set_column(name, data)#

Return a copy of the table with a particular column replaced with new data.

Parameters:
  • name (str) – The name of the column to replace.

  • data – The new data for the column.

Return type:

Self

Utility Functions#

quivr.concatenate(values, defrag=True)#

Concatenate a collection of Tables into a single Table.

All input Tables must be of the same class, and have the same attribute values (if any).

By default, results are compacted to be contiguous in memory, which involves a copy. In a tight loop, this can be very inefficient, so you can set the ‘defrag’ parameter to False to skip this compaction step, and instead call defragment() on the result after the loop is complete.

Parameters:
  • values (Iterator[~AnyTable]) – An iterator of Table instances to concatenate.

  • defrag (bool) – Whether to compact the result to be contiguous in memory. Defaults to True.

Return type:

~AnyTable

quivr.defragment(table)#

Condense the underlying memory which backs the table to make it all contiguous. This makes many operations more efficient after defragmentation is complete.

Parameters:

table (~AnyTable) – The table to defragment.

Return type:

~AnyTable

Returns:

The defragmented table.

Type Information Helpers#

quivr.tables.AnyTable = TypeVar(AnyTable, bound=Table)#

Invariant TypeVar bound to quivr.tables.Table.

quivr.DataSourceType#

Represents the permitted set of types that can be used to initialize a Table instance’s data columns.

Alias of Union[Array, list[Any], Table, Series, ndarray[Any, dtype[Any]], ArrowArrayProvider]

quivr.AttributeValueType#

Represents the permitted set of values to be passed in when setting a Table attribute value.

Alias of Union[int, float, str]

class quivr.ArrowArrayProvider(*args, **kwargs)#

A Protocol which describes objects that support the Arrow custom array extension protocol.

__arrow_array__(type=None)#
Return type:

Array