Tables#
The Table Class#
- class quivr.Table(table, **kwargs)#
Table is the primary data structure in quivr.
Tables are used to represent tabular data with a fixed schema. The schema is defined by subclassing Table, providing Column objects as class attributes. The Table class will then generate a pyarrow schema from those columns.
Table instances can then be created from data, either by passing in a pyarrow Table, or by passing in data in a variety of other formats. The data will be validated against the schema, and converted to a pyarrow Table.
Table instances are immutable, but can be sliced, filtered, sorted, or otherwise manipulated, resulting in new Table instances. In particular, see the Table.set_column() method, which returns a copy of the Table with a single column replaced.
- Variables:
schema (pyarrow.Schema) – The pyarrow schema for this table.
table (pyarrow.Table) – The underlying pyarrow.Table for this Table instance.
Constructors#
Warning
Table instances should only be instantiated using the
constructors. The Table.__init__ method only accepts a “raw”
pyarrow.Table, does not validate its arguments (not even to
check that the table has appropriate columns), and does not allow
setting Table Attributes.
See also: Serialization/Deserialization.
- classmethod Table.from_kwargs(validate=True, **kwargs)#
Create a Table instance from keyword arguments.
Each keyword argument corresponds to a column in the Table.
The keys should correspond to column names or attribute names.
For columns, the values should be arrays, lists, or pyarrow Arrays.
For attributes, the values should be the appropriate type for that attribute.
- Parameters:
validate (bool) – If True (the default), run column validators on all input data.
**kwargs (Union[AttributeValueType, DataSourceType]) – The data to populate the table with.
- Return type:
Self
- classmethod Table.from_pyarrow(table, validate=True, permit_nulls=False, **kwargs)#
Create a new table from a pyarrow Table.
This is a convenience method which can be used to create a Table from a pyarrow Table. It can also accept keyword-style arguments to set attributes on the table.
When serializing to pyarrow, the table’s schema metadata encodes the value of attributes. This method will use that metadata to set the attributes on the table if it is available. If any attributes are provided as keyword arguments, they override any values in the metadata.
- Parameters:
table (Table) – The pyarrow Table to create the table from.
validate (bool) – Whether to validate the table against the schema, and run any column validators.
permit_nulls (bool) – Whether to permit null values in the table. If True, nulls will be permitted, even in non-nullable fields. This is used when a Table is used as a nullable subtable.
**kwargs (AttributeValueType) – Keyword arguments to set attributes on the table.
- Return type:
Self
- Returns:
A new Table instance.
- classmethod Table.from_dataframe(df, validate=True, **kwargs)#
Load a DataFrame into the Table.
If the DataFrame is missing any of the Table’s columns, an error is raised. If the DataFrame has extra columns, they are ignored.
This function cannot load “flattened” dataframes. This only matters for nested Tables which contain other Table definitions as columns. For that use case, either load an unflattened DataFrame, or use from_flat_dataframe.
- Parameters:
df (DataFrame) – A pandas DataFrame containing the data to load.
validate (bool) – Whether to validate the data after loading.
**kwargs (AttributeValueType) – Additional keyword arguments for any Table attributes.
- Return type:
Self
- classmethod Table.from_flat_dataframe(df, validate=True, **kwargs)#
Load a flattened DataFrame into the Table.
Caution
Known bug: Doesn’t correctly interpret fixed-length lists.
- Parameters:
df (DataFrame) – A pandas DataFrame containing the data to load.
validate (bool) – Whether to validate the data after loading.
**kwargs (AttributeValueType) – Additional keyword arguments for any Table attributes.
- Return type:
Self
- classmethod Table.empty(**kwargs)#
Create an empty instance of the table.
- Parameters:
**kwargs – Additional keyword arguments to set the Table’s attributes.
- Return type:
Self
- Table.with_table(table)#
- Return type:
Self
Serialization/Deserialization#
These methods handle serializing the Table to and from other formats, most often for writing to or reading from disk.
- classmethod Table.from_parquet(path, memory_map=False, pq_buffer_size=0, filters=None, column_name_map=None, validate=True, **kwargs)#
Read a table from a Parquet file.
- Parameters:
path (str) – The path to the Parquet file.
memory_map (bool) – If True, memory-map the file, otherwise read it into memory.
pq_buffer_size (int) – If positive, perform read buffering when deserializing individual column chunks. Otherwise, IO calls are unbuffered.
filters (Optional[Expression]) – An optional filter predicate to apply to the data. Rows which do not match the predicate will be removed from scanned data. For more information, see the PyArrow documentation on pyarrow.parquet.read_table and its filter parameter.
column_name_map (Optional[dict[str, str]]) – An optional dictionary mapping column names in the Parquet file to column names in the resulting Table. This is useful if the Parquet file contains column names that are not valid Python identifiers, or if you want to rename columns for any other reason.
validate (bool) – Whether to run column validation on the resulting Table.
**kwargs – Additional keyword arguments to pass to Self’s __init__ method.
- Return type:
Self
- classmethod Table.from_feather(path, validate=True, **kwargs)#
Read a table from a Feather file.
- classmethod Table.from_csv(input_file, validate=True, **kwargs)#
Read a table from a CSV file.
- Table.to_parquet(path, **kwargs)#
Write the table to a Parquet file.
- Parameters:
path (str) – The path to write the Parquet file to.
kwargs – Additional arguments to pass to pyarrow.parquet.write_table.
- Table.to_feather(path, **kwargs)#
Write the table to a Feather file.
- Parameters:
path (str) – The path to write the Feather file to.
kwargs – Additional arguments to pass to pyarrow.feather.write_feather.
- Table.to_csv(path, attribute_columns=True)#
Write the table to a CSV file. Any nested structure is flattened.
Validation#
- Table.validate()#
Validate the table against the schema, raising an exception if invalid.
Interoperability#
- classmethod Table.as_column(nullable=True, metadata=None)#
Embed the Table as a column in another Table.
This method is the primary way to achieve composition of Tables with quivr.
- Parameters:
nullable (bool) – Whether the column can contain nulls. Note that this refers to whether an entire row - all columns - can be null, not whether a single value in one column of this table can be null. That is controlled entirely by this Table class’s columns.
metadata (Optional[Dict[Union[bytes, str], Union[bytes, str]]]) – Metadata to attach to the column.
- Return type:
SubTableColumn[Self]
- Table.to_structarray()#
Returns self as a StructArray.
This only works if self is not fragmented. Call table = defragment(table) if table.fragmented() is True.
- Raises:
TableFragmentedError – if the table is fragmented.
- Return type:
StructArray
- Table.flattened_table()#
Completely flatten the Table’s underlying Arrow table, taking into account any nested structure, and return the data table itself.
- Return type:
pyarrow.Table
- Table.to_dataframe(flatten=True)#
Returns self as a pandas DataFrame.
If flatten is true, then any nested hierarchy is flattened: if the Table’s schema contains a struct named “foo” with fields “a”, “b”, and “c”, then the resulting DataFrame will include columns “foo.a”, “foo.b”, “foo.c”. This is done fully for any deeply nested structure, for example “foo.bar.baz.c”.
If flatten is false, then that struct will be in a single “foo” column, and the values of the column will be dictionaries representing the struct values.
Filtering, Selection, Sorting#
- Table.select(column_name, value)#
Select from the table by exact match, returning a new Table which only contains rows for which the value in column_name equals value.
- Table.sort_by(by)#
Sorts the Table by the given column name (or multiple columns). This operation requires a copy, and returns a new Table using the copied data.
by should be a column name to sort by, or a list of (column, order) tuples, where order can be “ascending” or “descending”.
- Table.take(row_indices)#
Return a new Table with only the rows at the given indices.
- Parameters:
row_indices (Union[list[int], IntegerArray]) – The indices of the rows to return.
- Return type:
Self
- Table.apply_mask(mask)#
Return a new table with rows filtered to match a boolean mask.
The mask must have the same length as the table. At each index, if the mask’s value is True, the row will be included in the new table; if False, it will be excluded.
If the mask is a pyarrow BooleanArray, it must not have any null values.
- Table.where(expr)#
Return a new table with rows filtered to match an expression.
The expression must be a pyarrow Expression that evaluates to a boolean array.
- Parameters:
expr (
Expression) – A pyarrow Expression to apply to the table.
- Examples:
>>> import quivr as qv
>>> import pyarrow.compute as pc
>>> class MyTable(qv.Table):
...     x = qv.Int64Column()
...     y = qv.Int64Column()
>>> t = MyTable.from_kwargs(x=[1, 2, 3], y=[4, 5, 6])
>>> filtered = t.where(pc.field("x") > 1)
>>> print(filtered.x.to_pylist())
[2, 3]
>>> print(filtered.y.to_pylist())
[5, 6]
- Return type:
Self
- Table.__getitem__(idx)#
Returns a new Table containing the given row or rows.
Data Internals#
- Table.chunk_counts()#
Returns the number of discrete memory chunks that make up each of the Table’s underlying arrays. The keys of the resulting dictionary are the column names, and the values are the number of chunks for that column’s data.
- Table.fragmented()#
Returns true if the Table has any fragmented arrays. If this is the case, performance might be improved by calling defragment on it.
- Return type:
bool
Miscellaneous#
- Table.__eq__(other)#
Returns true if the two Tables are equal. They are considered equal if they have the same data in their tables, and identical attributes and attribute values.
Utility Functions#
- quivr.concatenate(values, defrag=True)#
Concatenate a collection of Tables into a single Table.
All input Tables must be of the same class, and have the same attribute values (if any).
By default, results are compacted to be contiguous in memory, which involves a copy. In a tight loop, this can be very inefficient, so you can set the ‘defrag’ parameter to False to skip this compaction step, and instead call defragment() on the result after the loop is complete.
Type Information Helpers#
- quivr.tables.AnyTable = TypeVar(AnyTable, bound=Table)#
Invariant TypeVar bound to quivr.tables.Table.
- quivr.DataSourceType#
Represents the permitted set of types that can be used to initialize a Table instance’s data columns. Alias of Union[Array, list[Any], Table, Series, ndarray[Any, dtype[Any]], ArrowArrayProvider].