Interoperability#

quivr is designed to work well with the rest of the Python data ecosystem. Tables have methods and tools for working with Pandas, Arrow, Numpy, and Parquet.

Throughout this guide, the following basic Table definitions will be used:

import quivr as qv

class Positions(qv.Table):
    x = qv.Float64Column()
    y = qv.Float64Column()

class Measurers(qv.Table):
    id = qv.UInt32Column()
    name = qv.StringColumn()

class Measurements(qv.Table):
    position = Positions.as_column()
    measurer = Measurers.as_column()
    dataset = qv.StringAttribute()

Instantiated with these values:

positions = Positions.from_kwargs(
    x=[4.8, 4.9, None, 8.0, 5.3],
    y=[3.1, None, 3.2, 3.2, 3.3],
)
measurers = Measurers.from_kwargs(
    id=[0, 1, 1, None, 0],
    name=["Alice", "Bob", "Bob", None, "Alice"],
)
measurements = Measurements.from_kwargs(
    position=positions,
    measurer=measurers,
    dataset="atlas",
)

Pandas#

Tables can be converted to and from Pandas DataFrames.

To Pandas#

df = measurements.to_dataframe()
print(df)
   position.x  position.y  measurer.id measurer.name
0         4.8         3.1          0.0         Alice
1         4.9         NaN          1.0           Bob
2         NaN         3.2          1.0           Bob
3         8.0         3.2          NaN          None
4         5.3         3.3          0.0         Alice

Column names in the dataframe correspond to the field names in the Table instance. For subtables, column names are dot-delimited, as you can see above.

This is called a “flattened” dataframe. You can choose an unflattened form, if you prefer, which results in Python dictionaries at each “cell” of the pandas DataFrame:

df = measurements.to_dataframe(flatten=False)
print(df)
                position                      measurer
0   {'x': 4.8, 'y': 3.1}  {'id': 0.0, 'name': 'Alice'}
1  {'x': 4.9, 'y': None}    {'id': 1.0, 'name': 'Bob'}
2  {'x': None, 'y': 3.2}    {'id': 1.0, 'name': 'Bob'}
3   {'x': 8.0, 'y': 3.2}    {'id': None, 'name': None}
4   {'x': 5.3, 'y': 3.3}  {'id': 0.0, 'name': 'Alice'}

Types might get converted during to_dataframe. This conversion follows behavior set by the PyArrow library. For example, if an integer column has any null values, the column is converted to 64-bit floating point values and the nulls are converted into NaN values, as happens with measurer.id above.

Table Attributes can be preserved in this conversion to a DataFrame. By default, attributes are stored on the pandas.DataFrame.attrs attribute, which is a dictionary of global attributes for the DataFrame.

df = measurements.to_dataframe()
print(df.attrs)
{'dataset': 'atlas'}

If attributes are set on sub-tables, they’ll be stored in a dot-delimited fashion:

class Detections(qv.Table):
    measure = Measurements.as_column()
    label = qv.IntAttribute()

dets = Detections.from_kwargs(measure=measurements, label=42)

df = dets.to_dataframe()
print(df.attrs)
{'measure.dataset': 'atlas', 'label': 42}

Alternatively, you can represent attributes with an additional column in the DataFrame. The value will be repeated for every row:

df = measurements.to_dataframe(attr_handling="add_columns")
print(df)
   position.x  position.y  measurer.id measurer.name    dataset
0         4.8         3.1          0.0         Alice      atlas
1         4.9         NaN          1.0           Bob      atlas
2         NaN         3.2          1.0           Bob      atlas
3         8.0         3.2          NaN          None      atlas
4         5.3         3.3          0.0         Alice      atlas

From Pandas#

You can read from Pandas using two methods: Table.from_dataframe() and Table.from_flat_dataframe(). Table.from_flat_dataframe() is only needed when loading a Table that contains subtables.

You can specify any attributes in the constructor explicitly when loading from a DataFrame if they are not present:

measurements2 = Measurements.from_flat_dataframe(df, dataset="atlas")

Table.from_dataframe() and Table.from_flat_dataframe() will attempt to infer attribute values if they are not explicitly passed. They will look for columns which match attribute names, and will also check the dataframe’s attrs property, expecting the same dot-delimited serialization described in the previous section.

In addition, Table.from_kwargs() can handle pandas.Series objects as input parameters, so you can do something like this:

measurements3 = Measurements.from_kwargs(
    position=Positions.from_kwargs(
        x=df['position.x'],
        y=df['position.y']
    ),
    measurer=Measurers.from_kwargs(
        id=df['measurer.id'].fillna(0).astype("uint32"),
        name=df['measurer.name'],
    ),
    dataset="atlas",
)

Limitations#

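Pandas Series backed by Numpy dtypes have no dedicated null value for integers, so nulls in quivr/Arrow arrays are represented as NaN (or None in object columns) after conversion. As a result, you’ll see some loss of fidelity when going from quivr into Pandas data structures and back.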

For more information, see the Arrow documentation on the subject.

Arrow#

Arrow is the native backing system for quivr’s tables. All data is stored internally using Arrow arrays and schemas. As a result, converting to and from Arrow is lossless.

To Arrow#

The underlying Arrow data behind a Table can be accessed in several ways:

  1. Columns can be accessed to get individual Arrow Arrays of data.

  2. The entire backing Arrow table can be accessed.

  3. The quivr table can be reshaped and presented as an Arrow StructArray.

To Arrow Arrays#

For a column named foo on a table instance named tab, tab.foo will get the column’s data and present it as an Arrow array directly:

print(type(positions.x))
# pyarrow.lib.DoubleArray

The mapping of types is comprehensively documented in the API reference for quivr Columns.

To Arrow Tables#

The underlying pyarrow.Table instance can always be accessed on a quivr Table instance through the Table.table instance attribute:

print(type(positions.table))
# pyarrow.lib.Table

The pyarrow.Table is a useful structure that can then be used for low-level operations. For example, to make a list of pyarrow.RecordBatch objects which describe the table’s data in batches suitable for serializing and even communicating over a network, you can use Table.table.to_batches():

# send() is a placeholder for your own transport function
for batch in positions.table.to_batches():
    send(batch)

The pyarrow.Table holds all of the data associated with the quivr Table, including attributes. Attributes are stored in schema metadata, which is a dictionary with bytes keys and values. Attributes are encoded with their name for the key (with a dot-delimited prefix for attributes on sub-tables) and with a byte-encoding of their value.

print(measurements.table.schema.metadata)
# {b'dataset': b'atlas'}

To Arrow StructArrays#

Tables have a Table.to_structarray() method which can be used to construct a pyarrow.StructArray. Each element of this array is a struct (a pyarrow.StructScalar) whose fields match the schema of the Table:

print(measurements.to_structarray()[0])
# [('position', {'x': 4.8, 'y': 3.1}), ('measurer', {'id': 0, 'name': 'Alice'})]

From Arrow#

You can bring Arrow data into quivr to populate Table instances.

First, you can use Table.from_pyarrow() to populate a quivr Table’s data with the contents of a PyArrow Table. When you do so, the pyarrow.Table’s schema metadata is preserved and is used to set Table Attribute values for any attributes defined on the quivr Table class. If an attribute is not present in the pyarrow.Table’s schema metadata, you can provide it as a keyword argument. If it is present in both, the keyword argument takes precedence.

The source pyarrow.Table should have a schema which matches the quivr.Table’s definition. “Matching” is currently only loosely defined, but the schema “matches” if:

  • the source schema has a matching field for all non-nullable fields, and

  • all fields present in the source schema can be cast to the corresponding types in the destination table.

Notably, it is not necessary that the source pyarrow Table have exactly the same fields as the destination quivr Table. Extra fields will be ignored. This means that Tables with a subset of columns defined can be used to project a view into a small number of the columns of a larger PyArrow Table.

In addition to Table.from_pyarrow(), you can pass in pyarrow.Arrays when constructing a Table using Table.from_kwargs(). Any of the keyword arguments’ values can be pyarrow Arrays.

Numpy#

To Numpy#

The Arrays that are retrieved when accessing a quivr Table instance’s columns can be converted to Numpy arrays using the pyarrow.Array.to_numpy() method. See Arrow documentation on Numpy integration for more information.

From Numpy#

Numpy arrays can be passed in to the Table.from_kwargs() constructor.

Parquet#

See Parquet.