Basic Usage#

Using quivr starts with writing some classes which describe the data you’re working with. You write a Table definition using quivr’s Column and Attribute types to describe the data.

Here’s an example, a basic table of X-Y-Z positions:

import quivr as qv

class Positions(qv.Table):
    x = qv.Float64Column()
    y = qv.Float64Column()
    z = qv.Float64Column()

This describes a table with three columns, each holding 64-bit floating point values.

We can construct an instance of this with Python values using Table.from_kwargs:

positions = Positions.from_kwargs(
    x=[1, 2, 3, 4],
    y=[5, 6, 7, 8],
    z=[9, 10, 11, 12],
)
print(positions)
# Positions(size=4)

The positions instance in this case holds three Arrow Arrays of data, each of length 4. You can access them by their field names as defined on the Positions class:

print(positions.x)
# [
#   1,
#   2,
#   3,
#   4
# ]

Arrow Arrays have an extensive API which supports rich computation. Here are a few examples:

# Convert to a Numpy array and use Numpy methods:
print(positions.x.to_numpy().min())
# 1.0

# Use pyarrow.compute:
import pyarrow.compute as pc
print(pc.min(positions.y))
# 5.0

# Multiply one array by another, element-wise
print(pc.multiply(positions.x, positions.y))
# [
#   5,
#   12,
#   21,
#   32
# ]

Many other constructors are available as well to make an instance from pandas DataFrames, Apache Parquet files, and other input sources.

Attaching methods to encapsulate logic#

If you’re familiar with Numpy or Pandas, you have probably written lots of functions for encapsulating logic. For example, let’s say you want to compute distances from these position values - that is, you want to compute sqrt(x**2 + y**2 + z**2).

With a Pandas DataFrame, you might have written it this way:

def distances(position_df: pd.DataFrame) -> pd.Series:
    return (
        position_df["x"] * position_df["x"]
        + position_df["y"] * position_df["y"]
        + position_df["z"] * position_df["z"]
    ).sqrt()

This code works, but it’s tricky. The caller needs to have a dataframe with the correct columns, correctly typed, and tehre is nothing in the function signature that explains this. It’s easy to pass in a malformed DataFrame.

In quivr, this sort of logic is instead encapsulated with a method defined on the Positionss class directly:

import pyarrow.compute as pc

class Positions(qv.Table):
    x = qv.Float64Column()
    y = qv.Float64Column()
    z = qv.Float64Column()

    def distances(self) -> pa.Array:
        xs = pc.multiply(self.x, self.x)
        ys = pc.multiply(self.y, self.y)
        zs = pc.multiply(self.z, self.z)
        sum = pc.add(xs, ys)
        sum = pc.add(sum, zs)
        return pc.sqrt(sum)

That method uses the PyArrow Compute API, which is efficient and precise, but rather verbose. You might prefer to work through Numpy:

import numpy as np

class Positions(qv.Table):
    x = qv.Float64Column()
    y = qv.Float64Column()
    z = qv.Float64Column()

    def distances(self) -> np.array:
        x = self.x.to_numpy()
        y = self.y.to_numpy()
        z = self.z.to_numpy()

        return np.sqrt(x*x + y*y + z*z)

The pyarrow.Array.to_numpy() method is very efficient. In typical cases, it involves no computation or memory copy, so you can feel comfortable using to_numpy liberally. There are a few caveats to be aware of, though:

Adding more columns#

Tables can have more columns with different types. For example, you might want to add a measured_by column to the Positions table to represent the entity that measured the position:

class Positions(qv.Table):
    x = qv.Float64Column()
    y = qv.Float64Column()
    z = qv.Float64Column()
    measured_by = qv.StringColumn()

There are many types of columns. See Columns in the API reference for the full list.

Composition#

A central feature of quivr is Table composition. The idea is that Tables can be composed together into larger entities with sub-tables. This amplifies the benefits of attaching methods to Table classes. These two tools combine to form a powerful language for data-oriented programming.

To use a table compositionally, you use the Table.as_column() class method. For example, let’s remove the measuered_by column from Positions, and instead represent that concept with a richer Measurers Table:

class Measurers(qv.Table):
    id = qv.UInt32Column()
    name = qv.StringColumn()

We can now make a wrapping Table called Measurements which will store both Positions and their Measurers:

class Measurements(qv.Table):
    position = Positions.as_column()
    measurer = Measurers.as_column()
    measured_at = qv.TimestampColumn(unit="s")

This Measurements Table now composes the Positions and Measurers tables into one structure in memory. The underlying data layout is still tabular, and the sub-tables shared common indexes. The following table may help visualize this:

position.x	position.y	position.z	measurer.id	measurer.name	measured_at
4.1	5.0	6.2	0	Enrico Fermi	2018-09-14T16:32:01Z
4.3	4.8	7.1	0	Enrico Fermi	2018-09-14T16:32:08Z
4.2	3.7	7.2	1	Albert Einstein	2018-09-14T16:33:21Z
4.0	6.2	7.3	0	Enrico Fermi	2018-09-14T16:35:22Z
4.5	4.4	7.3	1	Albert Einstein	2018-09-14T16:36:38Z

You could construct a table with that data like this:

measurements = Measurements.from_kwargs(
    position=Positions.from_kwargs(
        x=[4.1, 4.3, 4.2, 4.0, 4.5],
        y=[5.0, 4.8, 3.7, 6.2, 4.4],
        z=[6.2, 7.1, 7.2, 7.3, 7.3],
    ),
    measurer=Measurers.from_kwargs(
        id=[0, 0, 1, 0, 1],
        name=[
            "Enrico Fermi",
            "Enrico Fermi",
            "Albert Einstein",
            "Enrico Fermi",
            "Albert Einstein"
        ]
    ),
    measured_at=[
        datetime.datetime(2018, 9, 14, 16, 32, 1),
        datetime.datetime(2018, 9, 14, 16, 32, 8),
        datetime.datetime(2018, 9, 14, 16, 33, 21),
        datetime.datetime(2018, 9, 14, 16, 35, 22),
        datetime.datetime(2018, 9, 14, 16, 36, 38),
    ]
)

You can access the subtables with normal Python dot-style notation, You’ll get an instance of the Table class you defined, as you might expect, which means you can call any attached methods:

print(measurements.position)
# Positions(size=5)

print(measurements.position.distances())
# [ 8.95823643  9.58853482  9.11975877 10.37930634  9.63846461]

And of course, the wrapping class can have methods. This can let you build sophisticated computations while managing complexity:

class Measurements(qv.Table):
    position = Positions.as_column()
    measurer = Measurers.as_column()
    measured_at = qv.TimestampColumn(unit="s")

    def max_distance_by_measurer(self):
        maxes = {}
        unique_ids = self.measurer.id.unique().to_numpy()
        for id in unique_ids:
            # Mask with 'true' for every row where measurer.id = id
            mask = pc.equal(self.measurer.id, id)

            # This makes a view into the data using the given mask
            # as a filter:
            positions = self.position.apply_mask(mask)
            maxes[id] = positions.distances().max()

        return maxes

There are a lot more features to quivr that you can use to manage your data, but you already know the most important ones. To summarize:

Define a Table class using Columns which describes your data.
Attach methods to the Table class to describe your computations
Use Table composition to manage complexity

What to look at next#

Some of the more advanced features you might be interested in include:

Attributes for attaching scalar (non-tabular) data to Tables
Linkages to represent relationships between tables
Validators to validate that data matches conditions
Serialization for working with data in Parquet and other formats
Handling Nulls to be safe when dealing with missing values