Serialization and Deserialization

quivr Tables can be loaded from Parquet files, Feather files, and CSVs.

Parquet

Renaming Columns

When you load a Parquet file that you did not create, you may have no control over its schema. If its column names are not valid Python identifiers, or if you would simply prefer different names, you can pass a column name mapping to the deserialization function, Table.from_parquet(), through its column_name_map argument.

For example, here is the 2022 schema for the famous New York City taxi data:

VendorID: int64
tpep_pickup_datetime: timestamp[us]
tpep_dropoff_datetime: timestamp[us]
passenger_count: double
trip_distance: double
RatecodeID: double
store_and_fwd_flag: string
PULocationID: int64
DOLocationID: int64
payment_type: int64
fare_amount: double
extra: double
mta_tax: double
tip_amount: double
tolls_amount: double
improvement_surcharge: double
total_amount: double
congestion_surcharge: double
airport_fee: double

Some of these are fine, but others are a little disagreeable. Without a column name map, we’d need to use matching attribute names in our Table class:

from quivr import *

class TaxiData(Table):
    VendorID = Int64Column()
    tpep_pickup_datetime = TimestampColumn(unit="us")
    tpep_dropoff_datetime = TimestampColumn(unit="us")
    passenger_count = Float64Column()
    trip_distance = Float64Column()
    RatecodeID = Float64Column()
    ...


data = TaxiData.from_parquet("yellow__tripdata_2023-01.parquet")
print(data)
# TaxiData(size=3066766)

But we can use a column name map to make the names more pleasant. We only need to supply the names that are different from the attribute names:

from quivr import *

class TaxiData(Table):
    vendor_id = Int64Column()
    pickup = TimestampColumn(unit="us")
    dropoff = TimestampColumn(unit="us")
    passenger_count = Float64Column()
    trip_distance = Float64Column()
    rate_code = Float64Column()
    ...

column_name_mapping = {
    "VendorID": "vendor_id",
    "tpep_pickup_datetime": "pickup",
    "tpep_dropoff_datetime": "dropoff",
    "RatecodeID": "rate_code",
}

data = TaxiData.from_parquet(
    "yellow__tripdata_2023-01.parquet",
    column_name_map=column_name_mapping,
)
print(data)
# TaxiData(size=3066766)

print(data.pickup[:2])
# [
#    2023-01-01 00:32:10.000000,
#    2023-01-01 00:55:08.000000
# ]
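The renaming rule can be pictured as a dictionary lookup keyed by the file's column names, where any name missing from the map passes through unchanged. The sketch below is plain Python for illustration only; resolve_names is a hypothetical helper, not part of quivr, and it does not claim to reflect how quivr implements the mapping internally.

```python
# Illustrative only: a plain-Python picture of the renaming rule,
# not quivr's actual implementation.
def resolve_names(file_columns, column_name_map):
    """Return the name each file column would take after renaming.

    Columns that are missing from the map keep their original name.
    """
    return [column_name_map.get(name, name) for name in file_columns]


file_columns = ["VendorID", "tpep_pickup_datetime", "passenger_count"]
column_name_map = {
    "VendorID": "vendor_id",
    "tpep_pickup_datetime": "pickup",
}

resolved = resolve_names(file_columns, column_name_map)
print(resolved)
# ['vendor_id', 'pickup', 'passenger_count']
```

This is why passenger_count and trip_distance need no entry in the map: their file names already match the attribute names.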

To take it one step further, we could encapsulate this in a method for our TaxiData class:

from quivr import *

class TaxiData(Table):
    vendor_id = Int64Column()
    pickup = TimestampColumn(unit="us")
    dropoff = TimestampColumn(unit="us")
    passenger_count = Float64Column()
    trip_distance = Float64Column()
    rate_code = Float64Column()

    @classmethod
    def from_parquet(cls, path):
        column_name_mapping = {
            "VendorID": "vendor_id",
            "tpep_pickup_datetime": "pickup",
            "tpep_dropoff_datetime": "dropoff",
            "RatecodeID": "rate_code",
        }
        return super().from_parquet(
            path,
            column_name_map=column_name_mapping,
        )

taxi_data = TaxiData.from_parquet("./yellow__tripdata_2023-01.parquet")

Changing Column Types

Sometimes you might want to change the type of a column. For example, in the New York City taxi data above, the passenger_count column is stored as a double, even though passenger counts are whole numbers and would be better represented as integers.

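To do this, declare each column with the type you want in your Table class. Here, vendor_id, passenger_count, and rate_code are declared as UInt8Column instead of their stored types: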

from quivr import *

class TaxiData(Table):
    vendor_id = UInt8Column()
    pickup = TimestampColumn(unit="us")
    dropoff = TimestampColumn(unit="us")
    passenger_count = UInt8Column()
    trip_distance = Float64Column()
    rate_code = UInt8Column()

    @classmethod
    def from_parquet(cls, path):
        column_name_mapping = {
            "VendorID": "vendor_id",
            "tpep_pickup_datetime": "pickup",
            "tpep_dropoff_datetime": "dropoff",
            "RatecodeID": "rate_code",
        }
        return super().from_parquet(
            path,
            column_name_map=column_name_mapping,
        )

taxi_data = TaxiData.from_parquet("./yellow__tripdata_2023-01.parquet")