Unit-aware data frames with composite, dimensional and ixset-typed.

In this post we’re going to see how we can stitch together a few libraries to make a unit-aware queryable data frame from a CSV using extensible records. By the end of this text, we’ll be able to parse a CSV of data from the periodic table, complete with the correct units, and able to quickly ask questions about our data set using the generated indices.

The libraries we’ll be using are:

- composite-base for extensible records.
- dimensional to give each field its correct unit.
- ixset-typed to generate indices for our data set to query.
- cassava to parse and print the CSV.
- vinyl for supplying some extra functions for extensible records.

In three stages we will bring this together

- 1: Defining a suitable target data type for storing the data frame.
- 2: Parsing and printing the CSV by
- 2.1: Proving how to decode and encode each field via its correct unit.
- 2.2: Proving how to decode and encode each record as a whole.

- 3: Generating the indices and querying our data for facts.

Here is the dataset we’ll be using. Let’s first take a look at a snippet of the data. We won’t need all the fields, but the CSV parser will safely ignore fields we don’t use. The ones we will be interested in are:

```
AtomicNumber,AtomicMass,Density,MeltingPoint,BoilingPoint,SpecificHeat
1,1.007,8.99e-5,14.175,20.28,14.304
2,4.002,1.79e-4,,4.22,5.193
3,6.941,0.534,453.85,1615.0,3.582
4,9.012,1.85,1560.15,2742.0,1.825
5,10.811,2.34,2573.15,4200.0,1.026
6,12.011,2.27,3948.15,4300.0,0.709
7,14.007,1.25e-3,63.29,77.36,1.04
8,15.999,1.43e-3,50.5,90.2,0.918
...
```

Each of these fields has a label and a unit. It also may or may not be there (in the case of Helium, there is no melting point given).

Using extensible records allows us to declare these fields independently, and then cobble together a record from them. First some imports

```
import Composite.Record
import Composite.TH
import Data.IxSet.Typed as Ix
import Numeric.Units.Dimensional.Prelude as N
import Numeric.Units.Dimensional.NonSI as N
```

A field declaration in composite looks like:

`type FAtomicMass a = "AtomicMass" :-> Mass a`

We can do this for all of our CSV fields, also generate a few lenses at the same time.

```
withLensesAndProxies [d|
type FAtomicNumber = "AtomicNumber" :-> Int
type FAtomicMass a = "AtomicMass" :-> Mass a
type FDensity a = "Density" :-> Density a
type FMeltingPoint a = "MeltingPoint" :-> ThermodynamicTemperature a
type FBoilingPoint a = "BoilingPoint" :-> ThermodynamicTemperature a
type FSpecificHeat a = "SpecificHeat" :-> SpecificHeatCapacity a |]
```

Here in particular some of our fields are parameterised by the number type they take. We’ll ultimately work with Double, but stay polymorphic where we can.

A record in composite is expressed as a type level list of the fields we want, a plain record of these fields would look like this:

```
-- the order won't matter for our parser
type AtomicFields a = '[FAtomicNumber, FAtomicMass a, FMeltingPoint a, FBoilingPoint a, FSpecificHeat a, FDensity a]
type AtomicData a = Record (AtomicFields a)
```

But recall we still want to express the fact that some of these fields may not exist. Composite has a mechanism for doing that by raising every field over the same Functor, so we can lift our record over the Maybe functor, which is expressed like so:

`type AtomicData a = Rec Maybe (AtomicFields a)`

So a value of type `AtomicData a`

represents one row of
our fields, where each field may or may not contain a value. As an
example, here is a row written out manually. In each field of the row,
we multiply the value by its appropriate unit, and this will type
check:

```
helium :: AtomicData Double
= Just 2
helium :^: Just (4.002 N.*~ dalton)
:^: Nothing
:^: Just (4.22 N.*~ kelvin)
:^: Just (5.193 N.*~ ((meter N.* meter) N./ (second N.* second N.* kelvin)))
:^: Just (1.79e-4 N.*~ (kilo gram N./ litre))
:^: RNil
```

With our lenses, we can perform transformations using
`rlens'`

from composite. For example, to create a universe
where everything is twice as dense:

```
superdense :: AtomicData Double -> AtomicData Double
= over (rlens' (fAtomicMass_ @Double) . _Just) (N._2 N.*) superdense
```

We want to decode a whole bunch of rows, so we want something like a
list of these things, but instead what we’re going to use is an IxSet,
which will allow us to define our indices later. An
`IxSet ixs xs`

takes two type parameters, one for the type of
indices that we can use to query the set, and one for the concrete data
type used to represent each row. Normally with nominal data types we’d
have to be quite explicit with the indicies, but we’re fortunate in that
we can use the same type level list we used to define the record as for
the indices.

`type IxAtomicData a = IxSet (AtomicFields a) (AtomicData a)`

We’ll discuss generating the indices in section 3, but for now this type gives us a base to work around.

Encoding and decoding will be done with the cassava library. Each
direction requires two typeclasses, `FromField/ToField`

to
supply how to codec a field, and
`FromNamedRecord/ToNamedRecord`

to supply how to codec the
record as a whole.

We need to somehow derive `FromField`

and
`ToField`

for each of our field types. Each contains a
`Quantity s a`

, but what we don’t want to do is assume what
the unit will be for each Quantity. For example if we derived a
`FromField`

and `ToField`

for
`ThermodynamicTemperature a`

assuming it was always measured
in kelvin, then we’d be unable to consume any data from other data sets
where that dimension was measured in celcius, as the instance would
already be taken. What we need to do is be able to switch the derivation
for each field based on which unit codec we want to apply, and we can do
that with some DerivingVia tricks and typeclass metaprogramming.

Let’s make a newtype with a type parameter that will allow us to switch on the unit.

`newtype Codec k a = Codec {unCodec :: a}`

Here k is a phantom type that will allow us to give some information to the deriving mechanism. We’ll make some token types to represent this:

```
data Kelvin
data Dalton
data KGPerLitre
data SHCUnit
```

Now instead of implementing `FromField (Quantity b a)`

directly, we can implement
`FromField (Codec k (Quantity b a))`

, and use k as one of the
above to switch the implementation.

In order to decode a Quantity, we need to multiply the value we
receieve by its unit using the `*~`

operator, and to encode a
Quantity, we need to divide it by its unit to bring out the undecorated
number value, using the `/~`

operator. So we need some way of
assigning the unit for each of our type-level tokens. This type family
and type class combination will get us there.

```
type family UnitFor k a :: Type
class UnitCodec k a where
unitCodec :: UnitFor k a
```

Now we can just assign reducts for each of our tokens to get at the unit.

```
type instance UnitFor Kelvin a = Unit 'Metric DThermodynamicTemperature a
type instance UnitFor Dalton a = Unit 'Metric DMass a
type instance UnitFor SHCUnit a = Unit 'NonMetric DSpecificHeatCapacity a
type instance UnitFor KGPerLitre a = Unit 'NonMetric DDensity a
instance Num a => UnitCodec Kelvin a where
= kelvin
unitCodec
instance Floating a => UnitCodec Dalton a where
= dalton
unitCodec
instance Floating a => UnitCodec KGPerLitre a where
= kilo gram N./ litre
unitCodec
instance Floating a => UnitCodec SHCUnit a where
= (meter N.* meter) N./ (second N.* second N.* kelvin) unitCodec
```

Now we can get to and from any Quantity via our Codec.

```
instance (FromField a, Num a, UnitCodec k a, UnitFor k a ~ Unit c b a) => FromField (Codec k (Quantity b a)) where
= Codec . (N.*~ unitCodec @k @a) <$> parseField @a m
parseField m
instance (ToField a, Fractional a, UnitCodec k a, UnitFor k a ~ Unit c b a) => ToField (Codec k (Quantity b a)) where
= toField @a . (N./~ unitCodec @k @a) . unCodec toField
```

Now we can derive all our instances in one line each.

```
deriving via (Codec Dalton (Mass a)) instance (FromField a, Floating a) => FromField (FAtomicMass a)
deriving via (Codec Dalton (Mass a)) instance (Floating a, ToField a) => ToField (FAtomicMass a)
deriving via (Codec Kelvin (ThermodynamicTemperature a)) instance (FromField a, Floating a) => FromField (FMeltingPoint a)
deriving via (Codec Kelvin (ThermodynamicTemperature a)) instance (Fractional a, ToField a) => ToField (FMeltingPoint a)
deriving via (Codec Kelvin (ThermodynamicTemperature a)) instance (FromField a, Floating a) => FromField (FBoilingPoint a)
deriving via (Codec Kelvin (ThermodynamicTemperature a)) instance (Fractional a, ToField a) => ToField (FBoilingPoint a)
deriving via (Codec KGPerLiter (Density a)) instance (FromField a, Floating a) => FromField (FDensity a)
deriving via (Codec KGPerLiter (Density a)) instance (Floating a, ToField a) => ToField (FDensity a)
deriving via (Codec SHCUnit (SpecificHeatCapacity a)) instance (FromField a, Floating a) => FromField (FSpecificHeat a)
deriving via (Codec SHCUnit (SpecificHeatCapacity a)) instance (Floating a, ToField a) => ToField (FSpecificHeat a)
```

Implementing the record encoder and decoder as a whole is just a case
of inducting over the elements of the record, assuming we have a
`FromField`

for each of them. In order to give us some
breathing room, we can derive via a newtype as before, which I have for
no reason called `TF`

.

```
newtype TF f xs = TF {unTF :: Rec f xs}
instance FromNamedRecord (TF f '[]) where
= pure $ TF RNil
parseNamedRecord m
instance ToNamedRecord (TF f '[]) where
= mempty toNamedRecord m
```

```
instance (Functor f, KnownSymbol s, FromField (f (s :-> x)), FromNamedRecord (TF f xs)) => FromNamedRecord (TF f ((s :-> x) ': xs)) where
= do
parseNamedRecord m <- m .: T.encodeUtf8 (valName @s undefined)
x TF f <- parseNamedRecord @(TF f xs) m
pure $ TF $ x :& f
instance (Functor f, KnownSymbol s, ToField (f (s :-> x)), ToNamedRecord (TF f xs)) => ToNamedRecord (TF f ((s :-> x) ': xs)) where
TF (x :& xs)) = HM.singleton (T.encodeUtf8 (valName @s undefined)) (toField x) <> toNamedRecord (TF xs) toNamedRecord (
```

```
deriving via (TF Maybe (AtomicFields Double)) instance ToNamedRecord (AtomicData Double)
deriving via (TF Maybe (AtomicFields Double)) instance FromNamedRecord (AtomicData Double)
```

We can make a couple of convenience functions to decode to IxSets directly instead of Vectors as cassava does.

```
decodeByName :: (Ix.Indexable ixs xs, Data.Csv.FromNamedRecord xs) => ByteString -> Either String (Data.Csv.Header, Ix.IxSet ixs xs)
= fmap (fmap (Ix.fromList . V.toList)) . Data.Csv.decodeByName
decodeByName
encodeByName :: (Ix.Indexable ixs xs, Data.Csv.ToNamedRecord xs) => Data.Csv.Header -> Ix.IxSet ixs xs -> ByteString
= Data.Csv.encodeByName h . Ix.toList encodeByName h
```

At this point, we can test that our decoder and encoder work as advertised.

```
main :: IO ()
= do
main <- BS.readFile "chem.csv"
x let y = decodeByName x :: Either String (Header, IxAtomicData Double)
print y
case y of
Left x -> error x
Right (x, xs) ->
"out.csv" $ encodeByName ["AtomicNumber", "AtomicMass", "Density", "MeltingPoint", "BoilingPoint", "SpecificHeat"] xs BS.writeFile
```

Notice that the Show instances will print out in SI units, using kilograms instead of daltons for the atomic mass.

```
...
Just AtomicMass :-> 1.6721639813999998e-27 kg
...
```

However our values in “out.csv” will be correctly printed in daltons.

We’ve needed to supply the header here manually, but we already have our type level list, so can generate that as well with some typeclass programming.

```
class ToHeader x where
extractRecHeader :: Proxy x -> Vector Name
instance ToHeader (Rec f '[]) where
= Data.Vector.fromList []
extractRecHeader _
instance (KnownSymbol s, ToHeader (Rec f xs)) => ToHeader (Rec f (s :-> x ': xs)) where
Proxy = pure (encodeUtf8 $ valName @s undefined) <> extractRecHeader (Proxy @(Rec f xs)) extractRecHeader
```

`"out.csv" $ encodeByName (extractRecHeader (Proxy @(AtomicData Double))) xs BS.writeFile `

In order to use the querying functions from IxSet, we need to supply
an `Indexable`

instance for our data. We use the varyadic
function ixList to generate one index for each list. Since the types are
inferred by the type level list, we can actually use the same function
for each slot, using the lens `rgetC`

from
`Data.Vinyl.Lens`

. Each indexing function returns a list, for
indexing on fields like “tags”, where there are multiple fields. We only
have at most one field, so we can just transform to a list. The instance
actually looks like this:

```
instance Ord a => Ix.Indexable (AtomicFields a) (AtomicData a) where
=
indices
ixList$ maybeToList . rgetC)
(ixFun $ maybeToList . rgetC)
(ixFun $ maybeToList . rgetC)
(ixFun $ maybeToList . rgetC)
(ixFun $ maybeToList . rgetC)
(ixFun $ maybeToList . rgetC) (ixFun
```

Now we are finished. Let’s test our indices.

One question I would like to know the answer to is, what elements have a boiling point of less than fifty kelvin. We can encode “A boiling point of fifty kelvin” as a value of FBoilingPoint.

```
bp50 :: FBoilingPoint Double
= Val (50 N.*~ unitCodec @Kelvin) bp50
```

Now we need to use the Ix combinators to filter on
`Ix.@<= bp50`

. These operators can be chained, so we can
add these indefinitely to produce more complex queries. Our run with the
query:

```
main :: IO ()
= do
main <- BS.readFile "chem.csv"
x let y = decodeByName x :: Either String (Header, IxAtomicData Double)
print y
case y of
Left e -> error e
Right (x, xs) -> do
let zs = xs Ix.@<= bp50
"out.csv" $ encodeByName (extractRecHeader (Proxy @(AtomicData Double))) xs BS.writeFile
```

And we should get the following result:

```
AtomicNumber,AtomicMass,Density,MeltingPoint,BoilingPoint,SpecificHeat
1,1.007,8.99e-5,14.175,20.28,14.304
2,4.002,1.79e-4,,4.22,5.193
10,20.18,9.0e-4,24.703,27.07,1.03
```

Thanks to the library authors of dimensional, composite and ixset-typed for making this stack possible.

You can view the code for this on gitlab.homotopic.tech.

Some connective libraries that fell out of this blog post are here:

Contact Us

To get in touch, use any of the contact details below.

Email: dan.firth@homotopic.tech

Phone: +447853047347