Unit-aware data frames with composite, dimensional and ixset-typed.


In this post we’re going to see how we can stitch together a few libraries to make a unit-aware queryable data frame from a CSV using extensible records. By the end of this text, we’ll be able to parse a CSV of data from the periodic table, complete with the correct units, and able to quickly ask questions about our data set using the generated indices.

The libraries we’ll be using are:

  • composite-base for extensible records.
  • dimensional to give each field its correct unit.
  • ixset-typed to generate indices for our data set to query.
  • cassava to parse and print the CSV.
  • vinyl for supplying some extra functions for extensible records.

In three stages we will bring this together

  • 1: Defining a suitable target data type for storing the data frame.
  • 2: Parsing and printing the CSV by
    • 2.1: Proving how to decode and encode each field via its correct unit.
    • 2.2: Proving how to decode and encode each record as a whole.
  • 3: Generating the indices and querying our data for facts.

Here is the dataset we’ll be using. Let’s first take a look at a snippet of the data. We won’t need all the fields, but the CSV parser will safely ignore fields we don’t use. The ones we will be interested in are:

AtomicNumber,AtomicMass,Density,MeltingPoint,BoilingPoint,SpecificHeat
1,1.007,8.99e-5,14.175,20.28,14.304
2,4.002,1.79e-4,,4.22,5.193
3,6.941,0.534,453.85,1615.0,3.582
4,9.012,1.85,1560.15,2742.0,1.825
5,10.811,2.34,2573.15,4200.0,1.026
6,12.011,2.27,3948.15,4300.0,0.709
7,14.007,1.25e-3,63.29,77.36,1.04
8,15.999,1.43e-3,50.5,90.2,0.918
...

Each of these fields has a label and a unit. It also may or may not be there (in the case of Helium, there is no melting point given).

Using extensible records allows us to declare these fields independently, and then cobble together a record from them. First some imports

import Composite.Record
import Composite.TH
import Data.IxSet.Typed as Ix
import Numeric.Units.Dimensional.Prelude as N
import Numeric.Units.Dimensional.NonSI as N

A field declaration in composite looks like:

type FAtomicMass a = "AtomicMass" :-> Mass a

We can do this for all of our CSV fields, also generate a few lenses at the same time.

withLensesAndProxies [d|
  
  type FAtomicNumber = "AtomicNumber" :-> Int
  
  type FAtomicMass a = "AtomicMass" :-> Mass a
  
  type FDensity a = "Density" :-> Density a
  
  type FMeltingPoint a = "MeltingPoint" :-> ThermodynamicTemperature a
  
  type FBoilingPoint a = "BoilingPoint" :-> ThermodynamicTemperature a
  
  type FSpecificHeat a = "SpecificHeat" :-> SpecificHeatCapacity a
  |]

Here in particular some of our fields are parameterised by the number type they take. We’ll ultimately work with Double, but stay polymorphic where we can.

A record in composite is expressed as a type level list of the fields we want, a plain record of these fields would look like this:

-- the order won't matter for our parser
type AtomicFields a = '[FAtomicNumber, FAtomicMass a, FMeltingPoint a, FBoilingPoint a, FSpecificHeat a, FDensity a]
  
type AtomicData a = Record (AtomicFields a)

But recall we still want to express the fact that some of these fields may not exist. Composite has a mechanism for doing that by raising every field over the same Functor, so we can lift our record over the Maybe functor, which is expressed like so:

type AtomicData a = Rec Maybe (AtomicFields a)

So a value of type AtomicData a represents one row of our fields, where each field may or may not contain a value. As an example, here is a row written out manually. In each field of the row, we multiply the value by its appropriate unit, and this will type check:

helium :: AtomicData Double
helium = Just 2
   :^: Just (4.002 N.*~ dalton)
   :^: Nothing
   :^: Just (4.22 N.*~ kelvin)
   :^: Just (5.193 N.*~ ((meter N.* meter) N./ (second N.* second N.* kelvin)))
   :^: Just (1.79e-4 N.*~ (kilo gram N./ litre))
   :^: RNil

With our lenses, we can perform transformations using rlens' from composite. For example, to create a universe where everything is twice as dense:

superdense :: AtomicData Double -> AtomicData Double
superdense = over (rlens' (fAtomicMass_ @Double) . _Just) (N._2 N.*)

We want to decode a whole bunch of rows, so we want something like a list of these things, but instead what we’re going to use is an IxSet, which will allow us to define our indices later. An IxSet ixs xs takes two type parameters, one for the type of indices that we can use to query the set, and one for the concrete data type used to represent each row. Normally with nominal data types we’d have to be quite explicit with the indicies, but we’re fortunate in that we can use the same type level list we used to define the record as for the indices.

type IxAtomicData a = IxSet (AtomicFields a) (AtomicData a)

We’ll discuss generating the indices in section 3, but for now this type gives us a base to work around.

Encoding and Decoding

Encoding and decoding will be done with the cassava library. Each direction requires two typeclasses, FromField/ToField to supply how to codec a field, and FromNamedRecord/ToNamedRecord to supply how to codec the record as a whole.

Fields

We need to somehow derive FromField and ToField for each of our field types. Each contains a Quantity s a, but what we don’t want to do is assume what the unit will be for each Quantity. For example if we derived a FromField and ToField for ThermodynamicTemperature a assuming it was always measured in kelvin, then we’d be unable to consume any data from other data sets where that dimension was measured in celcius, as the instance would already be taken. What we need to do is be able to switch the derivation for each field based on which unit codec we want to apply, and we can do that with some DerivingVia tricks and typeclass metaprogramming.

Let’s make a newtype with a type parameter that will allow us to switch on the unit.

newtype Codec k a = Codec {unCodec :: a}

Here k is a phantom type that will allow us to give some information to the deriving mechanism. We’ll make some token types to represent this:

data Kelvin
  
data Dalton
  
data KGPerLitre
  
data SHCUnit

Now instead of implementing FromField (Quantity b a) directly, we can implement FromField (Codec k (Quantity b a)), and use k as one of the above to switch the implementation.

In order to decode a Quantity, we need to multiply the value we receieve by its unit using the *~ operator, and to encode a Quantity, we need to divide it by its unit to bring out the undecorated number value, using the /~ operator. So we need some way of assigning the unit for each of our type-level tokens. This type family and type class combination will get us there.

type family UnitFor k a :: Type
  
class UnitCodec k a where
  unitCodec :: UnitFor k a

Now we can just assign reducts for each of our tokens to get at the unit.

type instance UnitFor Kelvin a = Unit 'Metric DThermodynamicTemperature a
  
type instance UnitFor Dalton a = Unit 'Metric DMass a
  
type instance UnitFor SHCUnit a = Unit 'NonMetric DSpecificHeatCapacity a
  
type instance UnitFor KGPerLitre a = Unit 'NonMetric DDensity a
  
instance Num a => UnitCodec Kelvin a where
  unitCodec = kelvin
  
instance Floating a => UnitCodec Dalton a where
  unitCodec = dalton
  
instance Floating a => UnitCodec KGPerLitre a where
  unitCodec = kilo gram N./ litre
  
instance Floating a => UnitCodec SHCUnit a where
  unitCodec = (meter N.* meter) N./ (second N.* second N.* kelvin)

Now we can get to and from any Quantity via our Codec.

instance (FromField a, Num a, UnitCodec k a, UnitFor k a ~ Unit c b a) => FromField (Codec k (Quantity b a)) where
  parseField m = Codec . (N.*~ unitCodec @k @a) <$> parseField @a m
  
instance (ToField a, Fractional a, UnitCodec k a, UnitFor k a ~ Unit c b a) => ToField (Codec k (Quantity b a)) where
  toField = toField @a . (N./~ unitCodec @k @a) . unCodec

Now we can derive all our instances in one line each.

deriving via (Codec Dalton (Mass a)) instance (FromField a, Floating a) => FromField (FAtomicMass a)
  
deriving via (Codec Dalton (Mass a)) instance (Floating a, ToField a) => ToField (FAtomicMass a)
  
deriving via (Codec Kelvin (ThermodynamicTemperature a)) instance (FromField a, Floating a) => FromField (FMeltingPoint a)
   
deriving via (Codec Kelvin (ThermodynamicTemperature a)) instance (Fractional a, ToField a) => ToField (FMeltingPoint a)
  
deriving via (Codec Kelvin (ThermodynamicTemperature a)) instance (FromField a, Floating a) => FromField (FBoilingPoint a)
  
deriving via (Codec Kelvin (ThermodynamicTemperature a)) instance (Fractional a, ToField a) => ToField (FBoilingPoint a)
  
deriving via (Codec KGPerLiter (Density a)) instance (FromField a, Floating a) => FromField (FDensity a)
  
deriving via (Codec KGPerLiter (Density a)) instance (Floating a, ToField a) => ToField (FDensity a)
  
deriving via (Codec SHCUnit (SpecificHeatCapacity a)) instance (FromField a, Floating a) => FromField (FSpecificHeat a)
  
deriving via (Codec SHCUnit (SpecificHeatCapacity a)) instance (Floating a, ToField a) => ToField (FSpecificHeat a)

Record

Implementing the record encoder and decoder as a whole is just a case of inducting over the elements of the record, assuming we have a FromField for each of them. In order to give us some breathing room, we can derive via a newtype as before, which I have for no reason called TF.

newtype TF f xs = TF {unTF :: Rec f xs}
  
instance FromNamedRecord (TF f '[]) where
  parseNamedRecord m = pure $ TF RNil
  
instance ToNamedRecord (TF f '[]) where
  toNamedRecord m = mempty
instance (Functor f, KnownSymbol s, FromField (f (s :-> x)), FromNamedRecord (TF f xs)) => FromNamedRecord (TF f ((s :-> x) ': xs)) where
  parseNamedRecord m = do
    x <- m .: T.encodeUtf8 (valName @s undefined)
    TF f <- parseNamedRecord @(TF f xs) m
    pure $ TF $ x :& f
  
instance (Functor f, KnownSymbol s, ToField (f (s :-> x)), ToNamedRecord (TF f xs)) => ToNamedRecord (TF f ((s :-> x) ': xs)) where
  toNamedRecord (TF (x :& xs)) = HM.singleton (T.encodeUtf8 (valName @s undefined)) (toField x) <> toNamedRecord (TF xs)
deriving via (TF Maybe (AtomicFields Double)) instance ToNamedRecord (AtomicData Double)  
  
deriving via (TF Maybe (AtomicFields Double)) instance FromNamedRecord (AtomicData Double)

We can make a couple of convenience functions to decode to IxSets directly instead of Vectors as cassava does.

decodeByName :: (Ix.Indexable ixs xs, Data.Csv.FromNamedRecord xs) => ByteString -> Either String (Data.Csv.Header, Ix.IxSet ixs xs)
decodeByName = fmap (fmap (Ix.fromList . V.toList)) . Data.Csv.decodeByName
  
encodeByName :: (Ix.Indexable ixs xs, Data.Csv.ToNamedRecord xs) => Data.Csv.Header -> Ix.IxSet ixs xs -> ByteString
encodeByName h = Data.Csv.encodeByName h . Ix.toList

At this point, we can test that our decoder and encoder work as advertised.

main :: IO ()
main = do
  x <- BS.readFile "chem.csv"
  let y = decodeByName x :: Either String (Header, IxAtomicData Double)
  print y
  case y of
    Left x -> error x
    Right (x, xs) -> 
      BS.writeFile "out.csv" $ encodeByName ["AtomicNumber", "AtomicMass", "Density", "MeltingPoint", "BoilingPoint", "SpecificHeat"] xs 

Notice that the Show instances will print out in SI units, using kilograms instead of daltons for the atomic mass.

...
Just AtomicMass :-> 1.6721639813999998e-27 kg
...

However our values in “out.csv” will be correctly printed in daltons.

We’ve needed to supply the header here manually, but we already have our type level list, so can generate that as well with some typeclass programming.

class ToHeader x where
  extractRecHeader :: Proxy x -> Vector Name

instance ToHeader (Rec f '[]) where
  extractRecHeader _ = Data.Vector.fromList []

instance (KnownSymbol s, ToHeader (Rec f xs)) => ToHeader (Rec f (s :-> x ': xs)) where
  extractRecHeader Proxy = pure (encodeUtf8 $ valName @s undefined) <> extractRecHeader (Proxy @(Rec f xs))
    BS.writeFile "out.csv" $ encodeByName (extractRecHeader (Proxy @(AtomicData Double))) xs 

Generating the indices.

In order to use the querying functions from IxSet, we need to supply an Indexable instance for our data. We use the varyadic function ixList to generate one index for each list. Since the types are inferred by the type level list, we can actually use the same function for each slot, using the lens rgetC from Data.Vinyl.Lens. Each indexing function returns a list, for indexing on fields like “tags”, where there are multiple fields. We only have at most one field, so we can just transform to a list. The instance actually looks like this:

instance Ord a => Ix.Indexable (AtomicFields a) (AtomicData a) where
  indices =
    ixList
      (ixFun $ maybeToList . rgetC)
      (ixFun $ maybeToList . rgetC)
      (ixFun $ maybeToList . rgetC)
      (ixFun $ maybeToList . rgetC)
      (ixFun $ maybeToList . rgetC)
      (ixFun $ maybeToList . rgetC)

Now we are finished. Let’s test our indices.

One question I would like to know the answer to is, what elements have a boiling point of less than fifty kelvin. We can encode “A boiling point of fifty kelvin” as a value of FBoilingPoint.

bp50 :: FBoilingPoint Double
bp50 = Val (50 N.*~ unitCodec @Kelvin)

Now we need to use the Ix combinators to filter on Ix.@<= bp50. These operators can be chained, so we can add these indefinitely to produce more complex queries. Our run with the query:

main :: IO ()
main = do
  x <- BS.readFile "chem.csv"
  let y = decodeByName x :: Either String (Header, IxAtomicData Double)
  print y
  case y of
    Left e -> error e
    Right (x, xs) -> do
      let zs = xs Ix.@<= bp50
      BS.writeFile "out.csv" $ encodeByName (extractRecHeader (Proxy @(AtomicData Double))) xs 

And we should get the following result:

AtomicNumber,AtomicMass,Density,MeltingPoint,BoilingPoint,SpecificHeat
1,1.007,8.99e-5,14.175,20.28,14.304
2,4.002,1.79e-4,,4.22,5.193
10,20.18,9.0e-4,24.703,27.07,1.03

Thanks to the library authors of dimensional, composite and ixset-typed for making this stack possible.

You can view the code for this on gitlab.homotopic.tech.

Some connective libraries that fell out of this blog post are here: