In this post we’re going to see how we can stitch together a few libraries to make a unit-aware queryable data frame from a CSV using extensible records. By the end of this text, we’ll be able to parse a CSV of data from the periodic table, complete with the correct units, and able to quickly ask questions about our data set using the generated indices.
The libraries we’ll be using are:
In three stages we will bring this together
Here is the dataset we’ll be using. Let’s first take a look at a snippet of the data. We won’t need all the fields, but the CSV parser will safely ignore fields we don’t use. The ones we will be interested in are:
AtomicNumber,AtomicMass,Density,MeltingPoint,BoilingPoint,SpecificHeat
1,1.007,8.99e-5,14.175,20.28,14.304
2,4.002,1.79e-4,,4.22,5.193
3,6.941,0.534,453.85,1615.0,3.582
4,9.012,1.85,1560.15,2742.0,1.825
5,10.811,2.34,2573.15,4200.0,1.026
6,12.011,2.27,3948.15,4300.0,0.709
7,14.007,1.25e-3,63.29,77.36,1.04
8,15.999,1.43e-3,50.5,90.2,0.918
...
Each of these fields has a label and a unit. It also may or may not be there (in the case of Helium, there is no melting point given).
Using extensible records allows us to declare these fields independently, and then cobble together a record from them. First some imports
import Composite.Record
import Composite.TH
import Data.IxSet.Typed as Ix
import Numeric.Units.Dimensional.Prelude as N
import Numeric.Units.Dimensional.NonSI as N
A field declaration in composite looks like:
type FAtomicMass a = "AtomicMass" :-> Mass a
We can do this for all of our CSV fields, also generate a few lenses at the same time.
withLensesAndProxies [d|
type FAtomicNumber = "AtomicNumber" :-> Int
type FAtomicMass a = "AtomicMass" :-> Mass a
type FDensity a = "Density" :-> Density a
type FMeltingPoint a = "MeltingPoint" :-> ThermodynamicTemperature a
type FBoilingPoint a = "BoilingPoint" :-> ThermodynamicTemperature a
type FSpecificHeat a = "SpecificHeat" :-> SpecificHeatCapacity a |]
Here in particular some of our fields are parameterised by the number type they take. We’ll ultimately work with Double, but stay polymorphic where we can.
A record in composite is expressed as a type level list of the fields we want, a plain record of these fields would look like this:
-- the order won't matter for our parser
type AtomicFields a = '[FAtomicNumber, FAtomicMass a, FMeltingPoint a, FBoilingPoint a, FSpecificHeat a, FDensity a]
type AtomicData a = Record (AtomicFields a)
But recall we still want to express the fact that some of these fields may not exist. Composite has a mechanism for doing that by raising every field over the same Functor, so we can lift our record over the Maybe functor, which is expressed like so:
type AtomicData a = Rec Maybe (AtomicFields a)
So a value of type AtomicData a
represents one row of
our fields, where each field may or may not contain a value. As an
example, here is a row written out manually. In each field of the row,
we multiply the value by its appropriate unit, and this will type
check:
helium :: AtomicData Double
= Just 2
helium :^: Just (4.002 N.*~ dalton)
:^: Nothing
:^: Just (4.22 N.*~ kelvin)
:^: Just (5.193 N.*~ ((meter N.* meter) N./ (second N.* second N.* kelvin)))
:^: Just (1.79e-4 N.*~ (kilo gram N./ litre))
:^: RNil
With our lenses, we can perform transformations using
rlens'
from composite. For example, to create a universe
where everything is twice as dense:
superdense :: AtomicData Double -> AtomicData Double
= over (rlens' (fAtomicMass_ @Double) . _Just) (N._2 N.*) superdense
We want to decode a whole bunch of rows, so we want something like a
list of these things, but instead what we’re going to use is an IxSet,
which will allow us to define our indices later. An
IxSet ixs xs
takes two type parameters, one for the type of
indices that we can use to query the set, and one for the concrete data
type used to represent each row. Normally with nominal data types we’d
have to be quite explicit with the indicies, but we’re fortunate in that
we can use the same type level list we used to define the record as for
the indices.
type IxAtomicData a = IxSet (AtomicFields a) (AtomicData a)
We’ll discuss generating the indices in section 3, but for now this type gives us a base to work around.
Encoding and decoding will be done with the cassava library. Each
direction requires two typeclasses, FromField/ToField
to
supply how to codec a field, and
FromNamedRecord/ToNamedRecord
to supply how to codec the
record as a whole.
We need to somehow derive FromField
and
ToField
for each of our field types. Each contains a
Quantity s a
, but what we don’t want to do is assume what
the unit will be for each Quantity. For example if we derived a
FromField
and ToField
for
ThermodynamicTemperature a
assuming it was always measured
in kelvin, then we’d be unable to consume any data from other data sets
where that dimension was measured in celcius, as the instance would
already be taken. What we need to do is be able to switch the derivation
for each field based on which unit codec we want to apply, and we can do
that with some DerivingVia tricks and typeclass metaprogramming.
Let’s make a newtype with a type parameter that will allow us to switch on the unit.
newtype Codec k a = Codec {unCodec :: a}
Here k is a phantom type that will allow us to give some information to the deriving mechanism. We’ll make some token types to represent this:
data Kelvin
data Dalton
data KGPerLitre
data SHCUnit
Now instead of implementing FromField (Quantity b a)
directly, we can implement
FromField (Codec k (Quantity b a))
, and use k as one of the
above to switch the implementation.
In order to decode a Quantity, we need to multiply the value we
receieve by its unit using the *~
operator, and to encode a
Quantity, we need to divide it by its unit to bring out the undecorated
number value, using the /~
operator. So we need some way of
assigning the unit for each of our type-level tokens. This type family
and type class combination will get us there.
type family UnitFor k a :: Type
class UnitCodec k a where
unitCodec :: UnitFor k a
Now we can just assign reducts for each of our tokens to get at the unit.
type instance UnitFor Kelvin a = Unit 'Metric DThermodynamicTemperature a
type instance UnitFor Dalton a = Unit 'Metric DMass a
type instance UnitFor SHCUnit a = Unit 'NonMetric DSpecificHeatCapacity a
type instance UnitFor KGPerLitre a = Unit 'NonMetric DDensity a
instance Num a => UnitCodec Kelvin a where
= kelvin
unitCodec
instance Floating a => UnitCodec Dalton a where
= dalton
unitCodec
instance Floating a => UnitCodec KGPerLitre a where
= kilo gram N./ litre
unitCodec
instance Floating a => UnitCodec SHCUnit a where
= (meter N.* meter) N./ (second N.* second N.* kelvin) unitCodec
Now we can get to and from any Quantity via our Codec.
instance (FromField a, Num a, UnitCodec k a, UnitFor k a ~ Unit c b a) => FromField (Codec k (Quantity b a)) where
= Codec . (N.*~ unitCodec @k @a) <$> parseField @a m
parseField m
instance (ToField a, Fractional a, UnitCodec k a, UnitFor k a ~ Unit c b a) => ToField (Codec k (Quantity b a)) where
= toField @a . (N./~ unitCodec @k @a) . unCodec toField
Now we can derive all our instances in one line each.
deriving via (Codec Dalton (Mass a)) instance (FromField a, Floating a) => FromField (FAtomicMass a)
deriving via (Codec Dalton (Mass a)) instance (Floating a, ToField a) => ToField (FAtomicMass a)
deriving via (Codec Kelvin (ThermodynamicTemperature a)) instance (FromField a, Floating a) => FromField (FMeltingPoint a)
deriving via (Codec Kelvin (ThermodynamicTemperature a)) instance (Fractional a, ToField a) => ToField (FMeltingPoint a)
deriving via (Codec Kelvin (ThermodynamicTemperature a)) instance (FromField a, Floating a) => FromField (FBoilingPoint a)
deriving via (Codec Kelvin (ThermodynamicTemperature a)) instance (Fractional a, ToField a) => ToField (FBoilingPoint a)
deriving via (Codec KGPerLiter (Density a)) instance (FromField a, Floating a) => FromField (FDensity a)
deriving via (Codec KGPerLiter (Density a)) instance (Floating a, ToField a) => ToField (FDensity a)
deriving via (Codec SHCUnit (SpecificHeatCapacity a)) instance (FromField a, Floating a) => FromField (FSpecificHeat a)
deriving via (Codec SHCUnit (SpecificHeatCapacity a)) instance (Floating a, ToField a) => ToField (FSpecificHeat a)
Implementing the record encoder and decoder as a whole is just a case
of inducting over the elements of the record, assuming we have a
FromField
for each of them. In order to give us some
breathing room, we can derive via a newtype as before, which I have for
no reason called TF
.
newtype TF f xs = TF {unTF :: Rec f xs}
instance FromNamedRecord (TF f '[]) where
= pure $ TF RNil
parseNamedRecord m
instance ToNamedRecord (TF f '[]) where
= mempty toNamedRecord m
instance (Functor f, KnownSymbol s, FromField (f (s :-> x)), FromNamedRecord (TF f xs)) => FromNamedRecord (TF f ((s :-> x) ': xs)) where
= do
parseNamedRecord m <- m .: T.encodeUtf8 (valName @s undefined)
x TF f <- parseNamedRecord @(TF f xs) m
pure $ TF $ x :& f
instance (Functor f, KnownSymbol s, ToField (f (s :-> x)), ToNamedRecord (TF f xs)) => ToNamedRecord (TF f ((s :-> x) ': xs)) where
TF (x :& xs)) = HM.singleton (T.encodeUtf8 (valName @s undefined)) (toField x) <> toNamedRecord (TF xs) toNamedRecord (
deriving via (TF Maybe (AtomicFields Double)) instance ToNamedRecord (AtomicData Double)
deriving via (TF Maybe (AtomicFields Double)) instance FromNamedRecord (AtomicData Double)
We can make a couple of convenience functions to decode to IxSets directly instead of Vectors as cassava does.
decodeByName :: (Ix.Indexable ixs xs, Data.Csv.FromNamedRecord xs) => ByteString -> Either String (Data.Csv.Header, Ix.IxSet ixs xs)
= fmap (fmap (Ix.fromList . V.toList)) . Data.Csv.decodeByName
decodeByName
encodeByName :: (Ix.Indexable ixs xs, Data.Csv.ToNamedRecord xs) => Data.Csv.Header -> Ix.IxSet ixs xs -> ByteString
= Data.Csv.encodeByName h . Ix.toList encodeByName h
At this point, we can test that our decoder and encoder work as advertised.
main :: IO ()
= do
main <- BS.readFile "chem.csv"
x let y = decodeByName x :: Either String (Header, IxAtomicData Double)
print y
case y of
Left x -> error x
Right (x, xs) ->
"out.csv" $ encodeByName ["AtomicNumber", "AtomicMass", "Density", "MeltingPoint", "BoilingPoint", "SpecificHeat"] xs BS.writeFile
Notice that the Show instances will print out in SI units, using kilograms instead of daltons for the atomic mass.
...
Just AtomicMass :-> 1.6721639813999998e-27 kg
...
However our values in “out.csv” will be correctly printed in daltons.
We’ve needed to supply the header here manually, but we already have our type level list, so can generate that as well with some typeclass programming.
class ToHeader x where
extractRecHeader :: Proxy x -> Vector Name
instance ToHeader (Rec f '[]) where
= Data.Vector.fromList []
extractRecHeader _
instance (KnownSymbol s, ToHeader (Rec f xs)) => ToHeader (Rec f (s :-> x ': xs)) where
Proxy = pure (encodeUtf8 $ valName @s undefined) <> extractRecHeader (Proxy @(Rec f xs)) extractRecHeader
"out.csv" $ encodeByName (extractRecHeader (Proxy @(AtomicData Double))) xs BS.writeFile
In order to use the querying functions from IxSet, we need to supply
an Indexable
instance for our data. We use the varyadic
function ixList to generate one index for each list. Since the types are
inferred by the type level list, we can actually use the same function
for each slot, using the lens rgetC
from
Data.Vinyl.Lens
. Each indexing function returns a list, for
indexing on fields like “tags”, where there are multiple fields. We only
have at most one field, so we can just transform to a list. The instance
actually looks like this:
instance Ord a => Ix.Indexable (AtomicFields a) (AtomicData a) where
=
indices
ixList$ maybeToList . rgetC)
(ixFun $ maybeToList . rgetC)
(ixFun $ maybeToList . rgetC)
(ixFun $ maybeToList . rgetC)
(ixFun $ maybeToList . rgetC)
(ixFun $ maybeToList . rgetC) (ixFun
Now we are finished. Let’s test our indices.
One question I would like to know the answer to is, what elements have a boiling point of less than fifty kelvin. We can encode “A boiling point of fifty kelvin” as a value of FBoilingPoint.
bp50 :: FBoilingPoint Double
= Val (50 N.*~ unitCodec @Kelvin) bp50
Now we need to use the Ix combinators to filter on
Ix.@<= bp50
. These operators can be chained, so we can
add these indefinitely to produce more complex queries. Our run with the
query:
main :: IO ()
= do
main <- BS.readFile "chem.csv"
x let y = decodeByName x :: Either String (Header, IxAtomicData Double)
print y
case y of
Left e -> error e
Right (x, xs) -> do
let zs = xs Ix.@<= bp50
"out.csv" $ encodeByName (extractRecHeader (Proxy @(AtomicData Double))) xs BS.writeFile
And we should get the following result:
AtomicNumber,AtomicMass,Density,MeltingPoint,BoilingPoint,SpecificHeat
1,1.007,8.99e-5,14.175,20.28,14.304
2,4.002,1.79e-4,,4.22,5.193
10,20.18,9.0e-4,24.703,27.07,1.03
Thanks to the library authors of dimensional, composite and ixset-typed for making this stack possible.
You can view the code for this on gitlab.homotopic.tech.
Some connective libraries that fell out of this blog post are here: