I recently gave a talk at FP-Syd on my neural network library Grenade. In some sense, I did this because on a previous evening there, after mentioning that I was writing one, I was told that “everybody has a neural network library”. I felt that I should present what makes Grenade different from other ANN libraries out there.
Grenade is a dependently typed, high level, artificial neural network library written in Haskell. It’s pretty robust, fast, and expressive, and uses some interesting type level programming techniques which don’t appear to have been used in neural networks before.
I have found that its purely functional nature and type level features make development and composition of neural networks fast and easy. Indeed, I was able to write my first generative adversarial network based on high level descriptions in a few hours, using no other libraries for reference.
As a very simple example, I was able to write a convolutional neural network for the classic MNIST challenge in just a few lines of code:
type MNIST
  = Network
    '[ Convolution 1 10 5 5 1 1, Pooling 2 2 2 2, Relu
     , Convolution 10 16 5 5 1 1, Pooling 2 2 2 2, Relu
     , Flatten, FullyConnected 256 80, Logit
     , FullyConnected 80 10, Logit ]
    '[ 'D2 28 28
     , 'D3 24 24 10, 'D3 12 12 10, 'D3 12 12 10
     , 'D3 8 8 16, 'D3 4 4 16, 'D3 4 4 16
     , 'D1 256, 'D1 80, 'D1 80
     , 'D1 10, 'D1 10 ]

randomMnist :: MonadRandom m => m MNIST
randomMnist = randomNetwork
MNIST is a type alias for our neural network. This type contains not only a type level list of the layers of the network, but also a type level list of the shapes of data which are passed between these layers.
What’s really interesting here is that constructing this network with random weights takes a single call to the function randomNetwork. That is, our types are so rich that no specific term level code is required to create this network.
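The shape list in the MNIST type follows from simple window arithmetic, which can be sketched at the term level. This is only an illustration and not part of Grenade: windowOut, mnistWindows, mnistSizes and flattenedSize are hypothetical names, and the sketch assumes that the last four parameters of Convolution and Pooling are kernel rows, kernel columns, row stride and column stride, with no padding.

```haskell
-- Hypothetical helper (not part of Grenade): output size along one dimension
-- for a sliding window with no padding.
windowOut :: Int -> Int -> Int -> Int
windowOut input kernel stride = (input - kernel) `div` stride + 1

-- (kernel, stride) for each windowed layer of the MNIST network, in order:
-- 5x5 convolution, 2x2 pooling, 5x5 convolution, 2x2 pooling.
mnistWindows :: [(Int, Int)]
mnistWindows = [(5, 1), (2, 2), (5, 1), (2, 2)]

-- Side length of the image after each windowed layer, starting from 28.
mnistSizes :: [Int]
mnistSizes =
  scanl (\size (kernel, stride) -> windowOut size kernel stride) 28 mnistWindows

-- Flattening the final square image with 16 channels gives the input width
-- of the first fully connected layer.
flattenedSize :: Int
flattenedSize = last mnistSizes * last mnistSizes * 16
```

Here mnistSizes evaluates to [28, 24, 12, 8, 4] and flattenedSize to 256, which lines up with the progression from 'D2 28 28 to 'D1 256 in the type. In Grenade these relationships are expressed in the types rather than computed at run time.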
Deep learning is a big field. Today speech is understood and generated with LSTM networks, language is translated, images are classified, videos are captioned, and faces are recognised.
The classic picture of a stack of fully connected neurons with a non-linear activation function does not cut it these days, and simple networks using this structure are actually probably better replaced with traditional machine learning techniques such as gradient boosted trees, SVMs, or even linear learners.
Neural networks today are composed of layers, arranged into a directed graph. Usually these graphs are acyclic, but recurrent networks may involve cyclic components as well. It is the composition of these layers which gives modern neural nets their expressiveness.
The variety of layers is broad, and the Caffe library offers over 50 in its zoo.
Many layers have learnable parameters. Convolutional layers, for instance, learn kernel filters. They are trained using reverse mode automatic differentiation, which calculates the partial derivative of the output loss function with respect to every trainable component. An optimiser then decides how far to move the weights along the gradient, so as to make the network more likely to give the correct answer.
Automatic differentiation is really cool, but relatively unknown. It’s its own beast, and works by carrying along partial derivatives at each stage of a complex computation.
A forward mode implementation is relatively trivial, and has been described elsewhere but is simple enough to write here as well.
module AD where

-- | A value paired with its derivative. The s parameter is a phantom type.
data D s a = D a a deriving (Eq, Show)

-- | A constant: its derivative is 0.
lift :: Num a => a -> D s a
lift x = D x 0

-- | The variable we differentiate with respect to: its derivative is 1.
dependent :: Num a => a -> D s a
dependent x = D x 1

instance Ord a => Ord (D s a) where
  compare (D x _) (D y _) = compare x y

instance Num a => Num (D s a) where
  D x x' + D y y' = D (x + y) (x' + y')
  D x x' * D y y' = D (x * y) (x' * y + x * y')
  negate (D x x') = D (negate x) (negate x')
  abs (D x x')    = D (abs x) (signum x * x')
  signum (D x _)  = lift (signum x)
  fromInteger x   = lift (fromInteger x)

instance Floating a => Floating (D s a) where
  pi = D pi 0
  exp (D x x')   = D (exp x) (exp x * x')
  log (D x x')   = D (log x) (x' / x)
  sin (D x x')   = D (sin x) (cos x * x')
  cos (D x x')   = D (cos x) (-sin x * x')
  asin (D x x')  = D (asin x) (x' / sqrt (1 - x^2))
  acos (D x x')  = D (acos x) (-x' / sqrt (1 - x^2))
  atan (D x x')  = D (atan x) (x' / (1 + x^2))
  sinh (D x x')  = D (sinh x) (x' * cosh x)
  cosh (D x x')  = D (cosh x) (x' * sinh x)
  asinh (D x x') = D (asinh x) (x' / sqrt (1 + x^2))
  acosh (D x x') = D (acosh x) (x' / (sqrt (x - 1) * sqrt (x + 1)))
  atanh (D x x') = D (atanh x) (x' / (1 - x^2))

instance Fractional a => Fractional (D s a) where
  recip (D x x') = D (recip x) (-x'/x/x)
  fromRational x = lift (fromRational x)

-- Numerical differentiation (for tests)
numerical :: (Fractional a, Num a) => (a -> a) -> a -> a
numerical f x =
  let bigger  = f (x + 0.000005)
      smaller = f (x - 0.000005)
  in  (bigger - smaller) / 0.00001
The main data type D s a = D a a carries its value in the first position and its gradient in the second position. The s type is a phantom type to ensure we can’t mix up gradients for different calculations, in a similar way to how runST has a phantom type to ensure refs can’t be mixed between mutable calculations.
The idea is that we can create a variable with its derivative using dependent. This variable has a derivative of 1, since, if we change the value, the value will change by the same amount :). When we apply a function to this variable, its new derivative is calculated as well.
What’s nice about this is that we can calculate the partial derivative of any calculation with respect to that variable in a single pass.
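To make the single pass concrete, here is a minimal, self-contained sketch of forward mode in use. It is a cut down variant of the idea above rather than the module itself: Dual is monomorphic and has no phantom type, and the names Dual, diff and cubicPlus are chosen purely for illustration.

```haskell
-- A value and its derivative (a cut down, monomorphic dual number).
data Dual = Dual Double Double deriving (Eq, Show)

instance Num Dual where
  Dual x x' + Dual y y' = Dual (x + y) (x' + y')
  Dual x x' * Dual y y' = Dual (x * y) (x' * y + x * y')
  negate (Dual x x')    = Dual (negate x) (negate x')
  abs (Dual x x')       = Dual (abs x) (signum x * x')
  signum (Dual x _)     = Dual (signum x) 0
  fromInteger n         = Dual (fromInteger n) 0

-- Seed the input with a derivative of 1, run the function once, and read
-- the derivative off the result.
diff :: (Dual -> Dual) -> Double -> Double
diff f x = let Dual _ d = f (Dual x 1) in d

-- An ordinary polymorphic function: x^3 + 2x. It knows nothing about Dual.
cubicPlus :: Num a => a -> a
cubicPlus x = x * x * x + 2 * x
```

Evaluating diff cubicPlus 3 gives 29.0, which agrees with the hand derivative 3x^2 + 2 at x = 3. The function being differentiated is written once against Num and needs no changes to support differentiation.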
Reverse mode is a little bit more complex to write in Haskell, but is also implemented in the ad library.
The idea is to keep a list (called the Wengert tape) of the values entering every step of the computation. Then, once the gradient we want to back propagate is calculated, we can work backwards through the calculation, calculating the new partial derivatives at every point.
Again, what’s nice is that we can calculate the partial derivative for every weight and learnable parameter in the system using a single forwards and backwards pass. If we were to use numerical differentiation, we’d instead need to run two calculations for every partial derivative we wanted to calculate. As neural nets often have over a million learnable parameters (with some well over a billion), this is an obvious advantage.
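The tape idea can be sketched on a toy problem. The following is not how Grenade or the ad library implement reverse mode; it is a minimal illustration which assumes a chain of scalar layers, each computing tanh (w * a), and the names forwardPass, backwardPass and numericalGrads are hypothetical.

```haskell
-- Forward pass through a chain of scalar layers, each computing tanh (w * a).
-- The tape records the input to every layer, most recent first.
forwardPass :: [Double] -> Double -> ([Double], Double)
forwardPass ws x0 = foldl step ([], x0) ws
  where
    step (tape, a) w = (a : tape, tanh (w * a))

-- Backward pass: given the weights, the tape, and the gradient of the loss
-- with respect to the output, return the gradient for every weight (in the
-- original weight order) using one sweep over the tape.
backwardPass :: [Double] -> [Double] -> Double -> [Double]
backwardPass ws tape seed = snd (foldl step (seed, []) (zip (reverse ws) tape))
  where
    step (g, grads) (w, a) =
      let local = 1 - tanh (w * a) ^ (2 :: Int) -- derivative of tanh at w * a
          gPre  = g * local                     -- gradient just before the tanh
      in  (gPre * w, gPre * a : grads)          -- pass back to the input; record dw

-- Numerical differentiation for comparison: two full forward runs per weight.
numericalGrads :: [Double] -> Double -> [Double]
numericalGrads ws x =
  [ (run (bump i h) - run (bump i (-h))) / (2 * h) | i <- [0 .. length ws - 1] ]
  where
    h        = 1e-6
    run ws'  = snd (forwardPass ws' x)
    bump i d = [ if j == i then w + d else w | (j, w) <- zip [0 ..] ws ]
```

With n weights, backwardPass does one sweep over the tape, while numericalGrads runs the whole chain 2n times. That difference is what makes reverse mode practical once n reaches the millions.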
It must be fast. Some models can take days or more to train, so efficiency is important. In Haskell, this unfortunately means that the ad library can’t be used in anger, as it requires a Functor constraint and therefore can’t use BLAS (there’s an issue to fix this).
It must be compositional. Some of the bigger neural nets involve hundreds of layers organised into large complex graphs.
To understand Grenade’s design, it pays to first read Justin Le’s blog on dependently typed fully connected networks, which provided many of the type level ideas I needed. I would also note that some of the types given here may not be exactly what is in the library, as I’ve excluded a few constraints which don’t aid understanding.
Grenade’s neural networks pass multidimensional matrices between layers. These matrices need not be of the same dimensionality, so we define a Shape type
data Shape
  = D1 Nat
  | D2 Nat Nat
  | D3 Nat Nat Nat
We can make data of these shapes using GADTs and data kinds
data S (n :: Shape) where
  S1D :: ( KnownNat o )
      => R o
      -> S ('D1 o)

  S2D :: ( KnownNat rows, KnownNat columns )
      => L rows columns
      -> S ('D2 rows columns)

  S3D :: ( KnownNat rows
         , KnownNat columns
         , KnownNat depth)
      => L (rows * depth) columns
      -> S ('D3 rows columns depth)
I have also written some singletons instances for these shapes, which help implement a good few functions, such as fromIntegral. This helps us march up and down between terms, types and kinds, and work at the type level.
I couldn’t actually find any other examples of writing singletons without Template Haskell, so I’ll print them here as an example of what this looks like.
data instance Sing (n :: Shape) where
  D1Sing :: KnownNat a => Sing ('D1 a)
  D2Sing :: (KnownNat a, KnownNat b) => Sing ('D2 a b)
  D3Sing :: (KnownNat a, KnownNat b, KnownNat c) => Sing ('D3 a b c)

instance KnownNat a => SingI ('D1 a) where
  sing = D1Sing

instance (KnownNat a, KnownNat b)
  => SingI ('D2 a b) where
  sing = D2Sing

instance (KnownNat a, KnownNat b, KnownNat c)
  => SingI ('D3 a b c) where
  sing = D3Sing
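The instances above rely on the singletons package, but the underlying idea of recovering term level values from a type level Shape can be sketched using base alone. This is a simplified stand-in, not Grenade's API: the KnownShape class and the functions shapeDims and elementCount are hypothetical names.

```haskell
{-# LANGUAGE DataKinds #-}
{-# LANGUAGE KindSignatures #-}
{-# LANGUAGE ScopedTypeVariables #-}

import Data.Proxy (Proxy (..))
import GHC.TypeLits (KnownNat, Nat, natVal)

data Shape
  = D1 Nat
  | D2 Nat Nat
  | D3 Nat Nat Nat

-- Hypothetical class: reflect a type level Shape down to its dimensions.
class KnownShape (s :: Shape) where
  shapeDims :: Proxy s -> [Integer]

instance KnownNat a => KnownShape ('D1 a) where
  shapeDims _ = [natVal (Proxy :: Proxy a)]

instance (KnownNat a, KnownNat b) => KnownShape ('D2 a b) where
  shapeDims _ = [natVal (Proxy :: Proxy a), natVal (Proxy :: Proxy b)]

instance (KnownNat a, KnownNat b, KnownNat c) => KnownShape ('D3 a b c) where
  shapeDims _ =
    [ natVal (Proxy :: Proxy a)
    , natVal (Proxy :: Proxy b)
    , natVal (Proxy :: Proxy c) ]

-- Total number of elements held by a value of the given shape.
elementCount :: KnownShape s => Proxy s -> Integer
elementCount = product . shapeDims
```

For example, shapeDims (Proxy :: Proxy ('D3 24 24 10)) gives [24, 24, 10]. A singleton goes further than this sketch, because pattern matching on D1Sing, D2Sing or D3Sing also tells the compiler which shape it is dealing with.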
Next we define what a layer is in Grenade. There are actually two associated classes.
class UpdateLayer x where
  type Gradient x :: *
  runUpdate       :: LearningParameters -> x -> Gradient x -> x
  createRandom    :: MonadRandom m => m x

class UpdateLayer x => Layer x (i :: Shape) (o :: Shape) where
  -- | The Wengert tape for this layer.
  type Tape x i o :: *
  -- | Take the input from the previous
  --   layer, and give the output from this layer.
  runForwards    :: x -> S i -> (Tape x i o, S o)
  -- | Back propagate a step, computing the gradients
  runBackwards   :: x -> Tape x i o -> S o
                 -> (Gradient x, S i)
The first class, UpdateLayer, defines what data structure we can use to pass around the Gradient for this layer, as well as how we can perform a step of stochastic gradient descent.
The second class, Layer, is used to define what shapes of data a layer can transform between. There’s no limit on how many instances of Layer a single type can take, as there may be several shapes on which a layer can work.
The Tape type describes what data is needed to be able to calculate the partial derivatives leading into the layer (and the layer’s gradients) given the partial derivatives of its output.
The reason for the two classes is simply that we don’t want to have to provide Proxys for the input and output shapes when they aren’t required, nor duplicate the runUpdate function and friends when there is more than one set of shapes which can be transformed.
As an example, here’s the definition of the Tanh layer. This is a relatively simple layer, which simply applies the non-linear hyperbolic tangent function to the activations coming into the layer.
instance UpdateLayer Tanh where
  type Gradient Tanh = ()
  runUpdate _ Tanh () = Tanh
  createRandom        = return Tanh

instance (a ~ b, SingI a) => Layer Tanh a b where
  type Tape Tanh a b = S a
  runForwards _ a      = (a, tanh a)
  runBackwards _ a g   = ((), tanh' a * g)
One can see that type Gradient Tanh = (), which means that there are no learnable parameters for this layer.
The single Layer instance states that the layer can perform the transformation for any input and output shapes, as long as they are equal, be they 1, 2, or 3 dimensional.
A layer with slightly more interesting Layer instances is the Reshape layer. One of these is
instance (KnownNat a, KnownNat x, KnownNat y, a ~ (x * y))
  => Layer Reshape ('D2 x y) ('D1 a) where
  ...
which says that we can flatten a 2 dimensional image into a single vector, but only if the total number of points is equal. This is one of many instances of the Reshape layer.
Finally, an example of a Layer with learnable parameters is the FullyConnected layer.
instance (KnownNat i, KnownNat o)
  => UpdateLayer (FullyConnected i o) where
  type Gradient (FullyConnected i o) = FullyConnected' i o
  ...

instance (KnownNat i, KnownNat o)
  => Layer (FullyConnected i o) ('D1 i) ('D1 o) where
  type Tape (FullyConnected i o) ('D1 i) ('D1 o) = S ('D1 i)
  ...
where the FullyConnected' type holds the weight matrix for the bias and activation neurons.
With these building blocks, the definition of a Network is simple:
data Network :: [*] -> [Shape] -> * where
  NNil  :: SingI i => Network '[] '[i]
  (:~>) :: (SingI i, SingI h, Layer x i h)
        => !x -> !(Network xs (h ': hs))
        -> Network (x ': xs) (i ': h ': hs)
There are also some very similar heterogeneous list like data types for Gradients and Tapes, which are parameterised by the layers and shapes in a similar way to the Network.
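The heterogeneous list structure can be seen in isolation with a much smaller toy. This sketch deliberately drops the shape index, the tapes and the gradients, so it is not Grenade's Network. ToyLayer, ToyNetwork, Scale, Square and runToy are all hypothetical names, and each layer simply transforms a single Double.

```haskell
{-# LANGUAGE DataKinds #-}
{-# LANGUAGE GADTs #-}
{-# LANGUAGE KindSignatures #-}
{-# LANGUAGE TypeOperators #-}

import Data.Kind (Type)

-- Hypothetical toy layers acting on a single Double.
newtype Scale = Scale Double
data Square = Square

class ToyLayer x where
  forward :: x -> Double -> Double

instance ToyLayer Scale where
  forward (Scale k) a = k * a

instance ToyLayer Square where
  forward Square a = a * a

-- A list of layers whose element types are tracked in the network's type.
data ToyNetwork :: [Type] -> Type where
  TNil  :: ToyNetwork '[]
  (:~>) :: ToyLayer x => x -> ToyNetwork xs -> ToyNetwork (x ': xs)
infixr 5 :~>

-- Run the layers in order, feeding each output into the next layer.
runToy :: ToyNetwork layers -> Double -> Double
runToy TNil a         = a
runToy (x :~> rest) a = runToy rest (forward x a)

example :: ToyNetwork '[Scale, Square, Scale]
example = Scale 2 :~> Square :~> Scale 0.5 :~> TNil
```

Here runToy example 3 evaluates to 18.0. Grenade's Network adds a second type level list for the shapes, so that the :~> constructor also has to prove that each layer's output shape is the next layer's input shape.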
Three functions then provide all we really need to do machine learning
-- | Running a network forwards with some input data.
--
--   This gives the output, and the Wengert tape required for back
--   propagation.
--
runNetwork :: forall layers shapes.
              Network layers shapes
           -> S (Head shapes)
           -> (Tapes layers shapes, S (Last shapes))

-- | Running a loss gradient back through the network.
--
--   This requires a Wengert tape, generated with the appropriate input
--   for the loss.
--
--   Gives the gradients for the layer, and the gradient across the
--   input (which may not be required).
--
runGradient :: forall layers shapes.
               Network layers shapes
            -> Tapes layers shapes
            -> S (Last shapes)
            -> (Gradients layers, S (Head shapes))

-- | Apply one step of stochastic gradient descent across the network.
applyUpdate :: LearningParameters
            -> Network layers shapes
            -> Gradients layers
            -> Network layers shapes
What’s really nice about these functions is that they directly mirror the functions of the layer classes. In fact, this match works so well that we can embed a Network as a Layer in another Network:
instance UpdateLayer (Network sublayers subshapes) where
  type Gradient (Network sublayers subshapes) = Gradients sublayers
  runUpdate    = applyUpdate
  createRandom = randomNetwork

instance ( i ~ (Head subshapes)
         , o ~ (Last subshapes)
         ) => Layer (Network sublayers subshapes) i o where
  type Tape (Network sublayers subshapes) i o = Tapes sublayers subshapes
  runForwards  = runNetwork
  runBackwards = runGradient
Combined with a Concat layer, which runs two layers in parallel and concatenates their output, we can now recreate directed acyclic graphs of layers in Grenade in a composable manner.
As an example, we can train two parallel convolutional Networks for MNIST using different kernel sizes, and combine their output before running the fully connected layers.
type MNIST5x5
  = Network
    '[ Convolution 1 10 5 5 1 1, Pooling 2 2 2 2, Relu
     , Convolution 10 16 5 5 1 1, Pooling 2 2 2 2, Relu
     , FlattenLayer ]
    '[ 'D2 28 28, 'D3 24 24 10, 'D3 12 12 10, 'D3 12 12 10
     , 'D3 8 8 16, 'D3 4 4 16, 'D3 4 4 16, 'D1 256 ]

type MNIST7x7
  = Network
    '[ Convolution 1 10 7 7 1 1, Pooling 2 2 2 2, Relu
     , Convolution 10 16 4 4 1 1, Pooling 2 2 2 2, Relu
     , FlattenLayer ]
    '[ 'D2 28 28, 'D3 22 22 10, 'D3 11 11 10, 'D3 11 11 10
     , 'D3 8 8 16, 'D3 4 4 16, 'D3 4 4 16, 'D1 256 ]

type MNIST
  = Network
    '[ Concat ('D1 256) MNIST5x5 ('D1 256) MNIST7x7, FlattenLayer
     , FullyConnected 512 80, Logit, FullyConnected 80 10
     , Logit ]
    '[ 'D2 28 28, 'D2 2 256, 'D1 512, 'D1 80, 'D1 80
     , 'D1 10, 'D1 10 ]
We can also write an Inception network, which can be embedded into a Network as a Layer easily.
Grenade does a lot of things right. It’s very expressive in my view, and makes the construction of networks incredibly safe: they essentially can’t crash, and they’ll always be sound. It’s also pretty quick, using BLAS and some hand written C, and I’ve found this acceptable for a lot of problems.
The biggest downside, however, is related to speed: Grenade is CPU only and, at this stage, does not use minibatching. I don’t think this is a big deal for a lot of architectures, but for some (LSTM) it does come at a cost.
The next big downside is related to optimisation algorithms. At the moment, I just use stochastic gradient descent with momentum, and this is baked into the structure of each layer a bit too much. It’s totally reasonable to ask for a neural network library to permit the ADAM optimiser, or something else.
There are actually Haskell versions of a lot of these algorithms already, but they require a Functor constraint on the types, which I don’t think I can do while maintaining the performance characteristics of the library.
I think there is a decent solution, where the Gradient type becomes the weights for a Layer, with a dictionary providing just enough operations on it for the optimiser to be able to decide what extra information it wants to hold onto (frequency of updates, last update… etc).