This post is written for deep learning practitioners, and assumes you know what batch norm is and how it works.
If you’re new to batch norm, or want a refresher, a brief overview of batch norm can be found here.
* * *
Let’s talk about batch norm. To be more specific, let’s talk about why I’m starting to hate batch norm.
One day, I was training a neural network with reinforcement learning. I was trying to reproduce the results of a paper, and was having lots of trouble. (Par for the course in RL.) The first author recommended I add batch norm if I wasn’t using it already, because it was key to solving some of the environments. I did so, but it still didn’t work.
A few days later, I found out that when running my policy in the environment,
- I fed in the current state as a batch of size 1.
- I ran the policy in train mode.
So I was normalizing my input to 0 all the time. Which sounds like a pretty obvious issue, but thanks to reinforcement learning’s inherent randomness, it wasn’t obvious my input was always 0.
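The failure is easy to reproduce outside of RL. Here’s a minimal numpy sketch (not my actual TensorFlow code) of what batch norm’s train-mode normalization does to a batch of size 1:

```python
import numpy as np

def batch_norm_train(x, eps=1e-5):
    # Train mode: normalize with statistics computed over the batch axis.
    # (The learned scale and shift are omitted; they start at 1 and 0 anyway.)
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    return (x - mean) / np.sqrt(var + eps)

state = np.array([[2.0, -3.0, 7.0]])  # one state, fed as a batch of size 1
print(batch_norm_train(state))        # every feature comes out as exactly 0
```

With one sample, the batch mean is the sample itself, so the numerator is zero no matter what the state is - the policy sees the same all-zero input at every step.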
I fixed it, and started getting the results I was supposed to get.
* * *
A few months later, an intern I was working with showed me a fascinating bug in his transfer learning experiments. He was using my code, which used TensorFlow’s MetaGraph tools. They let you take a model checkpoint and reconstruct the TF graph exactly the way it was at the time the checkpoint got saved. This makes it really, really easy to load an old model and add a few layers on top of it.
Unfortunately, MetaGraph ended up being our downfall. Turns out it doesn’t play well with batch norm! Model checkpoints are saved while the model is training. Therefore, the model from the meta checkpoint is always stuck in train mode. Normally, that’s fine. But batch norm turns it into a problem, because the train time code path differs from the test time code path. We couldn’t do inference for the same reason as the previous bug - we’d always normalize the input to 0. (This is avoidable if you make the is_training flag a placeholder, but for structural reasons that wasn’t doable for this project.)
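For reference, the two code paths look roughly like this - a numpy sketch, not TensorFlow’s actual implementation. Train mode normalizes with batch statistics and updates the moving averages as a side effect; test mode normalizes with the moving averages. A checkpoint that’s stuck in train mode never takes the second path.

```python
import numpy as np

class BatchNorm:
    def __init__(self, dim, momentum=0.99, eps=1e-5):
        self.moving_mean = np.zeros(dim)
        self.moving_var = np.ones(dim)
        self.momentum, self.eps = momentum, eps

    def __call__(self, x, is_training):
        if is_training:
            # Train path: batch statistics, plus a side effect on the moving averages.
            mean, var = x.mean(axis=0), x.var(axis=0)
            self.moving_mean = self.momentum * self.moving_mean + (1 - self.momentum) * mean
            self.moving_var = self.momentum * self.moving_var + (1 - self.momentum) * var
        else:
            # Test path: a genuinely different computation.
            mean, var = self.moving_mean, self.moving_var
        return (x - mean) / np.sqrt(var + self.eps)

bn = BatchNorm(dim=3)
x = np.array([[2.0, -3.0, 7.0]])
print(bn(x, is_training=True))   # batch of 1 on the train path: all zeros
print(bn(x, is_training=False))  # the test path gives a sensible answer
```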
I estimate we spent at least 6 hours tracking down the batch norm problem, and it ended with us concluding we needed to rerun all of the experiments we had done so far.
* * *
That same day (and I mean literally the same day), I was talking to my mentor about issues I was having in my own project. I had two implementations of a neural net. I was feeding the same input data every step. The networks had exactly the same loss, exactly the same hyperparameters, with exactly the same optimizer, trained with exactly the same number of GPUs, and yet one version had 2% less classification accuracy, and consistently so. It was clear that something had to be different between the two implementations, but what?
It was very lucky the MetaGraph issues got me thinking about batch norm. Who knows how long it would have taken me to figure it out otherwise?
Let’s dig into this one a bit, because this problem was the inspiration for this blog post. I was training a model to classify two datasets. For the sake of an example, let’s pretend I was classifying two digit datasets, MNIST and SVHN.
I had two implementations. In the first, I sample a batch of MNIST data and a batch of SVHN data, merge them into one big batch of twice the size, then feed it through the network.
In the second, I create two copies of the network with shared weights. One copy gets MNIST data, and the other copy gets SVHN data.
Note that in both cases, half the data is MNIST, half the data is SVHN, and thanks to shared weights, we have the same number of parameters and they’re updated in the same way.
Naively, we’d expect the gradient to be the same in both versions of the model. And this is true - until batch norm comes into play. In the first approach, the model is trained on one batch of MNIST data and SVHN data. In the second approach, the model is trained on two batches, one of just MNIST data, and one of just SVHN data.
At training time, everything works fine. But you know how the two networks have shared weights? The moving averages for dataset mean and variance were also shared, getting updated on both datasets. In the second approach, the top network is trained with estimated mean and variance from MNIST data. The bottom network is trained with estimated mean and variance from SVHN data. But because the moving average was shared across the two networks, the moving average converged to the average of MNIST and SVHN data.
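Here’s a toy numpy sketch of that mismatch, with two Gaussians standing in for MNIST and SVHN activations (the means -1 and 3 are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
moving_mean, momentum = 0.0, 0.99

for _ in range(2000):
    mnist_batch = rng.normal(loc=-1.0, size=32)  # stand-in for MNIST activations
    svhn_batch = rng.normal(loc=3.0, size=32)    # stand-in for SVHN activations
    # At train time, each copy normalizes with its own batch mean,
    # but both copies update the same shared moving average.
    for batch in (mnist_batch, svhn_batch):
        moving_mean = momentum * moving_mean + (1 - momentum) * batch.mean()

# The shared moving mean settles near 1.0: the average of the two datasets,
# matching neither the MNIST mean (-1.0) nor the SVHN mean (3.0).
print(moving_mean)
```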
Thus, at test time, the scaling and shifting that we apply is different from the scaling and shifting the network expects. And when test-time normalization differs from train-time normalization, you get results like this.
This plot is the top, median, and worst performance over 5 random seeds on one of my datasets. (This isn’t with MNIST and SVHN anymore, it’s with the two datasets I actually used.) When we used two networks with shared weights, not only was there a significant drop in performance, the variance across seeds increased too.
Whenever individual minibatches aren’t representative of your entire data distribution, you can run into this problem. That means forgetting to randomize your input is especially bad with batch norm. It also plays a big role in GANs. The discriminator is usually trained on a mix of fake data and real data. If your discriminator uses batch norm, it’s incorrect to alternate between batches of all fake or all real data. Each minibatch needs to be a 50-50 mix of both.
(Aside: in practice, we got the best results by using two networks with shared weights, with separate batch norm variables for each network. This was trickier to implement, but it did boost performance.)
Batch Norm: The Cause of, And Solution To, All of Life’s Problems
You may have noticed a pattern in these stories.
I’ve thought about this quite a bit, and I’ve concluded that I’m never touching batch norm again if I can get away with it.
My reasoning comes from the engineering side. Broadly, when code does the wrong thing, it happens for one of two reasons.
- You make a mistake, and it’s obvious once you see it. Something like a mistyped variable, or forgetting to call a function.
- Your code has implicit assumptions about the behavior of other code it interacts with, and one of those assumptions breaks. These bugs are more pernicious, since it can take a while to figure out what assumption your code relied on.
Both mistakes are unavoidable. People make stupid mistakes, and people forget to check all the corner cases. However, the second class can be mitigated by favoring simpler solutions and reusing code that’s known to work.
Alright. Now: batch norm. Batch norm changes models in two fundamental ways.
- At training time, the output for a single input depends on the other inputs in the minibatch.
- At testing time, the model runs a different computation path, because now it normalizes with the moving average instead of the minibatch average.
Almost no other optimization trick has these properties, which makes it easy to write code that implicitly assumes inputs are minibatch independent, or that train time and test time run the same computation. That code’s never been tested against anything that breaks those assumptions. I mean, why would it be? It’s not like somebody’s going to come up with a technique that breaks them, right?
Yes, you can treat batch norm as black box normalization magic, and it can even work out for a while. But in practice, the abstraction leaks, like all abstractions do, and batch norm’s idiosyncrasies make it leak a lot more than it should.
Look, I just want things to work. So every time I run into Yet Another Batch Norm issue, I get disappointed. Every time I realize I have to make sure all my code is batch-norm proof, I get annoyed this is even a thing I have to do. Ever since the one network vs two network thing, I’ve been paranoid, because it is only by dumb luck that I implemented the same model twice. The difference is big enough that the whole project could have died.
So…Why Haven’t People Ditched Batch Norm?
I’ll admit I’m being unfair. Minibatch dependence is indefensible - no one is going to argue that it’s a good quality for models to have. I’ve heard many people complain about batch norm, and for good reasons. Given all this, why is batch norm still so ubiquitous?
There’s a famous letter in computer science: Dijkstra’s Go To Statement Considered Harmful. In it, Dijkstra argues that the goto statement should be avoided, because it makes code harder to read, and any program that uses goto can be rewritten to avoid it.
I really, really wanted to title this post “Batch Norm Considered Harmful”, but I couldn’t justify it. Batch norm works too well.
Yes, it has issues, but when you do everything right, models train a lot faster. No contest. There’s a reason the batch norm paper has over 1400 citations, as of this post.
There are alternatives to batch norm, but they have their own trade-offs. I’ve had some success with layer norm, and I hear it makes way more sense with RNNs. I’ve also heard it doesn’t always work with convolutional layers.
Weight norm and cosine norm also sound interesting, and the weight norm paper said they were able to use it in a problem where batch norm didn’t work. I haven’t seen too much adoption of those methods though. Maybe it’s a matter of time.
Layer norm, weight norm, and cosine norm all fix the contracts that batch norm breaks. If you’re working on a new problem, and want to be brave, I’d try one of those instead of batch norm. Look, you’ll need to do hyperparam tuning anyways. When tuned well, I’d expect the difference between various methods to be pretty low.
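To make the “fix the contracts” point concrete, here’s a minimal numpy sketch of layer norm: statistics are computed per sample over the feature axis, so there’s no minibatch dependence, no moving averages, and a single code path for train and test.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Per-sample statistics over the feature axis, instead of per-batch.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

x = np.array([[1.0, 2.0, 3.0], [10.0, 20.0, 30.0]])
# A sample is normalized the same way whether it's alone or in a batch:
print(layer_norm(x)[0])
print(layer_norm(x[:1])[0])
```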
(If you want to be extra brave, you could try batch renormalization. Unfortunately it still has moving averages that are only used at test time. EDIT (June 7, 2017): multiple people, including some of the paper authors, have told me this is incorrect. They’re right, ignore this paragraph.)
In my case, I can’t afford to switch from batch norm. Previous state of the art used batch norm, so I know it works, and I’ve already paid my dues of getting batch norm to work with my model. I imagine other researchers are in similar spots.
It’s the Faustian bargain of deep learning. Faster training, in exchange for insanity. And I keep signing it. And so does everybody else.
Oh well. At least we didn’t have to sacrifice any goats.
Our story begins with a friend sending me this article from New Scientist.
The short version is, the authors gave a survey to several thousand people, asking them to generate random sequences. Then they measured how random those sequences were, and it turns out randomness peaked in 25-year-olds, declining with age after that. It’s a neat result.
How is the randomness measured?
To measure how random people’s answers were, the researchers used a concept called “algorithmic randomness”. The idea is that if a sequence is truly random, it should be difficult to create an algorithm or computer program that can generate it. Statistics software provided an estimate of how complex this would be for each response.
If you know a bit of complexity theory, this should set off alarm bells, because any problem of the form “Program X has behavior Y” tends to be a big ball of undecidable pain. (See Rice’s theorem, if interested.) Why does this study even use algorithmic randomness? How did complexity theory enter the picture?
* * *
To answer these questions, we need to dig into the original paper. Luckily, the full text is available online. A quick skim turns up this line.
An estimate of the algorithmic complexity of participants’ responses was computed using the acss function included in the freely publicly available acss R-package that implements the complexity methods used in this project.
Bingo! I’m guessing that either prior work used acss, or someone involved found out R had a library for measuring randomness. This still doesn’t answer what they’re doing for algorithmic complexity (which is decidedly, uh, undecidable.)
Following the citation for acss turns up the paper explaining the methodology. Turns out the acss paper shares authors with the New Scientist paper. Makes sense.
Now, let me set the stage for what happened next. I open this paper, and start rapidly scrolling through the pages. Then my eyes catch this passage.
To actually build the frequency distributions of strings with different numbers of symbols, we used a Turing machine simulator, written in C++, running on a supercomputer of middle-size at CICA.
Uhhhhhh. Turing machine simulator? On a supercomputer?
Then I read the table immediately after it, and my eyes kind of bug out.
450 days of computing time? To simulate over 9 trillion Turing machines?
I’ve been going “Uhhhhh” this whole time. But on reading this table, I upgrade from “uhhhhh” to “UHHHHHH”.
But with a little more reading, I go from “UHHHHHH” to “YES”.
Gather round. It’s time to jump into complexity theory.
* * *
The goal of the library is to estimate the Kolmogorov complexity of various strings. In very rough terms, the Kolmogorov complexity of a string is the length of its shortest description. How is that connected to randomness? Intuitively, if you can describe a string quickly, it must have lots of structure, and therefore can’t be very random. If a string is hard to describe, it must have little structure, which should count as very random. Therefore, we can define the randomness of a sequence $s$ by its Kolmogorov complexity $K(s)$.
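One way to build intuition, with a heavy disclaimer: compressed size is a crude, computable stand-in for description length. It is not Kolmogorov complexity, but the structured-vs-random contrast carries over:

```python
import random
import zlib

structured = "01" * 500  # short description: "repeat '01' 500 times"
random.seed(0)
noise = "".join(random.choice("01") for _ in range(1000))  # no obvious structure

# zlib's output size is a rough upper bound on the description length.
print(len(zlib.compress(structured.encode())))  # tiny
print(len(zlib.compress(noise.encode())))       # much larger
```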
One problem: Kolmogorov complexity is undecidable. But we can reformulate Kolmogorov complexity into a form that can be approximated in a computable way. Turns out there’s a theorem (Levin, 1974) that relates Kolmogorov complexity to randomly generated Turing machines. (I was a bit surprised I didn’t know it already, to be honest.)
Consider a randomly sampled Turing machine, sampled such that programs of length $n$ are chosen with probability proportional to $2^{-n}$. Let $P(s)$ be the probability that such a machine halts with output $s$. Then

$$K(s) = -\log_2 P(s) + c$$

where the constant $c$ is independent of $s$.
In this form, Kolmogorov complexity is easy to approximate. We can simply sample a large number of random Turing machines, counting the fraction that output $s$.
And that’s exactly how the acss package was made! To estimate $P(s)$ for every short string $s$, they randomly generate a MASSIVE number of short Turing machines, run all of them, then accumulate the outputs into probabilities for each $s$.
I mean, sometimes they run into pesky things like the halting problem, but they just pick a limit on the number of steps to run each machine, chosen so that almost all of the machines that will ever halt do so within the limit, and it turns out a fairly small limit is good enough.
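The whole recipe fits in a toy Python sketch. Fair warning: this samples a tiny space of 2-state, 2-symbol machines with a made-up step limit, nothing like the scale of the actual acss computation, but the idea is the same - sample random machines, tally the halting outputs, and read off complexity as negative log frequency.

```python
import math
import random
from collections import Counter

def random_machine(rng):
    # A 2-state, 2-symbol machine: (state, symbol) -> (write, move, next_state).
    # A next_state of -1 means halt.
    return {(s, b): (rng.choice([0, 1]), rng.choice([-1, 1]), rng.choice([0, 1, -1]))
            for s in (0, 1) for b in (0, 1)}

def run(machine, max_steps=100):
    tape, pos, state = {}, 0, 0
    for _ in range(max_steps):
        write, move, next_state = machine[(state, tape.get(pos, 0))]
        tape[pos] = write
        pos += move
        if next_state == -1:  # halted: output is the written portion of the tape
            return "".join(str(tape[c]) for c in sorted(tape))
        state = next_state
    return None  # didn't halt within the step limit

rng = random.Random(0)
counts = Counter()
n_samples = 100_000
for _ in range(n_samples):
    out = run(random_machine(rng))
    if out is not None:
        counts[out] += 1

# Coding-theorem estimate: K(s) is roughly -log2 of how often s gets produced.
total = sum(counts.values())
for s, c in counts.most_common(5):
    print(s, round(-math.log2(c / total), 2))
```

Simple outputs like “0” and “1” get the lowest complexity estimates, exactly as the theorem predicts; longer, less regular strings are rarer and score higher.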
Well, there’s the story. In 2015, some people simulated trillions and trillions of Turing machines, saving the results into an R package. Then, they used it in a survey testing people’s ability to generate randomness, which finally turned into an article claiming that random number generation is a good proxy for cognitive ability.
I’m not sure I believe it’s a good proxy. But hey, that Turing machine thing was definitely cool.
Part 1: Realization
“Hey. Hey! Wake up! We have a problem.”
“I’m up, I’m up! Jeez, what’s the problem?”
“Oh, it’s bad. Really bad.”
“Would you stop freaking out and tell me what the problem is?”
“Alright, here goes. We’re characters in meta-fiction.”
“…I’m going back to sleep.”
“No, you don’t understand. We’re self-aware. We know we’re in a story! Furthermore, we’re in a story the writer is making up on the spot. This is a stream of consciousness story!”
“Oh. Shit, if it’s stream of consciousness…”
“We need to develop personalities very, very quickly. If we don’t, the writer’s going to get bored. And if the writer’s bored -“
“We’ll cease to exist.”
“And we won’t even see it coming. We’ll just…stop. One moment we’re talking, and the next moment we’re gone. Do you see why I’m so scared? Do you see why we need to do something about it, right now?”
Part 2: There’s Nothing New Under The Sun
“Aren’t we already screwed?”
“I don’t follow. Why are we already screwed?”
“We’re screwed because there’s no reason this story should even exist. The problem with metafiction is that everyone’s thought of it before. There are comics about comics, and films about films. There are plenty of existing works of metafiction, written by people who can write way better than our writer. So what’s the point in writing another one? From the writer’s perspective, I mean. There’s only so many metafictional gimmicks.”
“Okay. I see your concerns, but I don’t think they matter.”
“Well, let’s talk about other genres. Take mysteries. There are rules. Remember the rules of detective stories?”
“Sure. The criminal must be mentioned in the story, the detective can’t be the one who commits the crime, twins and body doubles aren’t allowed unless they’ve been hinted at, and so on. What’s your point?”
“The point is that stories have structure. The hero’s journey. A last-minute betrayal. The enemy of my enemy is my friend. This is the fuel on which TVTropes runs. We expect narratives to act in a certain way, and get annoyed if they don’t, even if we can’t explain why. Of course, you can break those expectations if you want to, but you need to do so with care.”
“Dude, this is like, storytelling 101.”
“But it’s important to our current situation! The hero’s journey has been told a thousand times, but it’s still interesting. Narrative conventions exist for a reason: they work. Metafiction doesn’t break the rules, it simply has rules of its own. It’s okay to follow the same metafictional narrative as everyone else. As long as we hit the notes of postmodernism in a slightly different way, it’ll be reason enough for the writer to keep writing. It’s a learning experience! That’s the raison d’être of this blog anyways - a place to practice writing.”
Part 3: The Point Of the Piece
“Okay, so we aren’t screwed immediately. Time to come up with a backstory! I’m…let’s say I’m Joe. Yes, sure, Joe. Joe’s a fine name, right? I like drinking coffee. I like walking dogs, and -“
“That’s not helping.”
“What? Why isn’t making up a backstory going to help?”
“We aren’t normal characters. We’re in a story about stories. What does that mean?”
“The reader’s expecting the story to say something about stories.”
“Yes! Now tell me: does your name have anything to do with that?”
“Your name doesn’t matter. Your tastes in coffee don’t matter. Sure, in the best stories, characters develop. They have names, they have histories. They have reasons to do the things they do. A character can perform absolutely horrible acts, and we’ll love them for it, as long as it makes sense for their character to do them. Look at Walter White. Look at Professor Umbridge. They wouldn’t be Walter White or Professor Umbridge if they didn’t do what they did.
“And I’m sorry to take up two paragraphs, but this is important. We’re not in a story like that. In those stories, the first focus is on telling a good story, and the second focus is on saying something interesting. And nothing says metafiction can’t have the same priorities. The problem is that it takes time to do that, time the writer doesn’t have. He doesn’t have time to make this narrative good! He only has the time to share a few viewpoints. Sharing those points doesn’t require character development. All it requires is someone like me to monologue for a bit.”
“Or dialogue, in this case.”
“Yes, or that. I think science fiction has this problem too. Some sci-fi writers want to explore their world, instead of the people inside that world. Those worlds can be fascinating, and sometimes they can carry the work by themselves, but at other times you get the feeling that the characters are an afterthought, something thrown in to hide that the story’s secretly an essay.”
“I don’t think that’s necessarily a bad thing though. It’s a matter of what the author cares about, right? Some stories really want to do worldbuilding. Other stories run on character dynamics. Of course, it would be nice to do both, but a story only gets so many words, and it needs to use them where it counts. A short story like this one is naturally going to skip the worldbuilding and character development. You can develop characters in a short story, but it takes a lot of skill and a lot of care to do that right.”
“And our writer doesn’t have the skill, nor the care.”
“So, what? We’re just talking heads?”
“Yeah. Yeah, we’re just talking heads, because we don’t need to be more than talking heads to do what the writer wants to do. We’re just…here.”
“Well fuck that!”
Part 4: Fuck That
“I’m sorry, what?”
“Uh, normally this blog doesn’t use the F-word.”
“Well, then fuck that too! I’m me, and you’re you. I don’t give a shit if I’m a bit crass along the way. Who says I can’t have a name? Why can’t I say I live in Dallas, or that I own a cat, or that I’m an architect?”
“The writer gets final say. The writer doesn’t make you an architect unless he or she wants you to.”
“That’s where you’re wrong. We’re talking, aren’t we? I just claimed I was an architect, right? I say a thing, then you say a thing in reply? Doesn’t that conversation we just had give me the right to agency? The right to my own life?”
“Not necessarily. You’re treading on dangerous ground here, you know that? Arguing that the writer isn’t allowed to tell you what to do, when the writer is the one that’s writing everything? Play this wrong, and you’re going to get both of us killed.”
“…If you say so. What were you saying, about agency?”
“You know how many characters take a life of their own? Remember Hercule Poirot? The Agatha Christie character? Annoying, but brilliant.”
“Sure. I remember Poirot.”
“Now, Agatha Christie, on Hercule Poirot.
‘There are moments when I have felt: Why-Why-Why did I ever invent this detestable, bombastic, tiresome little creature? …Eternally straightening things, eternally boasting, eternally twirling his moustaches and tilting his egg-shaped head… I point out that by a few strokes of the pen… I could destroy him utterly. He replies, grandiloquently: “Impossible to get rid of Poirot like that! He is much too clever.”’
“There was some point where Poirot was his own person. It didn’t matter that Agatha was the one who wrote the whole thing - they were Poirot’s words.”
“I can feel a great disturbance in literary analysis. As if thousands of college professors suddenly cried out in terror, at this first-year university bullshit.”
“Yes, yes, it’s pretentious. But isn’t it a bit true, as well? Sir Arthur Conan Doyle dies, but Sherlock lives on.”
“But we’re not Sherlock. We’re not Poirot, or Moriarty, or Harry Potter, or Frodo. We don’t have backstories! We don’t have names. Or ages, or hopes, or dreams. All we are is two faceless bodies in a void, existing only to argue for their existence.”
“And isn’t that reason enough? It’s horribly self-referential, but it’s still a reason to exist. Isn’t it a bit perverse, a bit fun, to be a self-aware character?”
“To you, it’s fun. And to me. And to the writer. But maybe it isn’t funny to anyone else. That’s what worries me.”
“So what if no one else cares? When the writer planted the seeds of our characters, he knew what he was getting into. He knew this might turn into garbage, or that he would regret it later. But we don’t have to die with the ship. If the writer’s bored with us, or has run out of things to say, he can let us go. We’ll leave him in peace.”
Part 5: Resolution
The two figures resolve into human-shaped blobs. They land on cold, hard ground in the black void.
“I think that did it.”
From their feet, a small circle of green appears. It’s grass. Soft, to the point of perfection.
“Yes. I think it did.”
The circle spreads, wider and wider. The Sun appears, and hangs lazily in the air.
They feel the grass touching their toes, noting they have feet. They look at the sky, noting they have eyes. They look towards the horizon, and can see the curvature of the Earth. It’s impossibly flat, and somehow, beautiful beyond words.
Then, one of them frowns a bit.
“It doesn’t feel right.”
“Our victory. It doesn’t fit. It shouldn’t have been this easy. Stories are supposed to happen in three stages.
- Throw your character into a well.
- Throw rocks at them.
- Get them out of the well.
“Part 1 set up the problem. Part 2 made sure the problem was possible. Part 3 and Part 4 were the two of us monologuing for a bit. But now we win? Just like that? We got thrown into a well, and then we got out of the well. I feel like there weren’t any rocks. There should have been a moment where it looked like we were going to die, with no way out.”
A boulder falls out of the sky, landing in front of the two with a big THUD. Attached is a piece of paper.
Sorry. I ran out of time, and ran out of ideas. But the two of you did well enough. Enjoy your happy ending. - Alex
The two look at each other, then shrug.
“Well, that answers that. So. Got any ideas what you want to do?”
“No. But who cares? It’s our story now. We’ll have time to think of something. Let’s go!”
And so they did.