Emergency Post Because OpenAI Just Revealed Their Dota 2 1v1 Bot And I Have Opinions
See title!
The International is the tournament for Dota 2, and OpenAI just demoed their bot that went undefeated against several pro players in 1v1.
Is this awesome? Yes.
Does this represent a step forward in what machine learning can do? Maybe. That’s where it gets more complicated.
* * *
Let’s start with a broader question: why do machine learning researchers make game AIs?
I claim they do so for two reasons.
- To prove it’s possible for computers to beat humans at the game.
- To use the game as an environment for developing new machine learning techniques.
Now, there’s a healthy intersection between gamers and ML researchers, so these objectives certainly aren’t mutually exclusive. I bring it up because a number of game AIs are very tailored to the game they’re trying to solve. To judge whether a game AI is impressive or not, you need to know details behind how it works.
For example, there’s a Pokemon speedrun bot that tries to complete Pokemon Red as quickly as possible. It’s achieved record times, but it didn’t learn how to do it. Rather, a human designed a flowchart of what the bot should do, and it follows it until it achieves success.
Similarly, a while back there was some buzz about a computer that beat an expert fighter pilot at an aerial combat simulator. It rather predictably spawned many alarmist headlines. Read the details, however, and you find it’s mostly a rule-based AI (if this then that), learned through genetic algorithms, with some expert knowledge added to the training process. Still cool, but a lot of domain specific knowledge went into the problem.
Neither of these are super impressive from the standpoint of seeing what machine learning is capable of, because they both rely on a human providing specific domain knowledge.
In contrast, AlphaGo was impressive because it both solved a new game, and did so by combining supervised pretraining + reinforcement learning + Monte Carlo Tree Search in a way that hadn’t been done before. Generally, the best games for research are ones where developing new techniques is necessary to achieve superhuman performance, and Go was an example of this.
Now, DeepMind cares about StarCraft, because real-time is hard, partial information is hard, the wide variety of units means you have lots to learn, and most importantly the long-term strategy and planning is something that RL systems tend to lack.
Dota 2 fills a similar niche - also real-time, also partial information, also has tons of heroes, and also requires long-term planning to play well. With current approaches, we can’t solve StarCraft, and we can’t solve Dota 2. We need new ideas to teach computers to play these games at a competent level.
That’s why people care.
* * *
So far, OpenAI has not released many details behind the bot. Speculating on what the bot means for the field would be massively irresponsible, until more info comes out to provide context.
That being said…speculating how the bot might work is also massively fun. So let’s get into it!
First of all, the bot most likely only knows how to play 1v1, Shadow Fiend vs Shadow Fiend. Rule of thumb: for demos and for papers, assume that if something isn’t mentioned or shown, it isn’t proven to work yet. This is not a strike against the work! There’s no reason to make the problem more difficult if you aren’t even sure the problem is solvable.
Next, from the header:
The bot learned the game from scratch by self-play, and does not use imitation learning or tree search.
So most likely, it’s reinforcement learning. Restricting the game to only 1v1 Shadow Fiend gives you a ton of nice things.
- The model only needs to understand how a single character in the game works.
- The same policy can be used for both players.
In many ways, this reminds me of the Super Smash Brothers Melee bot that beat several pro players - it can beat pros, but only if both players are Captain Falcon, and only if the players are P1 and P2, and only if they play on Battlefield. Cool? Yes. Limited? Also yes.
The existence of that SSBM bot makes me suspect there weren’t many breakthroughs behind this bot, just some grunt work on getting the system together and getting an off-the-shelf RL algorithm to train. I don’t play Dota (used to play League), but from what I’ve heard 1v1 laning is much more about character micro. I can see RL learning to deal with that - it’s the macro level play that I expect to be hard.
Assuming it’s an RL approach, there are several follow-up questions.
- How is game state getting represented?
- How are game actions getting represented?
- How often does the AI output actions?
- How is reward defined?
- What reinforcement learning algorithm is used?
- How long does it take to train?
The impressiveness of the achievement depends on the answers to these questions. Once again, we don’t know details yet. But I’m going to throw out some guesses. Hopefully they aren’t completely wrong; I’ll update when more information comes out.
- State representation: I assume OpenAI is collaborating with Valve on this. Valve has an official Dota 2 API, and I assume they’re using it, because why wouldn’t you? I mean, you could try to learn from raw pixels, but that would be ridiculously impressive, almost too much so. It’s much more plausible that it’s getting exact positions and health of all units within its vision, which certainly helps on the micro front. Human players have all that info too, but humans don’t have the attentional bandwidth to know everything down to the precision that an API could give you.
- Action representation: I would guess raw (x, y) locations for mouse clicks, the skill hotkeys, and special actions for clicking on units. In theory you only need mouse clicks, but if you’re already getting information from the game API, you might as well use the API for targeting functionality too. That way your model doesn’t have to learn how to click on units, and can focus on learning when clicking units is good and when clicking units is bad.
- Action frequency: No idea. They say that their APM is comparable to an average human player, which gives some indication of how often the network can send commands. I’m mostly curious about whether there are limits on APM, or if the neural net is slow enough to give human-level APM automatically.
This is one way real time problems can be more difficult - in the latency between polling for an action, computing an action, and sending it back to the game, the environment could be in a very different state. On the other hand, humans have the same problem - we have to process the game visuals, decide on an action, then send the command to our hands. Clearly we can deal with it fine, so maybe I’m overestimating how important this is.
Aside: in StarCraft, you can make much better usage of APM because you have an entire army to command. Let a StarCraft bot do thousands of actions per minute and you can get ridiculous things like this. In Dota 2, because there’s just one controllable character, there’s a cap on how much APM can improve your gameplay. This is mentioned in the blog post too, but I want to emphasize it. Whenever people post StarCraft bots, people invariably say they’ll wait for human level APM. In Dota 2’s case, I think it actually won’t matter that much - what matters is precision of movement and knowledge of health bars down to a single hit point. That’s where computers have the advantage.
- Reward definition: Here’s where I get really curious. I assume the reward isn’t just “1 if you win, -1 if you lose.” If it is, that would be insane. I mean, it’s not out of the question - I think that barely kisses the realm of possibility. But I’d expect that the reward is shaped, giving partial reward based on EXP, gold, or health. (In RL, we say a reward is shaped if it gives small reward for doing things you expect to be useful. Reward shaping is one of the big ways to add domain knowledge to your problem. A sketch of what this could look like is after this list.)
Generally, shaped rewards are easier to learn from than the pure “win/loss” signal, but only if the shaping is done well, and tuning the reward can be a huge pain. I’d actually rather have it be the win/loss reward, because if it was, it would be even cooler that the bot can learn so many complex behaviors from purely win/loss.
It could even be using the system from the Learning from Human Preferences paper - I could see it being useful for giving partial signal to behaviors that are hard to define.
- Learning algorithm: Probably PPO. It works well, and it supports continuous control, which may be a better fit for mouse movement.
- Training time: “Two weeks real time”, except that says basically nothing for how many machines they used, whether they were able to run the game faster than real time, etc. For context, DeepMind says they got up to 2000 frames per second in a single thread for their StarCraft API. Let’s just say it took “a while” and leave it at that.
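To make the reward shaping point concrete, here is a minimal sketch of what a shaped 1v1 reward could look like. This is pure speculation on my part - OpenAI hasn’t said how their reward is defined, and every field name and coefficient below is made up for illustration.

```python
def shaped_reward(prev_state, state, won=None):
    """Hypothetical shaped reward for a 1v1 laning bot.

    Gives a large terminal reward for winning or losing, plus small
    per-step rewards for gold, EXP, and health differences. All field
    names and coefficients are invented for this sketch.
    """
    if won is not None:                      # terminal step: the true objective
        return 1.0 if won else -1.0

    reward = 0.0
    reward += 0.001 * (state["gold"] - prev_state["gold"])
    reward += 0.001 * (state["exp"] - prev_state["exp"])
    reward += 0.01 * (state["hp_fraction"] - prev_state["hp_fraction"])
    reward -= 0.01 * (state["enemy_hp_fraction"] - prev_state["enemy_hp_fraction"])
    return reward
```

Tuning those coefficients is exactly the pain mentioned above - put too much weight on the shaped terms and the bot optimizes for farming instead of winning.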
Read-through: Hyperparameter Optimization: A Spectral Approach
Similar to Wasserstein GAN, this is another theory-motivated paper with neat applications to deep learning. Once again, if you are looking for proof details, you are better off reading the original paper. The goal of this post is to give background and motivation.
Why Is This Paper Important?
Hyperparam optimization is a weird problem because everybody has to do it and nobody really likes it. The common options are either grid search, random search, or line search if your parameter is continuous. A mix of those does a pretty good job for most ML algorithms, even deep neural nets, which are a bit notorious for having many hyperparams.
However, I still think it’s worth keeping an eye on hyperparam optimization, for a couple reasons.
- If hyperparam optimization is better, more aspects of neural net design can be described as a hyperparam and automated. For example, the number of layers can be an explicit hyperparam, instead of something picked by hand. Right now, people don’t do this because it’s not worth doing for most problems, but the trend in research is that everybody pushes at the limits of their tools.
- Preliminary results like Neural Architecture Search suggest that there is learnable structure in neural net hyperparameters. It’s not yet applicable to anything, but I could see meta-learned models playing a bigger role in the future. At the same time, maybe we don’t need to throw a neural net at everything to get state-of-the-art results. (I like doing that as much as the next person, but machine learning doesn’t always have to mean deep learning.)
- There are simply a ton of interesting research questions that appear when each data point takes hours or days to collect. Similar issues show up in robotics learning - robots are expensive and often need human supervision, which makes it hard to collect the ImageNet-scale datasets that neural nets are so good at.
- Better hyperparam optimization makes ML more accessible to others, because it reduces the amount of esoteric knowledge needed to get models to work. That’s good because people shouldn’t need to know that esoteric knowledge to solve their problems. It’s bad for me because weird deep learning expertise is one of the few things I’m good at, and I’m not sure I like knowing it could be obsolete down the line. All the more reason to pay attention.
Some Quick Background on Hyperparam Optimization
For two approachable introductions to the subject, I highly recommend two blog posts from Ben Recht’s blog, linked here and here. The key takeaways are
- Random search is a simple, deceptively strong baseline.
- Bayesian optimization can outperform random search by a bit.
- However, if you give random search twice the time, it beats a lot of Bayesian optimization methods. Yes, it’s more compute time, but sometimes buying more machines is easier than integrating clever ideas into your software stack.
- Successive halving (SH) is a nice extension of random search that lets you run many more hyperparams in the same amount of time, by pruning poor hyperparams early on. You have to be a little careful with how aggressively you prune, but in practice it does pretty well.
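Since successive halving comes up again later as Harmonica’s base optimizer, here is a minimal sketch of the idea, assuming a stand-in `train_and_eval(config, budget)` function (not a real API) that returns validation error after training a config for a given compute budget.

```python
def successive_halving(configs, train_and_eval, min_budget=1, eta=2):
    """Minimal successive halving sketch.

    Each round, train every surviving config for the current budget,
    keep the best 1/eta fraction, and multiply the budget by eta.
    Poor configs get pruned early, so far more configs fit into the
    same total compute as plain random search.
    """
    budget = min_budget
    survivors = list(configs)
    while len(survivors) > 1:
        scores = [(train_and_eval(c, budget), c) for c in survivors]
        scores.sort(key=lambda pair: pair[0])          # lower error is better
        survivors = [c for _, c in scores[:max(1, len(scores) // eta)]]
        budget *= eta
    return survivors[0]
```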
Problem Setting
The paper focuses on discrete hyperparameter optimization. Generally, discrete optimization is trickier, and continuous problems can be approximated by discretizing the space, so I think this is an okay choice to make.
The paper further assumes all hyperparams are binary variables with value \(-1\) or \(+1\). Note you can still do things like allocate \(3\) hyperparams for the \(2^3 = 8\) different learning rates you want to try.
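As a concrete (made-up) example of that encoding, three \(\pm 1\) hyperparams can index into a hand-picked list of eight learning rates:

```python
# Illustrative only: the candidate values are arbitrary.
LEARNING_RATES = [1e-1, 3e-2, 1e-2, 3e-3, 1e-3, 3e-4, 1e-4, 3e-5]

def decode_learning_rate(x1, x2, x3):
    """Each x_i is -1 or +1. Treat the three bits as a binary index."""
    bits = [(x + 1) // 2 for x in (x1, x2, x3)]   # map -1/+1 to 0/1
    index = 4 * bits[0] + 2 * bits[1] + bits[2]
    return LEARNING_RATES[index]
```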
Let’s say there are \(n\) hyperparams. Then given all these assumptions, we want to learn a function \(f: \{-1,+1\}^n \to \mathbb{R}\), where the input is the hyperparam setting, and the output is the performance of the trained model. In the upcoming sections, we’ll use two different notations.
- \(f(x)\), in which case \(x = (x_1,x_2,x_3,\ldots, x_n)\) is some \(n\)-bit vector.
- \(f(x_1,x_2,\ldots,x_n)\), in which case we interpret \(f\) as a polynomial of \(n\) variables, where the variables equal either \(-1\) or \(+1\).
Polynomial Learning as Compressed Sensing
This is the key guiding framework of the algorithm. Viewing \(f(x)\) as a polynomial, we learn a sparse, low-degree polynomial \(g(x)\) that approximates \(f\).
What’s the intuition for why you’d want to do this? Well, to be honest, I have no idea why you’d decide to look at compressed sensing, instead of the other methods out there. But I can explain why it makes sense to me, after the fact.
To do better than random search, two things must be true.
- Your problem must have learnable structure.
- Your hyperparam optimization method needs to discover and exploit that latent structure.
All of machine learning can be seen as fitting low-dimensional functions to inputs in high-dimensional space. This gets at why fitting a low-degree polynomial would make sense.
But then why include sparsity? Remember that we’re going to have very few data points, because each data point requires training a neural net to convergence. Sparsity helps with this by reducing the capacity of the polynomial fitting process - it stops it from putting a little bit of weight on thousands of different terms, which is one of the easiest ways to overfit. As a side effect, sparsity tends to make interpreting models easier.
(Aside: this is one of the common complaints about deep learning - it’s hard to interpret much out of the sum of 100 activations. But in practice, most deep learning problems have a lot of data, and enforcing sparsity just ends up hurting performance.)
When deriving the algorithm, we’ll assume that evaluating \(f(x)\) for a single \(x\) takes “a long time, but not too long”. By this, we mean functions run in \(\Omega(n^d)\) time for a somewhat large \(d\), but still run faster than things with exponential runtime.
The Parity Function Basis
To fit a low-degree polynomial, first note that the Boolean functions \(f: \{-1,+1\}^n \to \mathbb{R}\) form an inner product space. Define the inner product as
\[\langle f, g\rangle = \mathbb{E}_{x} [f(x)g(x)]\]where the expectation is the uniform distribution over \(n\)-bit vectors. We can verify this satisfies the inner product definitions.
- Symmetry:
\[\langle f, g\rangle = \mathbb{E}_{x} [f(x)g(x)] =\mathbb{E}_{x} [g(x)f(x)] = \langle g,f \rangle\]
- Linearity: for some constant \(a\),
\[\langle af, g \rangle = \mathbb{E}_{x} [af(x)g(x)] = a \mathbb{E}_x [f(x)g(x)] = a\langle f, g\rangle\]
For functions \(f_1, f_2\),
\[\langle f_1+f_2, g\rangle = \mathbb{E}_x[(f_1(x) + f_2(x))g(x)] = \mathbb{E}_x [f_1(x)g(x)] + \mathbb{E}_x [f_2(x)g(x)] = \langle f_1, g\rangle + \langle f_2, g\rangle\]
- Positive definiteness:
\[\langle f, f\rangle = \mathbb{E}_x[f(x)^2] \ge 0\]
For \(\langle f, f\rangle\) to be \(0\), note \(f(x)^2 \ge 0\), so the only way for \(\mathbb{E}_x[f(x)^2]\) to equal \(0\) is if \(f(x) = 0\) everywhere.
Next, we want to find an orthonormal basis for this inner product space. For these functions, there’s a well known example. Let \(S \subset [n]\) be a subset of indices \(1\) to \(n\). Define \(f_S(x)\) as
\[f_S(x) = \prod_{i \in S} x_i\]The functions \(\{f_S\}\) (known as the parity functions) form an orthonormal basis.
Why is this an orthonormal basis? A set \(\{f_1, f_2, \ldots, f_n\}\) is orthonormal if it satisfies two properties.
- For every \(f_i\), \(\langle f_i, f_i \rangle = 1\).
- For every \(f_i, f_j\) where \(i \neq j\), \(\langle f_i, f_j \rangle = 0\).
Now, we show the parity functions have this property.
In the upcoming sections, sometimes \(i\) means an index from \(1\) to \(n\), and sometimes \(i\) means a number from \(0\) to \(2^{n}-1\) that should be interpreted as \(n\) bits of binary. It should be inferable from context.
- By definition, \(\langle f_i, f_i \rangle = \mathbb{E}[f_i(x)^2]\). Note that \(f_i(x) = -1\) or \(f_i(x) = 1\), depending on how many matching bits there are between \(i\) and \(x\). In either case, \(f_i(x)^2 = 1\), so \(\langle f_i, f_i \rangle = 1\).
- For any two different subsets \(S_1, S_2\), there exists some \(i\) in one, but not the other. Without loss of generality, let \(i \in S_1\) and \(i \notin S_2\). We can write \(f_{S_1}(x)\) as
\[f_{S_1}(x) = x_i \prod_{j \in S_1 - \{i\}} x_j\]
Take the expectation over \(x\), and factor it into the conditional expectation on \(x_i\).
\[\langle f_{S_1}, f_{S_2} \rangle = \mathbb{E}_x[f_{S_1}(x)f_{S_2}(x)] = \mathbb{E}_{x_i}\left[x_i\mathbb{E}_{x|x_i}\left[f_{S_2}(x)\prod_{j \in S_1 - \{i\}} x_j\right]\right]\]The inner expectation equals \(c\), for some constant \(c\) that doesn’t depend on \(x_i\). Thus we can write the expectation as
\[\mathbb{E}_{x_i}\left[x_i \cdot c\right] = \frac{1}{2}\cdot c + \frac{1}{2} \cdot -c = 0\]Thus, the set is orthonormal. This leaves showing they form a basis. For this, note that there’s an isomorphism between functions \(f: \{-1, +1\}^n \to \mathbb{R}\) and vectors in \(\mathbb{R}^{2^n}\), where the mapping is
\[f \mapsto \begin{bmatrix} f(0) \\ f(1) \\ \cdots \\ f(2^n - 1) \end{bmatrix}\]Isomorphisms preserve dimensionality, so the function space must have dimension \(2^n\). There are \(2^n\) different subsets of \(\{1,\ldots,n\}\), so the parity functions must span the entire space. \(\blacksquare\)
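If you prefer sanity checks to proofs, the orthonormality claim is easy to verify by brute force for small \(n\):

```python
import itertools

n = 4
all_x = list(itertools.product([-1, 1], repeat=n))            # all 2^n inputs
all_subsets = [S for r in range(n + 1)
               for S in itertools.combinations(range(n), r)]  # all 2^n subsets

def f_S(S, x):
    """Parity function: product of x_i over i in S (empty product is 1)."""
    out = 1
    for i in S:
        out *= x[i]
    return out

def inner(S1, S2):
    """<f_S1, f_S2> = E_x[f_S1(x) f_S2(x)] under the uniform distribution."""
    return sum(f_S(S1, x) * f_S(S2, x) for x in all_x) / len(all_x)

for S1 in all_subsets:
    for S2 in all_subsets:
        assert inner(S1, S2) == (1.0 if S1 == S2 else 0.0)
print("parity functions are orthonormal for n =", n)
```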
Low Degree Approximation
Because the parity functions form an orthonormal basis, any function over \(\{-1, +1\}^n\) can be decomposed into a sum of parity functions, where the coefficient comes from the inner product. Let \(P(n)\) be the set of all subsets of \(1\) to \(n\). Then for any \(f\),
\[f(x) = \sum_{S \in P(n)} \langle f, f_S \rangle f_S(x)\]At this point we draw a comparison to Fourier series to build some intuition. Any periodic signal \(s(t)\) that’s sufficiently well behaved can be decomposed into a sum of sinusoids. The sinusoids \(\{1, \sin(t), \cos(t), \sin(2t), \cos(2t), \ldots \}\) form an orthonormal basis (which you can prove by throwing enough trigonometry identities at the problem.)
Intuitively, the earlier terms represent low-frequency components of the signal, and the later terms represent high-frequency components. To reconstruct \(s(t)\) exactly, we use the full series.
\[s(t) = a_0 + \sum_{n=1}^\infty a_n \sin(nt) + \sum_{n=1}^\infty b_n \cos(nt)\]We can’t compute this exactly because the sum is infinite. However, we can approximate the signal by truncating the sum at some maximum \(N\)
\[s(t) \approx a_0 + \sum_{n=1}^N a_n \sin(nt) + \sum_{n=1}^N b_n \cos(nt)\]This corresponds to dropping all the high frequency terms. Depending on the signal, this can still be a pretty good approximation.
Now, apply the same motivation to modeling \(f(x)\). We can write \(f\) as
\[f(x) = \sum_{S \in P(n)} \langle f, f_S \rangle f_S(x)\]Computing this exactly is intractable because there are \(2^n\) subsets. However, note that each term \(f_S(x_1,\ldots,x_n)\) is the product over a choice of indices. When \(S = \emptyset\), we get a \(0\)-degree constant term. When \(S\) is all numbers from \(1\) to \(n\), we get an \(n\)-degree term.
So, what if we limited to just the subsets \(S\) where \(|S| \le d\)? This reduces the number of terms from \(2^n\) to \(O(n^d)\), turning an exponentially big problem into a polynomially big one.
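To put numbers on that, here is the count for \(n = 60\) (the number of hyperparams in the paper’s experiments) and \(d = 3\):

```python
from math import comb

n, d = 60, 3
total = 2 ** n                                       # all subsets of n variables
low_degree = sum(comb(n, k) for k in range(d + 1))   # subsets with |S| <= d
print(total, low_degree)   # 1152921504606846976 vs 36051
```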
The approximation above is the best \(d\)-degree approximation, if you measure approximation error between \(f\) and \(g\) as \(\langle f-g, f-g\rangle = \mathbb{E}[(f-g)^2]\). But the best approximation isn’t necessarily a sparse one, and once we enforce sparsity, is a good approximation guaranteed to still exist? It turns out the answer is yes. From the paper:
Fact 4. [Man94] Let \(f\) be such that \(L_1(f) \le s\). Then there exists \(g\) such that \(g\) is \(s^2/\epsilon\) sparse and \(\mathbb{E}[(f-g)^2] \le \epsilon\). Function \(g\) is constructed by taking all coefficients of magnitude \(\epsilon/s\) or larger in \(f\)’s expansion as a polynomial.
Not only does this guarantee that a sparse approximation \(g\) exists, if \(f\) is a \(d\)-degree polynomial, \(g\) is also guaranteed to be degree \(d\) or lower, because it is formed by selecting terms that already exist in \(f\), and those terms must be degree \(d\) or lower.
Polynomial Recovery Algorithm
Let’s assume we have evaluations \(y_i = f(x_i)\) of \(f\) at \(T\) different points \(x_i\). The first trick for fitting a sparse polynomial is a classic: formulate the problem as a linear regression problem in a larger feature space. There are \(O(n^d)\) features, one for each subset \(S\) with \(|S| \le d\). We want to find coefficients \(\alpha_S\) that minimize the mean squared error.
\[\sum_{i=1}^T (y_i - \sum_{|S| \le d} \alpha_S f_S(x_i))^2\]To encourage sparsity, we apply another classic solution - add L1 regularization.
\[\min_{\alpha_S} \lambda \|\alpha\|_1 + \sum_{i=1}^T (f(x_i) - \sum_{|S| \le d} \alpha_S f_S(x_i))^2\]After solving this, we take the \(s\) coefficients with the largest magnitude, where \(s\) is a hyperparam for how sparse we want the polynomial to be.
(Yes, the hyperparam optimization method itself has hyperparams. It’s great.)
This gives the main subroutine, called Polynomial Sparse Recovery.
When called, it returns a sparse polynomial \(g\), and the set of all variables \(J\) that \(g\) depends on.
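Here is a rough sketch of what that subroutine could look like, using scikit-learn’s Lasso for the L1-regularized regression. The paper’s actual implementation and solver settings may differ; this is just to make the structure concrete.

```python
import itertools
import numpy as np
from sklearn.linear_model import Lasso

def parity_features(X, d):
    """Expand a (T, n) matrix of +/-1 settings into one feature per subset |S| <= d."""
    n = X.shape[1]
    subsets = [S for r in range(d + 1)
               for S in itertools.combinations(range(n), r)]
    feats = np.stack([np.prod(X[:, list(S)], axis=1) if S else np.ones(len(X))
                      for S in subsets], axis=1)
    return feats, subsets

def polynomial_sparse_recovery(X, y, d=3, s=5, lam=0.1):
    """X: (T, n) hyperparam settings, y: (T,) observed errors."""
    feats, subsets = parity_features(X, d)
    model = Lasso(alpha=lam, fit_intercept=False).fit(feats, y)
    # Keep the s coefficients with the largest magnitude.
    top = np.argsort(-np.abs(model.coef_))[:s]
    g = [(subsets[i], model.coef_[i]) for i in top if model.coef_[i] != 0]
    J = sorted({i for S, _ in g for i in S})      # variables g depends on
    return g, J
```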
Runtime Bound For PSR
Let \(f\) be an \((\epsilon, d, s)\)-bounded function. Let \(\{H_S\}\) be an orthonormal polynomial basis bounded by \(K\). Then the PSR algorithm finds a \(g\) that’s within \(\epsilon\) of \(f\) in time \(O(n^d)\) with sample complexity \(T = \tilde{O}(K^2s^2 \log N/\epsilon)\).
In this statement, \(N\) is the size of the polynomial basis. For the special case of parity functions, we have \(K = 1\) and \(N = O(n^d)\), giving \(T = \tilde{O}(s^2 d \log n / \epsilon)\).
See paper for the proof.
Harmonica: The Full Algorithm
We assume \(f\) is the error %, and that we want to minimize it. Sample some values \(f(x_i)\) by training some models, and fit a sparse polynomial \(g\).
Since \(g(x)\) has at most \(s\) terms, each of which is degree \(d\), it can depend on at most \(sd\) different variables. With the right choice of \(s,d\), brute forcing all \(2^{sd}\) possibilities is small enough to be doable. Note that \((2^s)^d\) is likely dwarfed by the \(O(n^dT)\) runtime for doing the linear regression.
However, this only gives settings for the variables \(J\) that \(g\) depends on. Thus, the full algorithm proceeds in stages. At each stage, we fit a sparse polynomial \(g\), find the settings that minimize \(g\), freeze those hyperparams, then iterate in the smaller hyperparam space. Applying this multiple times shrinks the hyperparam space, until it’s small enough to use a baseline hyperparam optimization method, like random search or successive halving.
(In practice, the authors find the \(t\) best settings instead of just the best one, then pick one of those \(t\) at random in the later stages. For example, if \(n=30\), \(t = 4\) and stage 1 decided values for 10 hyperparams, in stage 2 we sample one of the \(t\) settings to decide the first 10, then sample randomly for the remaining 20.)
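Putting the pieces together, here is a very simplified sketch of the staged loop, reusing the `polynomial_sparse_recovery` sketch from above. It fixes the single best setting per stage rather than sampling among the top \(t\), and hands the final search to whatever base optimizer you pass in, so treat it as an outline rather than a faithful reimplementation.

```python
import itertools
import numpy as np

def harmonica(n, evaluate, n_stages=2, samples_per_stage=300, d=3, s=5,
              base_optimizer=None):
    """Simplified Harmonica outline.

    evaluate(x): trains a model with the +/-1 hyperparam vector x and
        returns its error (expensive!).
    base_optimizer(free_vars, evaluate): returns settings for the
        remaining free variables (e.g. random search or successive halving).
    """
    fixed = {}                                    # index -> frozen value
    for _ in range(n_stages):
        # Sample random settings for the still-free variables.
        X = np.random.choice([-1, 1], size=(samples_per_stage, n))
        for i, v in fixed.items():
            X[:, i] = v
        y = np.array([evaluate(x) for x in X])

        g, J = polynomial_sparse_recovery(X, y, d=d, s=s)

        # Brute force the (at most s*d) variables that g depends on.
        best_val, best_assign = float("inf"), ()
        for assign in itertools.product([-1, 1], repeat=len(J)):
            setting = dict(zip(J, assign))
            val = sum(c * np.prod([setting[i] for i in S]) for S, c in g)
            if val < best_val:
                best_val, best_assign = val, assign
        fixed.update(zip(J, best_assign))

    free_vars = [i for i in range(n) if i not in fixed]
    if base_optimizer is not None:
        fixed.update(base_optimizer(free_vars, evaluate))
    return fixed
```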
Experiments
The problem used is CIFAR-10, trained with a deep Resnet model. The model has 39 hyperparams, deciding everything from learning rate to L2 regularization to connectivity of the network, batch size, and non-linearity. To make the problem more difficult, 21 dummy hyperparams are added, giving 60 hyperparams total.
Harmonica is evaluated with 300 samples at each stage (meaning 300 trained models), with degree \(d = 3\), sparsity \(s=5\), and \(t=4\). They perform 2-3 stages of PSR (depending on the experiment), and use successive halving as the base hyperparam optimizer.
Results Summary
Harmonica outperforms random search, Hyperband and Spearmint.
In this plot, the difference between Harmonica 1 and Harmonica 2 is that in Harmonica 2, the final stage is given less time. This gives worse performance, but better runtime. The faster version (Harmonica 2) runs 5x faster than Random Search and outperforms it. The slower version (Harmonica 1) is still competitive with other approaches in runtime while giving much better results. Even when the base hyperparam optimizer is random search instead of successive halving, it outperforms random search in under half the time.
As Harmonica discovers settings for the important hyperparams, the average test error in the restricted hyperparam space drops. Each stage gives another drop in error, but the drop is small between stage 2 and stage 3 - I’m guessing this is why they decided to only use 2 stages in the timing experiments.
To show that the chosen terms are meaningful, the coefficients in each stage are sorted by magnitude, to show which hyperparams influence error the most. Results are shown for a 3-stage Harmonica experiment, where sparsity \(s = 8\) for the first 2 stages and \(s = 5\) for the 3rd stage.
The minimizing settings often line up with human convention. Batch norm should be on (vs off), the activation function should be ReLU (vs sigmoid or tanh), and Adam should be used (vs SGD).
Check the original paper for more details on the hyperparam definitions.
One especially interesting term is the one discovered at stage 1-8. The dummy variable is extra and can be ignored. The authors say that in their code, if Optnet and Share gradInput are both true, training becomes unpredictable. The learned \(g\) is minimized if this term equals \(-1\), which happens if only one of Optnet and Share gradInput is true at a given time.
Advantages
In each stage of Harmonica, each model can be evaluated in parallel. By contrast, Bayesian optimization techniques are more difficult to parallelize because the derivation often assumes a single-threaded evaluation.
Enforcing hyperparam sparsity leads to interpretability - weights of different terms can be used to interpret which hyperparams are most important and least important. The approach successfully learned to ignore all dummy hyperparams.
Disadvantages
A coworker mentioned that Harmonica assumes a fixed computation budget for each stage, which doesn’t make it as good in an anytime setting (which is a better fit for Bayesian optimization.)
My intuition says that this approach works best when you have a large number of hyperparameters and want to initially narrow down the space. Each stage requires hundreds of models, and in smaller hyperparam spaces, evaluating 100 models with random search is going to be enough tuning already. That being said, perhaps you can get away with fewer if you’re working in a smaller hyperparam space.
In the first two stages, Harmonica is run on a smaller 8 layer network, and the full 56 layer Resnet is only trained in the final stage. The argument is that hyperparams often have a consistent effect as the model gets deeper, so it’s okay to tune on a small network and copy the settings to a larger one. It’s not clear whether a similar trick was used in their baselines. If it wasn’t, the runtime comparisons aren’t as impressive as they look.
The approach only works if it makes sense for the features to be sparse. This isn’t that big a disadvantage - in my experience, neural net hyperparams are nearly independent. This is backed up by the Harmonica results - the important terms are almost all degree 1. (The authors observed that they did need \(d \ge 2\) for best performance.)
The derivation only works with discrete, binary features. Extending the approach to arbitrary discrete hyperparams doesn’t look too bad (just take the closest power of 2), but extending to continuous spaces looks quite a bit trickier.
Closing Thoughts
Overall I like this paper quite a bit. I’m not sure how practical it is - the “tune on small, copy to large” trick in particular feels a bit sketchy. But that should only affect the runtime of the algorithm, not the performance. I think it’s neat that it continues to work in empirical land.
One thing that’s nice is that because the final stage can use any hyperparam optimization algorithm, it’s easy to mix-and-match Harmonica with other approaches. So even if Harmonica doesn’t pan out in further testing, the polynomial sparse recovery subroutine could be an interesting building block in future work.
On The Perils of Batch Norm
This post is written for deep learning practitioners, and assumes you know what batch norm is and how it works.
If you’re new to batch norm, or want a refresher, a brief overview of batch norm can be found here.
* * *
Let’s talk about batch norm. To be more specific, let’s talk about why I’m starting to hate batch norm.
One day, I was training a neural network with reinforcement learning. I was trying to reproduce the results of a paper, and was having lots of trouble. (Par for the course in RL.) The first author recommended I add batch norm if I wasn’t using it already, because it was key to solving some of the environments. I did so, but it still didn’t work.
A few days later, I found out that when running my policy in the environment,
- I fed the current state in a batch of size \(1\).
- I ran the policy in train mode.
So I was normalizing my input to \(\vec{0}\) all the time. Which sounds like a pretty obvious issue, but thanks to reinforcement learning’s inherent randomness, it wasn’t obvious my input was always \(\vec{0}\).
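The arithmetic of the bug is easy to see if you write out train-mode batch norm by hand (ignoring the learned scale and shift, which start at 1 and 0 anyway):

```python
import numpy as np

def batch_norm_train(x, eps=1e-5):
    """Train-mode batch norm with gamma=1, beta=0: normalize with batch statistics."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    return (x - mean) / np.sqrt(var + eps)

state = np.array([[1.7, -0.3, 4.2]])   # a single state: batch of size 1
print(batch_norm_train(state))         # [[0. 0. 0.]] - every feature is its own batch mean
```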
I fixed it, and started getting the results I was supposed to get.
* * *
A few months later, an intern I was working with showed me a fascinating bug in his transfer learning experiments. He was using my code, which used TensorFlow’s MetaGraph tools. They let you take a model checkpoint and reconstruct the TF graph exactly the way it was at the time the checkpoint got saved. This makes it really, really easy to load an old model and add a few layers on top of it.
Unfortunately, MetaGraph ended up being our downfall. Turns out it doesn’t play well with batch norm! Model checkpoints are saved while the model is training. Therefore, the model from the meta checkpoint is always stuck in train mode. Normally, that’s fine. But batch norm turns it into a problem, because the train time code path differs from the test time code path. We couldn’t do inference for the same reason as the previous bug - we’d always normalize the input to \(\vec{0}\). (This is avoidable if you make the `is_training` flag a placeholder, but for structural reasons that wasn’t doable for this project.) I estimate we spent at least 6 hours tracking down the batch norm problem, and it ended with us concluding we needed to rerun all of the experiments we had done so far.
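For reference, here’s a minimal sketch of the placeholder workaround in TF 1.x style (layer sizes are made up, and the moving-average update ops are omitted for brevity):

```python
import tensorflow as tf  # assumes a TF 1.x style graph

# Default to inference mode; switch to train mode via feed_dict.
is_training = tf.placeholder_with_default(False, shape=(), name="is_training")

x = tf.placeholder(tf.float32, [None, 128])
h = tf.layers.dense(x, 64, activation=tf.nn.relu)
h = tf.layers.batch_normalization(h, training=is_training)

# Train step:  sess.run(train_op, feed_dict={x: batch, is_training: True})
# Inference:   sess.run(h, feed_dict={x: batch})   # falls back to the default, False
```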
* * *
That same day (and I mean literally the same day), I was talking to my mentor about issues I was having in my own project. I had two implementations of a neural net. I was feeding the same input data every step. The networks had exactly the same loss, exactly the same hyperparameters, with exactly the same optimizer, trained with exactly the same number of GPUs, and yet one version had 2% less classification accuracy, and consistently so. It was clear that something had to be different between the two implementations, but what?
It was very lucky the MetaGraph issues got me thinking about batch norm. Who knows how long it would have taken me to figure it out otherwise?
Let’s dig into this one a bit, because this problem was the inspiration for this blog post. I was training a model to classify two datasets. For the sake of an example, let’s pretend I was classifying two digit datasets, MNIST and SVHN.
I had two implementations. In the first, I sample a batch of MNIST data and a batch of SVHN data, merge them into one big batch of twice the size, then feed it through the network.
In the second, I create two copies of the network with shared weights. One copy gets MNIST data, and the other copy gets SVHN data.
Note that in both cases, half the data is MNIST, half the data is SVHN, and thanks to shared weights, we have the same number of parameters and they’re updated in the same way.
Naively, we’d expect the gradient to be the same in both versions of the model. And this is true - until batch norm comes into play. In the first approach, the model is trained on one mixed batch of MNIST and SVHN data. In the second approach, the model is trained on two batches, one of just MNIST data, and one of just SVHN data.
At training time, everything works fine. But you know how the two networks have shared weights? The moving averages for dataset mean and variance were also shared, getting updated on both datasets. In the second approach, the top network is trained with estimated mean and variance from MNIST data. The bottom network is trained with estimated mean and variance from SVHN data. But because the moving average was shared across the two networks, the moving average converged to the average of MNIST and SVHN data.
Thus, at test time, the scaling and shifting that we apply is different from the scaling and shifting the network expects. And when test-time normalization differs from train-time normalization, you get results like this.
This plot is the top, median, and worst performance over 5 random seeds on one of my datasets. (This isn’t with MNIST and SVHN anymore, it’s with the two datasets I actually used.) When we do two networks with shared weights, not only was there a significant drop in performance, the variance of the output increased too.
Whenever individual minibatches aren’t representative of your entire data distribution, you can run into this problem. That means forgetting to randomize your input is especially bad with batch norm. It also plays a big role in GANs. The discriminator is usually trained on a mix of fake data and real data. If your discriminator uses batch norm, it’s incorrect to alternate between batches of all fake or all real data. Each minibatch needs to be a 50-50 mix of both.
(Aside: in practice, we got the best results by using two networks with shared weights, with separate batch norm variables for each network. This was trickier to implement, but it did boost performance.)
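To see why the shared moving average ends up wrong for both copies, here’s a small simulation. The dataset statistics below are stand-ins (the real experiment wasn’t MNIST vs SVHN anyway); the point is only that a single shared running mean converges to the midpoint of the two datasets.

```python
import numpy as np

rng = np.random.RandomState(0)
# Stand-ins for two datasets with different statistics (values are illustrative).
dataset_a = rng.normal(loc=0.13, scale=0.31, size=(10000, 1))   # "MNIST-like"
dataset_b = rng.normal(loc=0.45, scale=0.20, size=(10000, 1))   # "SVHN-like"

momentum, running_mean = 0.99, 0.0
for _ in range(1000):
    # Copy A only ever sees dataset_a batches, copy B only dataset_b batches,
    # but both copies update the *same* shared moving average.
    for data in (dataset_a, dataset_b):
        batch = data[rng.randint(len(data), size=64)]
        running_mean = momentum * running_mean + (1 - momentum) * batch.mean()

print(running_mean)   # ~0.29, the midpoint - the wrong normalization for either dataset
```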
Batch Norm: The Cause of, And Solution To, All of Life’s Problems
You may have noticed a pattern in these stories.
I’ve thought about this quite a bit, and I’ve concluded that I’m never touching batch norm again if I can get away with it.
My reasoning comes from the engineering side. Broadly, when code does the wrong thing, it happens for one of two reasons.
- You make a mistake, and it’s obvious once you see it. Something like a mistyped variable, or forgetting to call a function.
- Your code has implicit assumptions about the behavior of other code it interacts with, and one of those assumptions breaks. These bugs are more pernicious, since it can take a while to figure out what assumption your code relied on.
Both mistakes are unavoidable. People make stupid mistakes, and people forget to check all the corner cases. However, the second class can be mitigated by favoring simpler solutions and reusing code that’s known to work.
Alright. Now: batch norm. Batch norm changes models in two fundamental ways.
- At training time, the output for a single input \(x_i\) depends on the other \(x_j\) in the minibatch.
- At testing time, the model runs a different computation path, because now it normalizes with the moving average instead of the minibatch average.
Almost no other optimization trick has these properties. That makes it easier to write code that only works when inputs are minibatch independent, or only works when train time and test time do the same thing. The code’s never been pushed that way. I mean, why would it? It’s not like somebody’s going to come up with a technique that breaks those assumptions, right?
Yes, you can treat batch norm as black box normalization magic, and it can even work out for a while. But in practice, the abstraction leaks, like all abstractions do, and batch norm’s idiosyncrasies make it leak a lot more than it should.
Look, I just want things to work. So every time I run into Yet Another Batch Norm issue, I get disappointed. Every time I realize I have to make sure all my code is batch-norm proof, I get annoyed this is even a thing I have to do. Ever since the one network vs two network thing, I’ve been paranoid, because it is only by dumb luck that I implemented the same model twice. The difference is big enough that the whole project could have died.
So…Why Haven’t People Ditched Batch Norm?
I’ll admit I’m being unfair. Minibatch dependence is indefensible - no one is going to argue that it’s a good quality for models to have. I’ve heard many people complain about batch norm, and for good reasons. Given all this, why is batch norm still so ubiquitous?
There’s a famous letter in computer science: Dijkstra’s Go To Statement Considered Harmful. In it, Dijkstra argues that the goto statement should be avoided, because it makes code harder to read, and any program that uses goto can be rewritten to avoid it.
I really, really wanted to title this post “Batch Norm Considered Harmful”, but I couldn’t justify it. Batch norm works too well.
Yes, it has issues, but when you do everything right, models train a lot faster. No contest. There’s a reason the batch norm paper has over 1400 citations, as of this post.
There are alternatives to batch norm, but they have their own trade-offs. I’ve had some success with layer norm, and I hear it makes way more sense with RNNs. I’ve also heard it doesn’t always work with convolutional layers.
Weight norm and cosine norm also sound interesting, and the weight norm paper said they were able to use it in a problem where batch norm didn’t work. I haven’t seen too much adoption of those methods though. Maybe it’s a matter of time.
Layer norm, weight norm, and cosine norm all fix the contracts that batch norm breaks. If you’re working on a new problem, and want to be brave, I’d try one of those instead of batch norm. Look, you’ll need to do hyperparam tuning anyways. When tuned well, I’d expect the difference between various methods to be pretty low.
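As a quick illustration of why layer norm sidesteps both problems, here’s the core computation (minus the learned gain and bias). Each example is normalized over its own features, so there’s no minibatch dependence and no separate test-time code path:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each example over its feature dimension - no batch statistics."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

x = np.random.randn(1, 8)      # works fine even with a batch of one
print(layer_norm(x))
```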
(If you want to be extra brave, you could try batch renormalization. Unfortunately it still has moving averages that are only used at test time. EDIT (June 7, 2017): multiple people, including some of the paper authors, have told me this is incorrect. They’re right, ignore this paragraph.)
In my case, I can’t afford to switch from batch norm. Previous state of the art used batch norm, so I know it works, and I’ve already paid my dues of getting batch norm to work with my model. I imagine other researchers are in similar spots.
It’s the Faustian bargain of deep learning. Faster training, in exchange for insanity. And I keep signing it. And so does everybody else.
Oh well. At least we didn’t have to sacrifice any goats.