Emergency Post Because OpenAI Just Revealed Their Dota 2 1v1 Bot And I Have Opinions
See title!
The International is the tournament for Dota 2, and OpenAI just demoed their bot that went undefeated against several pro players in 1v1.
Is this awesome? Yes.
Does this represent a step forward in what machine learning can do? Maybe. That’s where it gets more complicated.
* * *
Let’s start with a broader question: why do machine learning researchers make game AIs?
I claim they do so for two reasons.
- To prove it’s possible for computers to beat humans at the game.
- To use the game as an environment for developing new machine learning techniques.
Now, there’s a healthy intersection between gamers and ML researchers, so these objectives certainly aren’t mutually exclusive. I bring it up because a number of game AIs are very tailored to the game they’re trying to solve. To judge whether a game AI is impressive or not, you need to know details behind how it works.
For example, there’s a Pokemon speedrun bot that tries to complete Pokemon Red as quickly as possible. It’s achieved record times, but it didn’t learn how to do it. Rather, a human designed a flowchart of what the bot should do, and it follows it until it achieves success.
Similarly, a while back there was some buzz about a computer that beat an expert fighter pilot at an aerial combat simulator. It rather predictably spawned many alarmist headlines. Read the details, however, and you find it’s mostly a rule-based AI (if this then that), learned through genetic algorithms, with some expert knowledge added to the training process. Still cool, but a lot of domain specific knowledge went into the problem.
Neither of these are super impressive from the standpoint of seeing what machine learning is capable of, because they both rely on a human providing specific domain knowledge.
In contrast, AlphaGo was impressive because it both solved a new game, and did so by combining supervised pretraining + reinforcement learning + Monte Carlo Tree Search in a way that hadn’t been done before. Generally, the best games for research are ones where developing new techniques is necessary to achieve superhuman performance, and Go was an example of this.
Now, DeepMind cares about StarCraft, because real-time is hard, partial information is hard, the wide variety of units means you have lots to learn, and most importantly the long-term strategy and planning is something that RL systems tend to lack.
Dota 2 fills a similar niche - also real-time, also partial information, also has tons of heroes, and also requires long-term planning to play well. With current approaches, we can’t solve StarCraft, and we can’t solve Dota 2. We need new ideas to teach computers to play these games at a competent level.
That’s why people care.
* * *
So far, OpenAI has not released many details behind the bot. Speculating on what the bot means for the field would be massively irresponsible, until more info comes out to provide context.
That being said…speculating how the bot might work is also massively fun. So let’s get into it!
First of all, the bot most likely only knows how to play 1v1, Shadow Fiend vs Shadow Fiend. Rule of thumb: for demos and for papers, assume that if something isn’t mentioned or shown, it isn’t proven to work yet. This is not a strike against the work! There’s no reason to make the problem more difficult if you aren’t even sure the problem is solvable.
Next, from the header:
The bot learned the game from scratch by self-play, and does not use imitation learning or tree search.
So most likely, it’s reinforcement learning. Restricting the game to only 1v1 Shadow Fiend gives you a ton of nice things.
- The model only needs to understand how a single character in the game works.
- The same policy can be used for both players.
In many ways, this reminds me of the Super Smash Brothers Melee bot that beat several pro players - it can beat pros, but only if both players are Captain Falcon, and only if the players are P1 and P2, and only if they play on Battlefield. Cool? Yes. Limited? Also yes.
The existence of that SSBM bot makes me suspect there weren’t many breakthroughs behind this bot, just some grunt work on getting the system together and getting an off-the-shelf RL algorithm to train. I don’t play Dota (used to play League), but from what I’ve heard 1v1 laning is much more about character micro. I can see RL learning to deal with that - it’s the macro level play that I expect to be hard.
Assuming it’s an RL approach, there are several follow-up questions.
- How is game state getting represented?
- How are game actions getting represented?
- How often does the AI output actions?
- How is reward defined?
- What reinforcement learning algorithm is used?
- How long does it take to train?
The impressiveness of the achievement depends on the answers to these questions. Once again, we don’t know details yet. But I’m going to throw out some guesses. Hopefully they aren’t completely wrong; I’ll update when more information comes out.
- State representation: I assume OpenAI is collaborating with Valve on this. Valve has an official Dota 2 API, and I assume they’re using it, because why wouldn’t you? I mean, you could try to learn from raw pixels, but that would be ridiculously impressive, almost too much so. It’s much more plausible that it’s getting exact positions and health of all units within its vision, which certainly helps on the micro front. Human players have all that info too, but humans don’t have the attentional bandwidth to know everything down to the precision that an API could give you.
- Action representation: I would guess raw (x, y) locations for mouse clicks, the skill hotkeys, and special actions for clicking on units. In theory you only need mouse clicks, but if you’re already getting information from the game API, you might as well use the API for targeting functionality too. That way your model doesn’t have to learn how to click on units, and can focus on learning when clicking units is good and when clicking units is bad.
- Action frequency: No idea. They say that their APM is comparable to an average human player, which gives some indication of how often the network can send commands. I’m mostly curious about whether there are limits on APM, or if the neural net is slow enough to give human-level APM automatically.
This is one way real time problems can be more difficult - in the latency between polling for an action, computing an action, and sending it back to the game, the environment could be in a very different state. On the other hand, humans have the same problem - we have to process the game visuals, decide on an action, then send the command to our hands. Clearly we can deal with it fine, so maybe I’m overestimating how important this is.
Aside: in StarCraft, you can make much better usage of APM because you have an entire army to command. Let a StarCraft bot do thousands of actions per minute and you can get ridiculous things like this. In Dota 2, because there’s just one controllable character, there’s a cap on how much APM can improve your gameplay. This is mentioned in the blog post too, but I want to emphasize it. Whenever people post StarCraft bots, people invariably say they’ll wait for human level APM. In Dota 2’s case, I think it actually won’t matter that much - what matters is precision of movement and knowledge of health bars down to a single hit point. That’s where computers have the advantage.
- Reward definition: Here’s where I get really curious. I assume the reward isn’t just “1 if you win, -1 if you lose.” If it is, that would be insane. I mean, it’s not out of the question - I think that barely kisses the realm of possibility. But I’d expect that the reward is shaped, giving partial reward based on EXP, gold, or health. (In RL, we say a reward is shaped if it gives small reward for doing things you expect to be useful. Reward shaping is one of the big ways to add domain knowledge to your problem. A sketch of what this could look like is after this list.)
Generally, shaped rewards are easier to learn from than the pure “win/loss” signal, but only if the shaping is done well, and tuning the reward can be a huge pain. I’d actually rather have it be the win/loss reward, because if it was, it would be even cooler that the bot can learn so many complex behaviors from purely win/loss.
It could even be using the system from the Learning from Human Preferences paper - I could see it being useful for giving partial signal to behaviors that are hard to define.
- Learning algorithm: Probably PPO. It works well, and it supports continuous control, which may be a better fit for mouse movement.
- Training time: “Two weeks real time”, except that says basically nothing for how many machines they used, whether they were able to run the game faster than real time, etc. For context, DeepMind says they got up to 2000 frames per second in a single thread for their StarCraft API. Let’s just say it took “a while” and leave it at that.
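To make the reward shaping point concrete, here is a minimal sketch of what a shaped 1v1 reward could look like. This is pure speculation on my part - OpenAI hasn’t said how their reward is defined, and every field name and coefficient below is made up for illustration.

```python
def shaped_reward(prev_state, state, won=None):
    """Hypothetical shaped reward for a 1v1 laning bot.

    Gives a large terminal reward for winning or losing, plus small
    per-step rewards for gold, EXP, and health differences. All field
    names and coefficients are invented for this sketch.
    """
    if won is not None:                      # terminal step: the true objective
        return 1.0 if won else -1.0

    reward = 0.0
    reward += 0.001 * (state["gold"] - prev_state["gold"])
    reward += 0.001 * (state["exp"] - prev_state["exp"])
    reward += 0.01 * (state["hp_fraction"] - prev_state["hp_fraction"])
    reward -= 0.01 * (state["enemy_hp_fraction"] - prev_state["enemy_hp_fraction"])
    return reward
```

Tuning those coefficients is exactly the pain mentioned above - put too much weight on the shaped terms and the bot optimizes for farming instead of winning.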
Read-through: Hyperparameter Optimization: A Spectral Approach
Similar to Wasserstein GAN, this is another theory-motivated paper with neat applications to deep learning. Once again, if you are looking for proof details, you are better off reading the original paper. The goal of this post is to give background and motivation.
Why Is This Paper Important?
Hyperparam optimization is a weird problem because everybody has to do it and nobody really likes it. The common options are either grid search, random search, or line search if your parameter is continuous. A mix of those does a pretty good job for most ML algorithms, even deep neural nets, which are a bit notorious for having many hyperparams.
However, I still think it’s worth keeping an eye on hyperparam optimization, for a couple reasons.
- If hyperparam optimization is better, more aspects of neural net design can be described as a hyperparam and automated. For example, the number of layers can be an explicit hyperparam, instead of something picked by hand. Right now, people don’t do this because it’s not worth doing for most problems, but the trend in research is that everybody pushes at the limits of their tools.
- Preliminary results like Neural Architecture Search suggest that there is learnable structure in neural net hyperparameters. It’s not yet applicable to anything, but I could see meta-learned models playing a bigger role in the future. At the same time, maybe we don’t need to throw a neural net at everything to get state-of-the-art results. (I like doing that as much as the next person, but machine learning doesn’t always have to mean deep learning.)
- There are simply a ton of interesting research questions that appear when each data point takes hours or days to collect. Similar issues show up in robotics learning - robots are expensive and often need human supervision, which makes it hard to collect the ImageNet-scale datasets that neural nets are so good at.
- Better hyperparam optimization makes ML more accessible to others, because it reduces the amount of esoteric knowledge needed to get models to work. That’s good because people shouldn’t need to know that esoteric knowledge to solve their problems. It’s bad for me because weird deep learning expertise is one of the few things I’m good at, and I’m not sure I like knowing it could be obsolete down the line. All the more reason to pay attention.
Some Quick Background on Hyperparam Optimization
For two approachable introductions to the subject, I highly recommend two blog posts from Ben Recht’s blog, linked here and here. The key takeaways are
- Random search is a simple, deceptively strong baseline.
- Bayesian optimization can outperform random search by a bit.
- However, if you give random search twice the time, it beats a lot of Bayesian optimization methods. Yes, it’s more compute time, but sometimes buying more machines is easier than integrating clever ideas into your software stack.
- Successive halving (SH) is a nice extension of random search that lets you run many more hyperparams in the same amount of time, by pruning poor hyperparams early on. You have to be a little careful with how aggressively you prune, but in practice it does pretty well.
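Since successive halving comes up again later as Harmonica’s base optimizer, here is a minimal sketch of the idea, assuming a stand-in `train_and_eval(config, budget)` function (not a real API) that returns validation error after training a config for a given compute budget.

```python
def successive_halving(configs, train_and_eval, min_budget=1, eta=2):
    """Minimal successive halving sketch.

    Each round, train every surviving config for the current budget,
    keep the best 1/eta fraction, and multiply the budget by eta.
    Poor configs get pruned early, so far more configs fit into the
    same total compute as plain random search.
    """
    budget = min_budget
    survivors = list(configs)
    while len(survivors) > 1:
        scores = [(train_and_eval(c, budget), c) for c in survivors]
        scores.sort(key=lambda pair: pair[0])          # lower error is better
        survivors = [c for _, c in scores[:max(1, len(scores) // eta)]]
        budget *= eta
    return survivors[0]
```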
Problem Setting
The paper focuses on discrete hyperparameter optimization. Generally, discrete optimization is trickier, and continuous problems can be approximated by discretizing the space, so I think this is an okay choice to make.
The paper further assumes all hyperparams are binary variables with value \(-1\) or \(+1\). Note you can still do things like allocate \(3\) hyperparams for the \(2^3 = 8\) different learning rates you want to try.
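As a concrete (made-up) example of that encoding, three \(\pm 1\) hyperparams can index into a hand-picked list of eight learning rates:

```python
# Illustrative only: the candidate values are arbitrary.
LEARNING_RATES = [1e-1, 3e-2, 1e-2, 3e-3, 1e-3, 3e-4, 1e-4, 3e-5]

def decode_learning_rate(x1, x2, x3):
    """Each x_i is -1 or +1. Treat the three bits as a binary index."""
    bits = [(x + 1) // 2 for x in (x1, x2, x3)]   # map -1/+1 to 0/1
    index = 4 * bits[0] + 2 * bits[1] + bits[2]
    return LEARNING_RATES[index]
```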
Let’s say there are \(n\) hyperparams. Then given all these assumptions, we want to learn a function \(f: \{-1,+1\}^n \to \mathbb{R}\), where the input is the hyperparam setting, and the output is the performance of the trained model. In the upcoming sections, we’ll use two different notations.
- \(f(x)\), in which case \(x = (x_1,x_2,x_3,\ldots, x_n)\) is some \(n\)-bit vector.
- \(f(x_1,x_2,\ldots,x_n)\), in which case we interpret \(f\) as a polynomial of \(n\) variables, where the variables equal either \(-1\) or \(+1\).
Polynomial Learning as Compressed Sensing
This is the key guiding framework of the algorithm. Viewing \(f(x)\) as a polynomial, we learn a sparse, low-degree polynomial \(g(x)\) that approximates \(f\).
What’s the intuition for why you’d want to do this? Well, to be honest, I have no idea why you’d decide to look at compressed sensing, instead of the other methods out there. But I can explain why it makes sense to me, after the fact.
To do better than random search, two things must be true.
- Your problem must have learnable structure.
- Your hyperparam optimization method needs to discover and exploit that latent structure.
All of machine learning can be seen as fitting low-dimensional functions to inputs in high-dimensional space. This gets at why fitting a low-degree polynomial would make sense.
But then why include sparsity? Remember that we’re going to have very few data points, because each data point requires training a neural net to convergence. Sparsity helps with this by reducing the capacity of the polynomial fitting process - it stops it from putting a little bit of weight on thousands of different terms, which is one of the easiest ways to overfit. As a side effect, sparsity tends to make interpreting models easier.
(Aside: this is one of the common complaints about deep learning - it’s hard to interpret much out of the sum of 100 activations. But in practice, most deep learning problems have a lot of data, and enforcing sparsity just ends up hurting performance.)
When deriving the algorithm, we’ll assume that evaluating \(f(x)\) for a single \(x\) takes “a long time, but not too long”. By this, we mean functions run in \(\Omega(n^d)\) time for a somewhat large \(d\), but still run faster than things with exponential runtime.
The Parity Function Basis
To fit a low-degree polynomial, first note that the Boolean functions \(f: \{-1,+1\}^n \to \mathbb{R}\) form an inner product space. Define the inner product as
\[\langle f, g\rangle = \mathbb{E}_{x} [f(x)g(x)]\]where the expectation is the uniform distribution over \(n\)-bit vectors. We can verify this satisfies the inner product definitions.
- Symmetry:
\[\langle f, g\rangle = \mathbb{E}_{x} [f(x)g(x)] =\mathbb{E}_{x} [g(x)f(x)] = \langle g,f \rangle\]
- Linearity: for some constant \(a\),
\[\langle af, g \rangle = \mathbb{E}_{x} [af(x)g(x)] = a \mathbb{E}_x [f(x)g(x)] = a\langle f, g\rangle\]
For functions \(f_1, f_2\),
\[\langle f_1+f_2, g\rangle = \mathbb{E}_x[(f_1(x) + f_2(x))g(x)] = \mathbb{E}_x [f_1(x)g(x)] + \mathbb{E}_x [f_2(x)g(x)] = \langle f_1, g\rangle + \langle f_2, g\rangle\]
- Positive definiteness:
\[\langle f, f\rangle = \mathbb{E}_x[f(x)^2] \ge 0\]
For \(\langle f, f\rangle\) to be \(0\), note \(f(x)^2 \ge 0\), so the only way for \(\mathbb{E}_x[f(x)^2]\) to equal \(0\) is if \(f(x) = 0\) everywhere.
Next, we want to find an orthonormal basis for this inner product space. For these functions, there’s a well known example. Let \(S \subset [n]\) be a subset of indices \(1\) to \(n\). Define \(f_S(x)\) as
\[f_S(x) = \prod_{i \in S} x_i\]The functions \(\{f_S\}\) (known as the parity functions) form an orthonormal basis.
Why is this an orthonormal basis? A set \(\{f_1, f_2, \ldots, f_n\}\) is orthonormal if it satisfies two properties.
- For every \(f_i\), \(\langle f_i, f_i \rangle = 1\).
- For every \(f_i, f_j\) where \(i \neq j\), \(\langle f_i, f_j \rangle = 0\).
Now, we show the parity functions have this property.
In the upcoming sections, sometimes \(i\) means an index from \(1\) to \(n\), and sometimes \(i\) means a number from \(0\) to \(2^{n}-1\) that should be interpreted as \(n\) bits of binary. It should be inferable from context.
- By definition, \(\langle f_i, f_i \rangle = \mathbb{E}[f_i(x)^2]\). Note that \(f_i(x) = -1\) or \(f_i(x) = 1\), depending on how many matching bits there are between \(i\) and \(x\). In either case, \(f_i(x)^2 = 1\), so \(\langle f_i, f_i \rangle = 1\).
- For any two different subsets \(S_1, S_2\), there exists some \(i\) in one, but not the other. Without loss of generality, let \(i \in S_1\) and \(i \notin S_2\). We can write \(f_{S_1}(x)\) as
\[f_{S_1}(x) = x_i \prod_{j \in S_1 - \{i\}} x_j\]
Take the expectation over \(x\), and factor it into the conditional expectation on \(x_i\).
\[\langle f_{S_1}, f_{S_2} \rangle = \mathbb{E}_x[f_{S_1}(x)f_{S_2}(x)] = \mathbb{E}_{x_i}\left[x_i\mathbb{E}_{x|x_i}\left[f_{S_2}(x)\prod_{j \in S_1 - \{i\}} x_j\right]\right]\]The inner expectation equals \(c\), for some constant \(c\) that doesn’t depend on \(x_i\). Thus we can write the expectation as
\[\mathbb{E}_{x_i}\left[x_i \cdot c\right] = \frac{1}{2}\cdot c + \frac{1}{2} \cdot -c = 0\]Thus, the set is orthonormal. This leaves showing they form a basis. For this, note that there’s an isomorphism between functions \(f: \{-1, +1\}^n \to \mathbb{R}\) and vectors in \(\mathbb{R}^{2^n}\), where the mapping is
\[f \mapsto \begin{bmatrix} f(0) \\ f(1) \\ \cdots \\ f(2^n - 1) \end{bmatrix}\]Isomorphisms preserve dimensionality, so the function space must have dimension \(2^n\). There are \(2^n\) different subsets of \(\{1,\ldots,n\}\), so the parity functions must span the entire space. \(\blacksquare\)
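If you prefer sanity checks to proofs, the orthonormality claim is easy to verify by brute force for small \(n\):

```python
import itertools

n = 4
all_x = list(itertools.product([-1, 1], repeat=n))            # all 2^n inputs
all_subsets = [S for r in range(n + 1)
               for S in itertools.combinations(range(n), r)]  # all 2^n subsets

def f_S(S, x):
    """Parity function: product of x_i over i in S (empty product is 1)."""
    out = 1
    for i in S:
        out *= x[i]
    return out

def inner(S1, S2):
    """<f_S1, f_S2> = E_x[f_S1(x) f_S2(x)] under the uniform distribution."""
    return sum(f_S(S1, x) * f_S(S2, x) for x in all_x) / len(all_x)

for S1 in all_subsets:
    for S2 in all_subsets:
        assert inner(S1, S2) == (1.0 if S1 == S2 else 0.0)
print("parity functions are orthonormal for n =", n)
```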
Low Degree Approximation
Because the parity functions form an orthonormal basis, any function over \(\{-1, +1\}^n\) can be decomposed into a sum of parity functions, where the coefficient comes from the inner product. Let \(P(n)\) be the set of all subsets of \(1\) to \(n\). Then for any \(f\),
\[f(x) = \sum_{S \in P(n)} \langle f, f_S \rangle f_S(x)\]At this point we draw a comparison to Fourier series to build some intuition. Any periodic signal \(s(t)\) that’s sufficiently well behaved can be decomposed into a sum of sinusoids. The sinusoids \(\{1, \sin(t), \cos(t), \sin(2t), \cos(2t), \ldots \}\) form an orthonormal basis (which you can prove by throwing enough trigonometry identities at the problem.)
Intuitively, the earlier terms represent low-frequency components of the signal, and the later terms represent high-frequency components. To reconstruct \(s(t)\) exactly, we use the full series.
\[s(t) = a_0 + \sum_{n=1}^\infty a_n \sin(nt) + \sum_{n=1}^\infty b_n \cos(nt)\]We can’t compute this exactly because the sum is infinite. However, we can approximate the signal by truncating the sum at some maximum \(N\)
\[s(t) \approx a_0 + \sum_{n=1}^N a_n \sin(nt) + \sum_{n=1}^N b_n \cos(nt)\]This corresponds to dropping all the high frequency terms. Depending on the signal, this can still be a pretty good approximation.
Now, apply the same motivation to modeling \(f(x)\). We can write \(f\) as
\[f(x) = \sum_{S \in P(n)} \langle f, f_S \rangle f_S(x)\]Computing this exactly is intractable because there are \(2^n\) subsets. However, note that each term \(f_S(x_1,\ldots,x_n)\) is the product over a choice of indices. When \(S = \emptyset\), we get a \(0\)-degree constant term. When \(S\) is all numbers from \(1\) to \(n\), we get an \(n\)-degree term.
So, what if we limited to just the subsets \(S\) where \(|S| \le d\)? This reduces the number of terms from \(2^n\) to \(O(n^d)\), turning an exponentially big problem into a polynomially big one.
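To put numbers on that, here is the count for \(n = 60\) (the number of hyperparams in the paper’s experiments) and \(d = 3\):

```python
from math import comb

n, d = 60, 3
total = 2 ** n                                       # all subsets of n variables
low_degree = sum(comb(n, k) for k in range(d + 1))   # subsets with |S| <= d
print(total, low_degree)   # 1152921504606846976 vs 36051
```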
The approximation above is the best \(d\)-degree approximation, if you measure approximation error between \(f\) and \(g\) as \(\langle f-g, f-g\rangle = \mathbb{E}[(f-g)^2]\). But the best approximation isn’t necessarily a sparse one, and once we enforce sparsity, is a good approximation guaranteed to still exist? It turns out the answer is yes. From the paper:
Fact 4. [Man94] Let \(f\) be such that \(L_1(f) \le s\). Then there exists \(g\) such that \(g\) is \(s^2/\epsilon\) sparse and \(\mathbb{E}[(f-g)^2] \le \epsilon\). Function \(g\) is constructed by taking all coefficients of magnitude \(\epsilon/s\) or larger in \(f\)’s expansion as a polynomial.
Not only does this guarantee that a sparse approximation \(g\) exists, if \(f\) is a \(d\)-degree polynomial, \(g\) is also guaranteed to be degree \(d\) or lower, because it is formed by selecting terms that already exist in \(f\), and those terms must be degree \(d\) or lower.
Polynomial Recovery Algorithm
Let’s assume we have evaluations \(y_i = f(x_i)\) of \(f\) at \(T\) different points \(x_i\). The first trick for fitting a sparse polynomial is a classic: formulate the problem as a linear regression problem in a larger feature space. There are \(O(n^d)\) features, one for each subset \(S\) with \(|S| \le d\). We want to find coefficients \(\alpha_S\) that minimize the mean squared error.
\[\sum_{i=1}^T (y_i - \sum_{|S| \le d} \alpha_S f_S(x_i))^2\]To encourage sparsity, we apply another classic solution - add L1 regularization.
\[\min_{\alpha_S} \lambda \|\alpha\|_1 + \sum_{i=1}^T (f(x_i) - \sum_{|S| \le d} \alpha_S f_S(x_i))^2\]After solving this, we take the \(s\) coefficients with the largest magnitude, where \(s\) is a hyperparam for how sparse we want the polynomial to be.
(Yes, the hyperparam optimization method itself has hyperparams. It’s great.)
This gives the main subroutine, called Polynomial Sparse Recovery.
When called, it returns a sparse polynomial \(g\), and the set of all variables \(J\) that \(g\) depends on.
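Here is a rough sketch of what that subroutine could look like, using scikit-learn’s Lasso for the L1-regularized regression. The paper’s actual implementation and solver settings may differ; this is just to make the structure concrete.

```python
import itertools
import numpy as np
from sklearn.linear_model import Lasso

def parity_features(X, d):
    """Expand a (T, n) matrix of +/-1 settings into one feature per subset |S| <= d."""
    n = X.shape[1]
    subsets = [S for r in range(d + 1)
               for S in itertools.combinations(range(n), r)]
    feats = np.stack([np.prod(X[:, list(S)], axis=1) if S else np.ones(len(X))
                      for S in subsets], axis=1)
    return feats, subsets

def polynomial_sparse_recovery(X, y, d=3, s=5, lam=0.1):
    """X: (T, n) hyperparam settings, y: (T,) observed errors."""
    feats, subsets = parity_features(X, d)
    model = Lasso(alpha=lam, fit_intercept=False).fit(feats, y)
    # Keep the s coefficients with the largest magnitude.
    top = np.argsort(-np.abs(model.coef_))[:s]
    g = [(subsets[i], model.coef_[i]) for i in top if model.coef_[i] != 0]
    J = sorted({i for S, _ in g for i in S})      # variables g depends on
    return g, J
```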
Runtime Bound For PSR
Let \(f\) be an \((\epsilon, d, s)\)-bounded function. Let \(\{H_S\}\) be an orthonormal polynomial basis bounded by \(K\). Then the PSR algorithm finds a \(g\) that’s within \(\epsilon\) of \(f\) in time \(O(n^d)\) with sample complexity \(T = \tilde{O}(K^2s^2 \log N/\epsilon)\).
In this statement, \(N\) is the size of the polynomial basis. For the special case of parity functions, we have \(K = 1\) and \(N = O(n^d)\), giving \(T = \tilde{O}(s^2 d \log n / \epsilon)\).
See paper for the proof.
Harmonica: The Full Algorithm
We assume \(f\) is the error %, and that we want to minimize it. Sample some values \(f(x_i)\) by training some models, and fit a sparse polynomial \(g\).
Since \(g(x)\) has at most \(s\) terms, each of which is degree \(d\), it can depend on at most \(sd\) different variables. With the right choice of \(s,d\), brute forcing all \(2^{sd}\) possibilities is small enough to be doable. Note that \((2^s)^d\) is likely dwarfed by the \(O(n^dT)\) runtime for doing the linear regression.
However, this only gives settings for the variables \(J\) that \(g\) depends on. Thus, the full algorithm proceeds in stages. At each stage, we fit a sparse polynomial \(g\), find the settings that minimize \(g\), freeze those hyperparams, then iterate in the smaller hyperparam space. Applying this multiple times shrinks the hyperparam space, until it’s small enough to use a baseline hyperparam optimization method, like random search or successive halving.
(In practice, the authors find the \(t\) best settings instead of just the best one, then pick one of those \(t\) at random in the later stages. For example, if \(n=30\), \(t = 4\) and stage 1 decided values for 10 hyperparams, in stage 2 we sample one of the \(t\) settings to decide the first 10, then sample randomly for the remaining 20.)
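Putting the pieces together, here is a very simplified sketch of the staged loop, reusing the `polynomial_sparse_recovery` sketch from above. It fixes the single best setting per stage rather than sampling among the top \(t\), and hands the final search to whatever base optimizer you pass in, so treat it as an outline rather than a faithful reimplementation.

```python
import itertools
import numpy as np

def harmonica(n, evaluate, n_stages=2, samples_per_stage=300, d=3, s=5,
              base_optimizer=None):
    """Simplified Harmonica outline.

    evaluate(x): trains a model with the +/-1 hyperparam vector x and
        returns its error (expensive!).
    base_optimizer(free_vars, evaluate): returns settings for the
        remaining free variables (e.g. random search or successive halving).
    """
    fixed = {}                                    # index -> frozen value
    for _ in range(n_stages):
        # Sample random settings for the still-free variables.
        X = np.random.choice([-1, 1], size=(samples_per_stage, n))
        for i, v in fixed.items():
            X[:, i] = v
        y = np.array([evaluate(x) for x in X])

        g, J = polynomial_sparse_recovery(X, y, d=d, s=s)

        # Brute force the (at most s*d) variables that g depends on.
        best_val, best_assign = float("inf"), ()
        for assign in itertools.product([-1, 1], repeat=len(J)):
            setting = dict(zip(J, assign))
            val = sum(c * np.prod([setting[i] for i in S]) for S, c in g)
            if val < best_val:
                best_val, best_assign = val, assign
        fixed.update(zip(J, best_assign))

    free_vars = [i for i in range(n) if i not in fixed]
    if base_optimizer is not None:
        fixed.update(base_optimizer(free_vars, evaluate))
    return fixed
```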
Experiments
The problem used is CIFAR-10, trained with a deep Resnet model. The model has 39 hyperparams, deciding everything from learning rate to L2 regularization to connectivity of the network, batch size, and non-linearity. To make the problem more difficult, 21 dummy hyperparams are added, giving 60 hyperparams total.
Harmonica is evaluated with 300 samples at each stage (meaning 300 trained models), with degree \(d = 3\), sparsity \(s=5\), and \(t=4\). They perform 2-3 stages of PSR (depending on the experiment), and use successive halving as the base hyperparam optimizer.
Results Summary
Harmonica outperforms random search, Hyperband and Spearmint.
In this plot, the difference between Harmonica 1 and Harmonica 2 is that in Harmonica 2, the final stage is given less time. This gives worse performance, but better runtime. The faster version (Harmonica 2) runs 5x faster than Random Search and outperforms it. The slower version (Harmonica 1) is still competitive with other approaches in runtime while giving much better results. Even when the base hyperparam optimizer is random search instead of successive halving, it outperforms random search in under half the time.
As Harmonica discovers settings for the important hyperparams, the average test error in the restricted hyperparam space drops. Each stage gives another drop in error, but the drop is small between stage 2 and stage 3 - I’m guessing this is why they decided to only use 2 stages in the timing experiments.
To show that the chosen terms are meaningful, the coefficients in each stage are sorted by magnitude, to show which hyperparams influence error the most. Results are shown for a 3-stage Harmonica experiment, where sparsity \(s = 8\) for the first 2 stages and \(s = 5\) for the 3rd stage.
The minimizing settings often line up with human convention. Batch norm should be on (vs off), the activation function should be ReLU (vs sigmoid or tanh), and Adam should be used (vs SGD).
Check the original paper for more details on the hyperparam definitions.
One especially interesting term is the one discovered at stage 1-8. The dummy variable is extra and can be ignored. The authors say that in their code, if Optnet and Share gradInput are both true, training becomes unpredictable. The learned \(g\) is minimized if this term equals \(-1\), which happens if only one of Optnet and Share gradInput is true at a given time.
Advantages
In each stage of Harmonica, each model can be evaluated in parallel. By contrast, Bayesian optimization techniques are more difficult to parallelize because the derivation often assumes a single-threaded evaluation.
Enforcing hyperparam sparsity leads to interpretability - weights of different terms can be used to interpret which hyperparams are most important and least important. The approach successfully learned to ignore all dummy hyperparams.
Disadvantages
A coworker mentioned that Harmonica assumes a fixed computation budget for each stage, which doesn’t make it as good in an anytime setting (which is a better fit for Bayesian optimization.)
My intuition says that this approach works best when you have a large number of hyperparameters and want to initially narrow down the space. Each stage requires hundreds of models, and in smaller hyperparam spaces, evaluating 100 models with random search is going to be enough tuning already. That being said, perhaps you can get away with fewer if you’re working in a smaller hyperparam space.
In the first two stages, Harmonica is run on a smaller 8 layer network, and the full 56 layer Resnet is only trained in the final stage. The argument is that hyperparams often have a consistent effect as the model gets deeper, so it’s okay to tune on a small network and copy the settings to a larger one. It’s not clear whether a similar trick was used in their baselines. If it wasn’t, the runtime comparisons aren’t as impressive as they look.
The approach only works if it makes sense for the features to be sparse. This isn’t that big a disadvantage - in my experience, neural net hyperparams are nearly independent. This is backed up by the Harmonica results - the important terms are almost all degree 1. (The authors observed that they did need \(d \ge 2\) for best performance.)
The derivation only works with discrete, binary features. Extending the approach to arbitrary discrete hyperparams doesn’t look too bad (just take the closest power of 2), but extending to continuous spaces looks quite a bit trickier.
Closing Thoughts
Overall I like this paper quite a bit. I’m not sure how practical it is - the “tune on small, copy to large” trick in particular feels a bit sketchy. But that should only affect the runtime of the algorithm, not the performance. I think it’s neat that it continues to work in empirical land.
One thing that’s nice is that because the final stage can use any hyperparam optimization algorithm, it’s easy to mix-and-match Harmonica with other approaches. So even if Harmonica doesn’t pan out in further testing, the polynomial sparse recovery subroutine could be an interesting building block in future work.
On The Perils of Batch Norm
This post is written for deep learning practitioners, and assumes you know what batch norm is and how it works.
If you’re new to batch norm, or want a refresher, a brief overview of batch norm can be found here.
* * *
Let’s talk about batch norm. To be more specific, let’s talk about why I’m starting to hate batch norm.
One day, I was training a neural network with reinforcement learning. I was trying to reproduce the results of a paper, and was having lots of trouble. (Par for the course in RL.) The first author recommended I add batch norm if I wasn’t using it already, because it was key to solving some of the environments. I did so, but it still didn’t work.
A few days later, I found out that when running my policy in the environment,
- I fed the current state in a batch of size \(1\).
- I ran the policy in train mode.
So I was normalizing my input to \(\vec{0}\) all the time. Which sounds like a pretty obvious issue, but thanks to reinforcement learning’s inherent randomness, it wasn’t obvious my input was always \(\vec{0}\).
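The arithmetic of the bug is easy to see if you write out train-mode batch norm by hand (ignoring the learned scale and shift, which start at 1 and 0 anyway):

```python
import numpy as np

def batch_norm_train(x, eps=1e-5):
    """Train-mode batch norm with gamma=1, beta=0: normalize with batch statistics."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    return (x - mean) / np.sqrt(var + eps)

state = np.array([[1.7, -0.3, 4.2]])   # a single state: batch of size 1
print(batch_norm_train(state))         # [[0. 0. 0.]] - every feature is its own batch mean
```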
I fixed it, and started getting the results I was supposed to get.
* * *
A few months later, an intern I was working with showed me a fascinating bug in his transfer learning experiments. He was using my code, which used TensorFlow’s MetaGraph tools. They let you take a model checkpoint and reconstruct the TF graph exactly the way it was at the time the checkpoint got saved. This makes it really, really easy to load an old model and add a few layers on top of it.
Unfortunately, MetaGraph ended up being our downfall. Turns out it doesn’t play well with batch norm! Model checkpoints are saved while the model is training. Therefore, the model from the meta checkpoint is always stuck in train mode. Normally, that’s fine. But batch norm turns it into a problem, because the train time code path differs from the test time code path. We couldn’t do inference for the same reason as the previous bug - we’d always normalize the input to \(\vec{0}\). (This is avoidable if you make the `is_training` flag a placeholder, but for structural reasons that wasn’t doable for this project.) I estimate we spent at least 6 hours tracking down the batch norm problem, and it ended with us concluding we needed to rerun all of the experiments we had done so far.
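For reference, here’s a minimal sketch of the placeholder workaround in TF 1.x style (layer sizes are made up, and the moving-average update ops are omitted for brevity):

```python
import tensorflow as tf  # assumes a TF 1.x style graph

# Default to inference mode; switch to train mode via feed_dict.
is_training = tf.placeholder_with_default(False, shape=(), name="is_training")

x = tf.placeholder(tf.float32, [None, 128])
h = tf.layers.dense(x, 64, activation=tf.nn.relu)
h = tf.layers.batch_normalization(h, training=is_training)

# Train step:  sess.run(train_op, feed_dict={x: batch, is_training: True})
# Inference:   sess.run(h, feed_dict={x: batch})   # falls back to the default, False
```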
* * *
That same day (and I mean literally the same day), I was talking to my mentor about issues I was having in my own project. I had two implementations of a neural net. I was feeding the same input data every step. The networks had exactly the same loss, exactly the same hyperparameters, with exactly the same optimizer, trained with exactly the same number of GPUs, and yet one version had 2% less classification accuracy, and consistently so. It was clear that something had to be different between the two implementations, but what?
It was very lucky the MetaGraph issues got me thinking about batch norm. Who knows how long it would have taken me to figure it out otherwise?
Let’s dig into this one a bit, because this problem was the inspiration for this blog post. I was training a model to classify two datasets. For the sake of an example, let’s pretend I was classifying two digit datasets, MNIST and SVHN.
I had two implementations. In the first, I sample a batch of MNIST data and a batch of SVHN data, merge them into one big batch of twice the size, then feed it through the network.
In the second, I create two copies of the network with shared weights. One copy gets MNIST data, and the other copy gets SVHN data.
Note that in both cases, half the data is MNIST, half the data is SVHN, and thanks to shared weights, we have the same number of parameters and they’re updated in the same way.
Naively, we’d expect the gradient to be the same in both versions of the model. And this is true - until batch norm comes into play. In the first approach, the model is trained on one mixed batch of MNIST and SVHN data. In the second approach, the model is trained on two batches, one of just MNIST data, and one of just SVHN data.
At training time, everything works fine. But you know how the two networks have shared weights? The moving averages for dataset mean and variance were also shared, getting updated on both datasets. In the second approach, the top network is trained with estimated mean and variance from MNIST data. The bottom network is trained with estimated mean and variance from SVHN data. But because the moving average was shared across the two networks, the moving average converged to the average of MNIST and SVHN data.
Thus, at test time, the scaling and shifting that we apply is different from the scaling and shifting the network expects. And when test-time normalization differs from train-time normalization, you get results like this.
This plot is the top, median, and worst performance over 5 random seeds on one of my datasets. (This isn’t with MNIST and SVHN anymore, it’s with the two datasets I actually used.) When we do two networks with shared weights, not only was there a significant drop in performance, the variance of the output increased too.
Whenever individual minibatches aren’t representative of your entire data distribution, you can run into this problem. That means forgetting to randomize your input is especially bad with batch norm. It also plays a big role in GANs. The discriminator is usually trained on a mix of fake data and real data. If your discriminator uses batch norm, it’s incorrect to alternate between batches of all fake or all real data. Each minibatch needs to be a 50-50 mix of both.
(Aside: in practice, we got the best results by using two networks with shared weights, with separate batch norm variables for each network. This was trickier to implement, but it did boost performance.)
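To see why the shared moving average ends up wrong for both copies, here’s a small simulation. The dataset statistics below are stand-ins (the real experiment wasn’t MNIST vs SVHN anyway); the point is only that a single shared running mean converges to the midpoint of the two datasets.

```python
import numpy as np

rng = np.random.RandomState(0)
# Stand-ins for two datasets with different statistics (values are illustrative).
dataset_a = rng.normal(loc=0.13, scale=0.31, size=(10000, 1))   # "MNIST-like"
dataset_b = rng.normal(loc=0.45, scale=0.20, size=(10000, 1))   # "SVHN-like"

momentum, running_mean = 0.99, 0.0
for _ in range(1000):
    # Copy A only ever sees dataset_a batches, copy B only dataset_b batches,
    # but both copies update the *same* shared moving average.
    for data in (dataset_a, dataset_b):
        batch = data[rng.randint(len(data), size=64)]
        running_mean = momentum * running_mean + (1 - momentum) * batch.mean()

print(running_mean)   # ~0.29, the midpoint - the wrong normalization for either dataset
```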
Batch Norm: The Cause of, And Solution To, All of Life’s Problems
You may have noticed a pattern in these stories.
I’ve thought about this quite a bit, and I’ve concluded that I’m never touching batch norm again if I can get away with it.
My reasoning comes from the engineering side. Broadly, when code does the wrong thing, it happens for one of two reasons.
- You make a mistake, and it’s obvious once you see it. Something like a mistyped variable, or forgetting to call a function.
- Your code has implicit assumptions about the behavior of other code it interacts with, and one of those assumptions breaks. These bugs are more pernicious, since it can take a while to figure out what assumption your code relied on.
Both mistakes are unavoidable. People make stupid mistakes, and people forget to check all the corner cases. However, the second class can be mitigated by favoring simpler solutions and reusing code that’s known to work.
Alright. Now: batch norm. Batch norm changes models in two fundamental ways.
- At training time, the output for a single input \(x_i\) depends on the other \(x_j\) in the minibatch.
- At testing time, the model runs a different computation path, because now it normalizes with the moving average instead of the minibatch average.
Almost no other optimization trick has these properties. That makes it easier to write code that only works when inputs are minibatch independent, or only works when train time and test time do the same thing. The code’s never been pushed that way. I mean, why would it? It’s not like somebody’s going to come up with a technique that breaks those assumptions, right?
Yes, you can treat batch norm as black box normalization magic, and it can even work out for a while. But in practice, the abstraction leaks, like all abstractions do, and batch norm’s idiosyncrasies make it leak a lot more than it should.
Look, I just want things to work. So every time I run into Yet Another Batch Norm issue, I get disappointed. Every time I realize I have to make sure all my code is batch-norm proof, I get annoyed this is even a thing I have to do. Ever since the one network vs two network thing, I’ve been paranoid, because it is only by dumb luck that I implemented the same model twice. The difference is big enough that the whole project could have died.
So…Why Haven’t People Ditched Batch Norm?
I’ll admit I’m being unfair. Minibatch dependence is indefensible - no one is going to argue that it’s a good quality for models to have. I’ve heard many people complain about batch norm, and for good reasons. Given all this, why is batch norm still so ubiquitous?
There’s a famous letter in computer science: Dijkstra’s Go To Statement Considered Harmful. In it, Dijkstra argues that the goto statement should be avoided, because it makes code harder to read, and any program that uses goto can be rewritten to avoid it.
I really, really wanted to title this post “Batch Norm Considered Harmful”, but I couldn’t justify it. Batch norm works too well.
Yes, it has issues, but when you do everything right, models train a lot faster. No contest. There’s a reason the batch norm paper has over 1400 citations, as of this post.
There are alternatives to batch norm, but they have their own trade-offs. I’ve had some success with layer norm, and I hear it makes way more sense with RNNs. I’ve also heard it doesn’t always work with convolutional layers.
Weight norm and cosine norm also sound interesting, and the weight norm paper said they were able to use it in a problem where batch norm didn’t work. I haven’t seen too much adoption of those methods though. Maybe it’s a matter of time.
Layer norm, weight norm, and cosine norm all fix the contracts that batch norm breaks. If you’re working on a new problem, and want to be brave, I’d try one of those instead of batch norm. Look, you’ll need to do hyperparam tuning anyways. When tuned well, I’d expect the difference between various methods to be pretty low.
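As a quick illustration of why layer norm sidesteps both problems, here’s the core computation (minus the learned gain and bias). Each example is normalized over its own features, so there’s no minibatch dependence and no separate test-time code path:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each example over its feature dimension - no batch statistics."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

x = np.random.randn(1, 8)      # works fine even with a batch of one
print(layer_norm(x))
```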
(If you want to be extra brave, you could try batch renormalization. Unfortunately it still has moving averages that are only used at test time. EDIT (June 7, 2017): multiple people, including some of the paper authors, have told me this is incorrect. They’re right, ignore this paragraph.)
In my case, I can’t afford to switch from batch norm. Previous state of the art used batch norm, so I know it works, and I’ve already paid my dues of getting batch norm to work with my model. I imagine other researchers are in similar spots.
It’s the Faustian bargain of deep learning. Faster training, in exchange for insanity. And I keep signing it. And so does everybody else.
Oh well. At least we didn’t have to sacrifice any goats.