Unboxing the blackbox

Unboxing the blackbox series

I have been fascinated with machine learning for a while, but it really, really bugs me that the only way to access its magical wonders is through an ML framework, aka code someone else has built. It feels like a black box: I couldn't do anything outside of the framework, so my capabilities were pretty limited. However, as I dove deeper and deeper, I began to gain fine-grained control over these systems, and it has paid off: now I can truly bend them to my will, run interesting experiments, and actually gain insights. So in this series, I hope to deconstruct pretty much everything about PyTorch and explain how all of the framework's functionality can be replicated. The only aspect of PyTorch we take for granted is the ability to manipulate tensors very quickly on GPUs.

Basics

This covers a lot of simple black boxes. We will recreate nn.ReLU, nn.Linear, nn.CrossEntropyLoss, nn.BCEWithLogitsLoss, and a basic optimizer, and we will dig into nn.CrossEntropyLoss alternatives.
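To give a taste of what "recreating" means here, below is a minimal sketch of hand-rolled ReLU and Linear modules. The names MyReLU and MyLinear, and the initialization details, are my own illustrative choices, not the posts' exact code.

```python
import math
import torch
from torch import nn

class MyReLU(nn.Module):
    """Same job as nn.ReLU: zero out the negative part."""
    def forward(self, x):
        return x.clamp(min=0)

class MyLinear(nn.Module):
    """Same job as nn.Linear: y = x @ W.T + b."""
    def __init__(self, in_features, out_features):
        super().__init__()
        # Uniform init scaled by 1/sqrt(in_features), similar in spirit to nn.Linear's default
        bound = 1 / math.sqrt(in_features)
        self.weight = nn.Parameter(torch.empty(out_features, in_features).uniform_(-bound, bound))
        self.bias = nn.Parameter(torch.zeros(out_features))

    def forward(self, x):
        return x @ self.weight.t() + self.bias

x = torch.randn(4, 8)
print(MyReLU()(x).shape, MyLinear(8, 3)(x).shape)  # torch.Size([4, 8]) torch.Size([4, 3])
```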

Convolutions

Here, I briefly review how convolutions work, write a vanilla implementation, and gradually speed it up. Then I bundle everything into a PyTorch module that can be used directly in networks. Finally, I benchmark our implementation against the state of the art.
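To give a flavor of the vanilla starting point, here's a minimal sketch of a naive single-channel 2D convolution written as explicit Python loops (stride 1, no padding; the function name and layout are my own assumptions, not the post's code):

```python
import torch

def naive_conv2d(img, kernel):
    """Cross-correlate a (H, W) image with a (kh, kw) kernel, stride 1, no padding."""
    H, W = img.shape
    kh, kw = kernel.shape
    out = torch.zeros(H - kh + 1, W - kw + 1)
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = (img[i:i + kh, j:j + kw] * kernel).sum()
    return out

img = torch.randn(28, 28)
kernel = torch.randn(3, 3)
ours = naive_conv2d(img, kernel)
# Sanity check against PyTorch's own implementation
theirs = torch.nn.functional.conv2d(img[None, None], kernel[None, None])[0, 0]
print(torch.allclose(ours, theirs, atol=1e-5))  # True
```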

Transpose convolution, custom stride and padding

Pretty much what it says in the title. We will upgrade the normal convolution to handle custom stride and padding. Then we will try to create the transpose convolution function from scratch, see what it does, and conclude with benchmarks against the state of the art.
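As a rough preview, stride and padding only change which windows we visit (and for the transpose case with no output padding or dilation, the output size works out to (H_in - 1) * stride - 2 * padding + kernel_size). A minimal sketch of the strided, padded forward pass, with names that are my own assumptions rather than the post's code:

```python
import torch
import torch.nn.functional as F

def conv2d_stride_pad(img, kernel, stride=1, padding=0):
    """Naive single-channel convolution with custom stride and zero padding."""
    img = F.pad(img, (padding, padding, padding, padding))  # pad left/right/top/bottom with zeros
    H, W = img.shape
    kh, kw = kernel.shape
    out_h = (H - kh) // stride + 1
    out_w = (W - kw) // stride + 1
    out = torch.zeros(out_h, out_w)
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            out[i, j] = (img[r:r + kh, c:c + kw] * kernel).sum()
    return out

img, kernel = torch.randn(11, 11), torch.randn(3, 3)
ours = conv2d_stride_pad(img, kernel, stride=2, padding=1)
theirs = F.conv2d(img[None, None], kernel[None, None], stride=2, padding=1)[0, 0]
print(torch.allclose(ours, theirs, atol=1e-5))  # True
```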

Autograd

We will recreate one of the defining features of PyTorch: first going through what Autograd does, then observing and reverse engineering its behavior, and finally actually implementing it.
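To make "reverse engineer and reimplement" concrete, here's a minimal scalar reverse-mode autograd sketch under my own assumptions. It is nothing like PyTorch's real C++ engine, just the core idea: record how each value was computed, then walk the graph backwards applying the chain rule.

```python
class Value:
    """A scalar that remembers how it was computed, so gradients can flow back."""
    def __init__(self, data, parents=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents
        self._backward = lambda: None

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def _backward():
            self.grad += out.grad          # d(a+b)/da = 1
            other.grad += out.grad         # d(a+b)/db = 1
        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def _backward():
            self.grad += other.data * out.grad   # d(a*b)/da = b
            other.grad += self.data * out.grad   # d(a*b)/db = a
        out._backward = _backward
        return out

    def backward(self):
        # Topologically order the graph, then apply the chain rule in reverse
        order, seen = [], set()
        def build(v):
            if v not in seen:
                seen.add(v)
                for p in v._parents:
                    build(p)
                order.append(v)
        build(self)
        self.grad = 1.0
        for v in reversed(order):
            v._backward()

a, b = Value(2.0), Value(3.0)
loss = a * b + a     # d(loss)/da = b + 1 = 4, d(loss)/db = a = 2
loss.backward()
print(a.grad, b.grad)  # 4.0 2.0
```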

Adam optimizer

We will be creating a very badass optimizer, Adam, from scratch. We will devise methods to make sure that what we come up with is actually Adam, and we will also analyze some of its behaviors.
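For reference, the core Adam update is small enough to fit in a few lines. A minimal sketch of just the update rule (no weight decay, no amsgrad), with a state layout that is my own choice rather than the post's:

```python
import torch

@torch.no_grad()
def adam_step(params, state, lr=1e-3, betas=(0.9, 0.999), eps=1e-8):
    """One Adam step over a list of tensors whose .grad is already populated."""
    state["t"] += 1
    t = state["t"]
    for p in params:
        if p.grad is None:
            continue
        m, v = state["m"][p], state["v"][p]
        m.mul_(betas[0]).add_(p.grad, alpha=1 - betas[0])              # 1st moment: running mean of grads
        v.mul_(betas[1]).addcmul_(p.grad, p.grad, value=1 - betas[1])  # 2nd moment: running mean of grad^2
        m_hat = m / (1 - betas[0] ** t)                                # bias correction
        v_hat = v / (1 - betas[1] ** t)
        p -= lr * m_hat / (v_hat.sqrt() + eps)

# usage: keep m, v, and t around between steps
w = torch.randn(10, requires_grad=True)
state = {"t": 0, "m": {w: torch.zeros_like(w)}, "v": {w: torch.zeros_like(w)}}
loss = (w ** 2).sum()
loss.backward()
adam_step([w], state, lr=0.1)
```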

Diving deep

Diving deep series

This is where we will take a known method and experiment the shit out of it, to hopefully squeeze out any remaining potential. It's just that when it comes to these things, I'm really stupid and I don't have a feel for how these systems work at all, so I just want to discover new things and hopefully share them with you.

The way we will approach and set up these experiments is to formulate everything in terms of what happens between two very extreme points. Think of the two most extreme possible setups with directly opposing characteristics, then consider everything in between. Extreme points are, by definition, extreme, so it's unlikely that an existing theory will explain both of them at once. However, because our universe is consistent, they can still coexist, so there must be some underlying theory that explains both. The hope is that we can recognize patterns in the universe's internal representation when it throws some interesting phenomenon at us. Along the way, we will create specific choke points between one extreme and the other to discover even more rules and patterns.

Learning simple functions: Initial experimentation

Here, I try to reduce the ML problem to arguably the simplest task possible: predicting damn simple functions. The hope is that by observing very closely what's going on and experimenting the shit out of it, we can discover some fundamental truths about intelligence and advance the field. I start out by setting things up, then look at the dynamics of untrained networks, the training process, and the training process in slow motion. Then we discuss an interesting phenomenon and conclude with transfer learning experiments.
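To pin down what "predicting damn simple functions" looks like in practice, here is a minimal sketch of the kind of setup I use. The target function, sizes, and hyperparameters are illustrative choices, not the post's exact configuration.

```python
import torch
from torch import nn

torch.manual_seed(0)
x = torch.linspace(-3, 3, 256).unsqueeze(1)
y = torch.sin(x)                        # a damn simple function

net = nn.Sequential(nn.Linear(1, 32), nn.ReLU(),
                    nn.Linear(32, 32), nn.ReLU(),
                    nn.Linear(32, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-2)

for step in range(2000):
    opt.zero_grad()
    loss = nn.functional.mse_loss(net(x), y)
    loss.backward()
    opt.step()
    if step % 500 == 0:
        print(step, loss.item())        # watch the fit improve
```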

Learning simple functions: Network exhaustion point

In this post, we will try to address a lot of the questions raised in the post above. We start off by making tools to show the network's dynamics, to make them more visible to us. Then we will try our hardest to fuck the network up, making it virtually useless. Then we run two more experiments to find out what the network's exhaustion point is, whether it works like metal fatigue, and whether we can discover something new.

Brain capacity

We will compare the scale of the intelligent systems that we build, explore common characteristics, go over thermal limits and training effectiveness, and estimate when we should expect to be able to train whole human brains.

Convolution output channels

Just as advertised, we will train a CNN to a reasonable level, pass interesting images through it, look at the convolutional layers' outputs, and see what insights we can gain.
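A minimal sketch of how one can peek at a convolution layer's output channels with a forward hook. The pretrained resnet18 from torchvision (and the random tensor standing in for an "interesting image") are just placeholders for whatever CNN and images the post actually uses, and the string-based weights argument assumes a recent torchvision.

```python
import torch
from torchvision import models

net = models.resnet18(weights="IMAGENET1K_V1").eval()

activations = {}
def save_output(name):
    def hook(module, inputs, output):
        activations[name] = output.detach()
    return hook

# Grab the outputs of the very first convolution layer
net.conv1.register_forward_hook(save_output("conv1"))

img = torch.randn(1, 3, 224, 224)        # stand-in for an interesting image
with torch.no_grad():
    net(img)

feat = activations["conv1"][0]           # shape: (64, 112, 112), one map per output channel
print(feat.shape)
# each feat[i] can now be plotted as a grayscale image, e.g. with matplotlib's imshow
```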

Convolution kernel minimizer

Here, we will take a regular CNN and slowly change the input image so that the activations of all kernels in one convolution layer get as low as possible, then watch how the image changes.
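The mechanics boil down to gradient descent on the image instead of the weights. A minimal sketch under my own assumptions: a pretrained resnet18 as a stand-in for the CNN, and "as low as possible" read as pushing the chosen layer's responses toward zero.

```python
import torch
from torchvision import models

net = models.resnet18(weights="IMAGENET1K_V1").eval()
for p in net.parameters():
    p.requires_grad_(False)              # weights stay fixed; only the image changes

captured = {}
net.layer1[0].conv1.register_forward_hook(
    lambda module, inputs, output: captured.__setitem__("act", output))

img = torch.rand(1, 3, 224, 224, requires_grad=True)
opt = torch.optim.Adam([img], lr=0.05)

for step in range(200):
    opt.zero_grad()
    net(img)
    loss = captured["act"].pow(2).mean()  # drive all kernels' outputs toward 0
    loss.backward()
    opt.step()

print(captured["act"].abs().mean().item())  # how low did we get?
```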

Convolution performance and intuition

Here, I hope to explain and demonstrate the performance gains of convolution layers, along with the other benefits they provide over fully connected layers.

Learning rate

Here, we will look more closely at which learning rate trains best, both for a network that fits simple functions and for one that classifies CIFAR-10. We will also talk about dimensionality and dig into some of the weird behaviors of learning rates.

Gradient fluctuations

Here, we will look at what the activation distributions look like after a few affine layers, and discuss what it takes to keep the network training. This sounds super vague, but we're essentially multiplying many tensors together and plotting their means and standard deviations.
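"Multiplying many tensors and plotting their means and standard deviations" looks something like this minimal sketch: pure matrix products with no nonlinearity, showing how quickly activations blow up or vanish depending on the initialization scale. The sizes and scales are illustrative choices.

```python
import torch

torch.manual_seed(0)
x = torch.randn(512, 512)
for scale in (1.0, 0.01, 1 / 512 ** 0.5):   # too big, too small, roughly variance-preserving
    h = x.clone()
    stats = []
    for layer in range(20):
        w = torch.randn(512, 512) * scale
        h = h @ w                            # one affine layer, bias and activation omitted
        stats.append((h.mean().item(), h.std().item()))
    print(f"scale={scale:.4f}  final mean={stats[-1][0]:.3e}  final std={stats[-1][1]:.3e}")
```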

And here are some ideas I hope to get to in the future:

  • Batch normalization: In the original paper that introduces batch normalization, the authors describe a concept they call "covariate shift". This is actually the behavior we saw in the Learning simple functions: Initial experimentation post. When there's a distributional shift, the network kinda collapses its learned geometry, drags itself over to a high-entropy zone, then reestablishes the geometry. The paper proposes to make this easier for the network by letting it adapt to the distributional shift without collapsing its learned geometry. We will test this on the damn-simple-function network and see what the general distribution of the layers looks like.
  • Network stability: While thinking about batch normalization, I get the feeling that the problem can manifest itself in a whole lot of places and slip into virtually every part of the network, as networks are usually porous. I also get the feeling that the problem is much more fundamental than covariate shift: the actual problem we're facing is that deep networks are inherently unstable, and this unstable behavior is what gets interpreted as covariate shift. So I hope to describe this unstable behavior better, again with a lot of experimentation. If possible, I hope to discuss the domain adaptation method mentioned at the beginning of the batch normalization paper.
  • Dropout vs L2 regularization: When I first heard about dropout (original paper) as an actual method that people use, I could not understand why it works at all. If we have a dropout rate of 20%, then the evaluation mode's internal activations should effectively be 1/0.8 times larger than in training mode (there's a quick check of how PyTorch handles this scaling after this list). This should just fuck up the network's own representation in the default case, and work fine only if the network is fighting that behavior, which should at least degrade performance. On top of that, the default dropout percentage is 50%, which is incredibly large, so I'm very curious what happens when we vary that number. So how does it work at all? Furthermore, L2 regularization actually feels like it should work, because it actively smooths out the weight landscape, making it easier for the network to find a minimum. However, L2 regularization isn't used much at all anywhere I look. So what's up with that? Do people just find that the performance is bad? Also, dropout is sort of viewed as a random forest of networks, but at 1/2 the cost, not 1/10 the cost like it normally is.
  • Dropout position in CNNs: Typically, dropout is used in the linear layers after the activation and after all convolutions have concluded. What happens when we mix those around?
  • Upsampling of generated images: The idea is to have a network that generates images at 25x25, freeze it, then train it on a dataset at 50x50 by adding a simple transpose convolution layer (with batch normalization along the way, of course). Then unfreeze the previous layers and train the whole thing. Do this again and again to see the limits of this approach.
  • Using GANs to extract features: It seems like I can extract internal information from a GAN to use as a pre-trained model for a classifier, so verify this behavior. It also feels like we can extract information directly from the GAN pair without having to put it inside a transfer learning setup. If this is truly the case, then we may have actually cracked the explainable AI problem that has sort of loomed over the field for many years.
  • Verify the initialization behavior seen in Xavier Glorot and Yoshua Bengio's "Understanding the difficulty of training deep feedforward neural networks". It seems straightforward enough, but still, vary the parameters around to see if there's anything interesting.
  • Using pre-trained networks as opposed to training from scratch in the DCGAN architecture: DCGAN is basically a GAN with deep convolutional networks in it, as described in the original paper. The authors expressed pain in finding the right hyperparameters, so wouldn't it be easier to just slap a resnet34 or something in front?
  • GAN interpolation: I'm interested in testing two setups. One is a network that takes a single integer as input and generates a 28x28 image. The discriminator will only be given 10 images, one per digit. Then see whether the generator essentially learns to copy the 10 images seen by the discriminator. Also, at that point, vary the input number so that it falls between integers, and see if the network actually understands the concept of "between 2 and 3", for example. An extended experiment is to pass in not just a number but a 100-feature latent vector, keeping the first feature as the digit information, then slowly reduce those 100 features and watch what happens.
  • GAN where one party is super strong but frozen: It will be interesting to see what happens when either the generator or the discriminator is super strong from the get-go (trained earlier in another GAN pair to superhuman results), but is frozen and can't keep up with the opposing party if that party ever comes close in performance. This will serve as the upper extreme point for GAN use cases, and the setup above should be the lower extreme point, so we can test everything in between and see what makes sense and what doesn't.
  • Actually analyze GANs when they approach superhuman level: In Ian Goodfellow's original paper on GANs, he mentions that they don't usually converge to the theoretical optimum predicted by game theory. So what actually happens when they reach superhuman level and saturate? What does the internal representation look like?
  • Analyze GANs' unstable behavior: GANs actually seem extremely unstable. I mean, why wouldn't the generator just shut off the input by driving the input kernels to 0 and always generate the exact same images as the supposedly "real" ones? Suppose it has the internal capacity to store 10 images; then the discriminator will just classify those 10 images as fake, sacrificing them, because there are probably thousands of real images, so memorizing 10 images is an infeasible policy for the generator. So I hope to figure out how many images the generator can actually store within its internal representation, and vary the real samples seen by the discriminator to see whether this kind of rote memorization is actually what's going on.
  • Make GANs understand the physical world better: Predicting physical phenomena is still sort of my wet dream. If we can do that, then it seems like we have created a primitive version of a general agent, which is pretty huge. So the setup is to make it generate a series of physical phenomena through time, kinda like the learning-simple-functions thingy from before. And if it truly understands the world, then with the explainable AI experiment above, can we extract knowledge from it?
  • Another distributional shift case: I was wondering what happens if we train a simulated robot with joints to do something without gravity, then do the same thing with gravity on. Will it be able to learn that it now has to deal with gravity?
  • Try to devise games that are difficult and very unintuitive for humans to play, then publish them and see how people figure them out. This takes inspiration from the fact that DeepMind's DQN system can't really play Montezuma's Revenge: it takes lots of steps before there's any feedback on how well it's doing, so naturally it can't play the game well. Humans seem likely to have the same weakness, so try to break humans and see if there are any interesting observations.
  • Actually try out some phenomena observed in regular human brains, to see whether they actually hold up, and again, to get a feel for them.
  • In a previous post, when we were creating our own optimizer, the network could only train when the loss function was -log(x) or 1/sqrt(x); 1/x and 1-x didn't really work. However, Adam can train the network when the loss function is 1/x, x^-3, or even x^-5. That's pretty impressive, so try to explain why it can do that, because Adam seems to be much more stable than our dumb implementation.
  • Dive into how xgboost works. I heard it's good for tabular data.
  • Try the autoencoder thingy, where you squish an input down to a lower dimension, then expand it back to a higher dimension. So, start with 1k -> 500 -> 1k and measure its characteristics, then do 1k -> 500 -> 100 -> 500 -> 1k. The hope is that by doing this, we can find the bare minimum dimension needed to encode a specific task/domain. Collect that information across a variety of domains to see if we can conclude something. (A minimal sketch follows right after this list.)
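For the dropout bullet above, a quick check of how PyTorch actually handles the scaling: nn.Dropout uses "inverted dropout", so the surviving activations are scaled up by 1/(1-p) during training and left untouched in eval mode.

```python
import torch
from torch import nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.2)
x = torch.ones(10)

drop.train()
print(drop(x))   # zeros where units were dropped, 1/0.8 = 1.25 everywhere else

drop.eval()
print(drop(x))   # identity: all ones, no rescaling at evaluation time
```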
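And here is a minimal sketch of the autoencoder setup from the last bullet, with the 1k -> 500 -> 100 -> 500 -> 1k shape. The data is random noise just to make it runnable; the real experiments would use an actual task/domain.

```python
import torch
from torch import nn

autoencoder = nn.Sequential(
    nn.Linear(1000, 500), nn.ReLU(),   # encoder: squish down...
    nn.Linear(500, 100), nn.ReLU(),    # ...to the 100-dim bottleneck
    nn.Linear(100, 500), nn.ReLU(),    # decoder: expand back up
    nn.Linear(500, 1000),
)
opt = torch.optim.Adam(autoencoder.parameters(), lr=1e-3)

x = torch.randn(256, 1000)             # placeholder data
for step in range(500):
    opt.zero_grad()
    loss = nn.functional.mse_loss(autoencoder(x), x)   # reconstruct the input
    loss.backward()
    opt.step()
print(loss.item())                      # how well does the bottleneck preserve the input?
```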