Diving deep series
This is where we take a known method and experiment the shit out of it, to hopefully squeeze out
any remaining potential. It's just that when it comes to these things, I'm really stupid and I
really don't have a feel for how these systems work at all, so I just want to discover new
things and hopefully share them with you.
The way we will approach and set up these experiments is to formulate everything in terms of
events happening between two very extreme points. Just think of the two most extreme possible
points with directly opposing characteristics, then consider everything in between. Extreme
points are, by definition, pretty extreme, so it's unlikely that an existing theory can explain
both of them at the same time. However, because our universe is consistent, they can still
coexist, and there must be another underlying theory that explains both extreme points. The hope
is that we can recognize patterns in the universe's internal representation when it throws some
interesting phenomenon at us. In the process, we will create specific choke points along the way
from one extreme to the other to discover even more rules and patterns.
Here, I try to reduce the ML problem to arguably the simplest task possible: predicting damn
simple functions. The hope is that by doing this, observing very closely what's going on, and
experimenting the shit out of it, we can discover some other fundamental truths about
intelligence and advance the field. I start out by setting things up, then look at the dynamics
of untrained networks, the training process, and the training process in slow motion. After
that, we discuss an interesting phenomenon and conclude with transfer learning experiments.
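To make that concrete, here is a minimal sketch of what "predicting damn simple functions" looks like in code. The framework (PyTorch), the sin target, and the tiny MLP are all my own placeholders for whatever setup the original posts use:

```python
import torch
import torch.nn as nn

# Toy setup: fit y = sin(x) on a 1D interval with a tiny MLP.
x = torch.linspace(-3.0, 3.0, 256).unsqueeze(1)
y = torch.sin(x)

model = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 1))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(2000):
    loss = nn.functional.mse_loss(model(x), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
    if step % 500 == 0:
        print(step, loss.item())
```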
In this post, we will try to address a lot of the questions raised in the post above. We start
off by building tools to show the network's dynamics, to make them more visible to us. Then we
will try our hardest to fuck the network up, making it virtually useless. Then we do two more
experiments to find out what the network's exhaustion point is, whether it works like metal
fatigue, and whether we can discover something new.
We will try to compare the scale of the intelligent systems that we build, explore common
characteristics, go over thermal limits and training effectiveness, and ask when we should
expect to be able to train whole human brains.
Just as advertised, we will train a CNN to a reasonable level, pass interesting images through
it, look at the convolutional layers' outputs, and see what insights we can gain.
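A rough sketch of how that inspection could work, assuming PyTorch and using forward hooks to capture a layer's output (the resnet18 here is just a stand-in for whatever CNN we actually train):

```python
import torch
from torchvision import models

# resnet18 with random weights is only a placeholder; swap in the trained CNN.
model = models.resnet18(weights=None).eval()

feature_maps = {}

def save_output(name):
    def hook(module, inputs, output):
        feature_maps[name] = output.detach()
    return hook

# Capture the output of the first convolution layer.
model.conv1.register_forward_hook(save_output("conv1"))

image = torch.randn(1, 3, 224, 224)  # stand-in for an "interesting" image
with torch.no_grad():
    model(image)

# One feature map per kernel: shape [1, 64, 112, 112] for this layer.
print(feature_maps["conv1"].shape)
```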
Here, we will take a regular CNN and slowly change the input image so that the activations of
all kernels in one convolution layer get as low as possible, and watch how the image changes.
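One way to set that up (a hedged sketch under the same PyTorch assumption) is to freeze the network and run gradient descent on the image itself, using the layer's total activation as the quantity to minimize:

```python
import torch
from torchvision import models

model = models.resnet18(weights=None).eval()  # placeholder for the trained CNN
for p in model.parameters():
    p.requires_grad_(False)  # only the image gets updated

image = torch.randn(1, 3, 224, 224, requires_grad=True)
opt = torch.optim.Adam([image], lr=0.05)

for step in range(200):
    # Push every kernel's response in this one conv layer toward zero.
    activations = model.conv1(image)
    loss = activations.pow(2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# `image` now shows what the input drifts toward when this layer goes quiet.
```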
Here, I hope to explain and show the performance gains of convolution layers, and the other
benefits they provide over fully connected layers.
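As a quick back-of-the-envelope illustration (my own example, not from the post): map a 32x32x3 image to 16 feature channels of the same spatial size, once with a convolution and once with a fully connected layer, and compare parameter counts:

```python
import torch.nn as nn

conv = nn.Conv2d(3, 16, kernel_size=3, padding=1)  # 3*16*3*3 + 16 = 448 parameters
fc = nn.Linear(32 * 32 * 3, 32 * 32 * 16)          # 3072*16384 + 16384 = 50,348,032 parameters

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(conv), count(fc))  # 448 vs 50348032
```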
Here, we will look more closely at which learning rate works best for training a network that
fits simple functions and one that classifies CIFAR-10. We will also talk about dimensionality
and go deep into some of the weird behaviors of learning rates.
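A minimal sketch of what such a sweep could look like on the simple-function network (the grid of rates, plain SGD, and the toy model are my own placeholders):

```python
import torch
import torch.nn as nn

x = torch.linspace(-3.0, 3.0, 256).unsqueeze(1)
y = torch.sin(x)

for lr in [1e-4, 1e-3, 1e-2, 1e-1, 1.0]:
    torch.manual_seed(0)  # same init for every rate so runs are comparable
    model = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 1))
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(1000):
        loss = nn.functional.mse_loss(model(x), y)
        opt.zero_grad()
        loss.backward()
        opt.step()
    print(f"lr={lr:g}  final loss={loss.item():.5f}")
```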
Here, we will look at what the activation distributions look like after a few affine layers. We
will discuss what it takes to keep the network training. This sounds super vague, but
essentially we're multiplying many tensors together and plotting their means and standard
deviations.
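Concretely, that measurement can be as simple as the sketch below (the depth, width, tanh, and default initialization are all assumptions on my part):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(1024, 512)  # a batch of inputs with mean ~0, std ~1

layers = [nn.Linear(512, 512) for _ in range(10)]
for i, layer in enumerate(layers):
    x = torch.tanh(layer(x))
    # Watch the statistics drift (or collapse) as depth increases.
    print(f"layer {i}: mean={x.mean().item():.4f}  std={x.std().item():.4f}")
```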
And here are some ideas I hope to explore in the future:
- Batch normalization: In the original paper that introduces batch normalization, the authors
  describe a concept they call "covariate shift". This is actually the behavior we see in the
  Learning basic functions: Initial experimentation post. When there's a distributional shift,
  the network kinda collapses its learned geometry, drags itself over to a high entropy zone,
  then reestablishes the geometry. The paper proposes to make this process easier for the
  network by letting it adapt to the distributional shift without collapsing its learned
  geometry. We will test this on the damn simple function network and see what the general
  distribution of the layers looks like (see the batch norm sketch after this list).
- Network stability: While thinking about batch normalization, I get the feeling that the
  problem can manifest itself in a whole lot of places and can slip into virtually every part
  of the network, as networks are usually porous. I also get the feeling that the problem is
  much more fundamental than covariate shift, and that the actual problem we're facing is that
  deep networks are inherently unstable, so this unstable behavior is what gets interpreted as
  covariate shift. So I hope to describe this unstable behavior better, again with a lot of
  experimentation. If possible, I hope to discuss the domain adaptation method mentioned at
  the beginning of the batch normalization paper.
- Dropout vs L2 regularization: When I first heard about dropout (original paper) as an actual
  method that people use, I could not understand why it works at all. If we have a dropout
  rate of 20%, then the evaluation mode's effective weights should be 1/0.8 times larger than
  in training mode. This should just fuck up the network's own representation in the default
  case, and work fine only if the network is fighting the behavior, which should at least
  degrade performance. On top of that, the default dropout percentage is 50%, which is
  incredibly large, so I'm very curious what happens when we vary that number. So how does it
  work at all? (See the dropout scaling sketch after this list.) Furthermore, L2
  regularization actually feels like it should work, because it actively smooths out the
  weight landscape, making it easier for the network to find a minimum. However, L2
  regularization isn't used much at all anywhere I look. So what's up with that? Did people
  just find out the performance is bad? Also, dropout is sort of viewed as a random forest of
  networks, but with 1/2 the cost, not 1/10 the cost like it normally is.
- Dropout position in CNNs: Typically, dropout is used in the linear layers after the
activation and after all convolutions have concluded. What happens when we mix those around?
- Upsampling of generated images: The idea is that we have a network that generates images at
  25x25. We can freeze it, then train it on a dataset at 50x50 using a simple transposed
  (inverse) convolution layer (with batch normalization along the way, of course). Then
  unfreeze the previous layers and train the whole thing. Do this again and again, to see the
  limits of this.
- Using GANs to extract features: It seems like I can extract internal information from a GAN
  to use as a pre-trained model for a classifier, so verify this behavior. It also feels like
  we can extract information directly from the GAN pair without having to put it inside a
  transfer learning setup. If this is truly the case, then we may have actually cracked the
  explainable AI problem that has sort of loomed over the field for many years.
- Verify the initialization behavior seen in Xavier Glorot and Yoshua Bengio's "Understanding
  the difficulty of training deep feedforward neural networks". It seems straightforward
  enough, but still, vary the parameters around to see if there's anything interesting.
- Using pre-trained networks as opposed to creating them from scratch in the DCGAN
  architecture: DCGAN is basically a GAN with deep convolutional networks inside, as described
  in the original paper. The authors expressed pain in finding the right hyperparameters, so
  would it be easier to just slap a resnet34 or something in front?
- GAN interpolation: I'm interested in testing two setups. One is a network taking a single
  integer as input and generating a 28x28 image, where the discriminator is only given 10
  images, one per digit. Then see whether the generator essentially learns to copy the 10
  images seen by the discriminator. Also, at that point, vary the input number so that it
  falls between integers, and see if the network actually understands the concept of "between
  2 and 3", for example. An extended experiment is to pass in not just a number but a
  100-feature latent vector, keeping the first parameter as the digit information. Then slowly
  reduce those 100 features and watch what happens.
- GAN where one party is super strong but frozen: It will be interesting to see what happens
  when either the generator or the discriminator is super strong from the get-go (trained in
  another GAN pair earlier and achieving superhuman results), but is frozen and can't keep up
  with the opposing party if that party ever manages to come close in performance. This will
  serve as the upper extreme point for GAN use cases, and the setup above should be the lower
  extreme point, so we can test everything in between and see what makes sense and what
  doesn't.
- Actually analyze GANs when they approach superhuman level: In Ian Goodfellow's original
  paper about GANs, he mentioned that they don't usually converge to the theoretical maximum
  predicted by game theory. So what actually happens when they get to superhuman level and are
  saturated? What does the internal representation look like?
- Analyze GANs' unstable behavior: GANs actually seem extremely unstable. I mean, why doesn't
  the generator just shut off the input by driving the input kernels to 0 and always generate
  the exact same images as the supposedly "real" ones? Suppose it has the internal capacity to
  store 10 images; then the discriminator will just classify those 10 images as fake,
  sacrificing them, because there are probably thousands of real images, so memorizing 10
  images is an infeasible policy for the generator. So I hope to figure out how many images
  the generator can actually store within its internal representation, and vary the real
  samples seen by the discriminator to see whether this kind of rote memorization is actually
  what's going on.
- Make GANs understand the physical world better: Predicting physical phenomena is still sort
  of my wet dream. If we can do that, then it seems like we have actually created a primitive
  version of a general agent, which is pretty huge. So the setup is to make a GAN generate a
  series of physical phenomena through time, kinda like how we did the learn-simple-functions
  thingy from before. And if it truly understands the world, then with the explainable AI
  experiment way above, can we extract knowledge from it?
- Another distributional shift case: I was wondering what happens if we train a simulated
  robot with joints to do something without gravity, then do the same thing with gravity
  turned on. Will it be able to learn that it now has to deal with gravity?
- Try to devise games that are difficult and very unintuitive for humans to play, then try to
  publish them and see how people figure them out. This takes inspiration from the fact that
  DeepMind's DQN system can't really play Montezuma's Revenge. The reason is that it takes
  lots of steps before there's any feedback on how well it's doing, so naturally it can't play
  the game well. Humans seem likely to have the same weakness, so try to break humans and see
  if there are any interesting observations.
- Actually try out some phenomena observed in regular human brains, to see whether they
  actually work, and again, to get a feel for them.
- In a previous post, when we were creating our own optimizer, the network could only train
  when the loss function was -log(x) or 1/sqrt(x); 1/x and 1-x didn't really work. However,
  Adam can train the network when the loss function is 1/x, x^-3, or even x^-5. That's pretty
  impressive, so try to explain why it can do that, because Adam seems to be much more stable
  than our dumb implementation.
- Dive into how xgboost works. I heard it's good for tabular data.
- Try the autoencoder thingy, where you squish an input down to a lower dimension, then
  expand it back up again. So, let's start with 1k -> 500 -> 1k and measure its
  characteristics, then do 1k -> 500 -> 100 -> 500 -> 1k. The hope is that by doing this, we
  can find the bare minimum dimension needed to encode a specific task/domain. Collect that
  information across a variety of domains, to see if we can conclude something (see the
  autoencoder sketch after this list).
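For the batch normalization item above, here is a minimal sketch of what "testing this on the damn simple function network" could look like (the width, depth, and sin target are placeholders of mine):

```python
import torch
import torch.nn as nn

# Simple-function setup with BatchNorm1d after each affine layer.
x = torch.linspace(-3.0, 3.0, 256).unsqueeze(1)
y = torch.sin(x)

model = nn.Sequential(
    nn.Linear(1, 64), nn.BatchNorm1d(64), nn.ReLU(),
    nn.Linear(64, 64), nn.BatchNorm1d(64), nn.ReLU(),
    nn.Linear(64, 1),
)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(2000):
    loss = nn.functional.mse_loss(model(x), y)
    opt.zero_grad()
    loss.backward()
    opt.step()

# Inspect the per-layer statistics that batch norm has been tracking.
for name, module in model.named_modules():
    if isinstance(module, nn.BatchNorm1d):
        print(name, module.running_mean.mean().item(), module.running_var.mean().item())
```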
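On the dropout scaling question above, here is a small check of how PyTorch's nn.Dropout handles it. It uses "inverted" dropout: the surviving activations are scaled by 1/(1-p) during training, so evaluation mode needs no rescaling at all:

```python
import torch
import torch.nn as nn

drop = nn.Dropout(p=0.2)
x = torch.ones(1, 100000)  # large vector so the averages are stable

drop.train()
train_out = drop(x)
print(train_out[train_out > 0].mean())  # surviving units are scaled to 1/0.8 = 1.25
print(train_out.mean())                 # expected value stays ~1.0

drop.eval()
print(drop(x).mean())                   # eval mode is the identity: exactly 1.0
```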
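And for the autoencoder idea, a sketch of the 1k -> 500 -> 1k version (the random-noise data is just a stand-in for a real domain; deeper variants add more stages to each side):

```python
import torch
import torch.nn as nn

# 1k -> 500 -> 1k autoencoder trained to reconstruct its input.
encoder = nn.Sequential(nn.Linear(1000, 500), nn.ReLU())
decoder = nn.Sequential(nn.Linear(500, 1000))
opt = torch.optim.Adam([*encoder.parameters(), *decoder.parameters()], lr=1e-3)

data = torch.randn(256, 1000)  # placeholder for a real dataset

for step in range(1000):
    recon = decoder(encoder(data))
    loss = nn.functional.mse_loss(recon, data)
    opt.zero_grad()
    loss.backward()
    opt.step()

print("reconstruction error:", loss.item())
```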