Deep/Convolutional GMMs

A convolutional GMM with 32 components was trained on 16×16 patches. I ran 50 EM iterations, and the log likelihood increased steadily throughout. The training set was 200,000 patches randomly selected from BSDS; the patches were zero-meaned, and flat patches with standard deviation less than 0.04 were discarded.
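As a rough illustration of the patch preparation described above, here is a minimal NumPy sketch. It is not the code used for these experiments: the function name `sample_patches`, the per-patch mean removal, and the assumption that intensities are scaled to [0, 1] (so that 0.04 is a meaningful threshold) are all my own guesses at the details.

```python
import numpy as np

def sample_patches(images, patch_size=16, n_patches=1000, min_std=0.04,
                   max_tries=1_000_000, seed=0):
    """Draw random patches, drop flat ones, and remove each patch's mean.

    Assumptions (not confirmed by the post): images are 2-D grayscale
    arrays with intensities in [0, 1], and "zero-meaned" means that the
    mean of each individual patch is subtracted.
    """
    rng = np.random.default_rng(seed)
    kept = []
    for _ in range(max_tries):
        if len(kept) == n_patches:
            break
        img = images[rng.integers(len(images))]
        top = rng.integers(img.shape[0] - patch_size + 1)
        left = rng.integers(img.shape[1] - patch_size + 1)
        patch = img[top:top + patch_size, left:left + patch_size].astype(np.float64)
        # The standard deviation does not depend on the mean, so the
        # flatness test can be applied before or after mean removal.
        if patch.std() < min_std:
            continue
        kept.append(patch - patch.mean())
    return np.stack(kept)
```

Calling `sample_patches(images, patch_size=16, n_patches=200_000)` on a list of grayscale images would give an array of shape (200000, 16, 16), provided enough non-flat patches exist.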

Each row shows ten 16×16 patches from the dataset: the 10 patches with the highest posterior probability for that particular filter (i.e. component of the GMM).
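The selection of the top patches per component can be sketched as follows. This assumes we already have an array of per-component log likelihoods and the log mixing weights; the names `log_lik`, `log_weights` and `top_patches_per_component` are hypothetical.

```python
import numpy as np

def top_patches_per_component(log_lik, log_weights, k=10):
    """Return, for each component, the indices of the k samples with the
    highest posterior probability p(c | x).

    log_lik:     (N, K) array of log p(x_n | c)
    log_weights: (K,) array of log mixing weights
    """
    log_joint = log_lik + log_weights
    # Normalize over components in the log domain to get log p(c | x_n).
    log_post = log_joint - np.logaddexp.reduce(log_joint, axis=1, keepdims=True)
    # Sort each column in descending order of posterior.
    order = np.argsort(-log_post, axis=0)
    return order[:k].T  # shape (K, k)
```

Row `c` of the result indexes the patches shown in row `c` of a figure like the ones in this post.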

The 11th column in each row shows the corresponding power spectrum of the filter. To interpret the power spectra, note that the patch Fourier spectrum must be canceled out by the filter so that the negative exponential (i.e. probability) of that patch/component is high. This is why the power spectra are “dark” on the inside and “bright” on the outside, the inverse of what we would expect. As seen from the above writeup, we can only determine the power spectrum of the filter since the Gaussian is insensitive to phase.
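To make the phase insensitivity concrete, here is a small sketch under an assumption that this post does not spell out: each component is a zero-mean Gaussian whose precision matrix is circulant, so the DFT diagonalizes it and the filter power spectrum $|H(f)|^2$ plays the role of its eigenvalues. The function name and this parameterization are mine; the actual model in the writeup may differ in its details.

```python
import numpy as np

def fourier_log_likelihood(patch, filter_power):
    """Log density of a zero-mean Gaussian with circulant precision matrix.

    filter_power holds the eigenvalues |H(f)|^2 of the precision matrix
    (strictly positive, and symmetric under f -> -f so the model is real).
    """
    n = patch.size
    spectrum = np.fft.fft2(patch) / np.sqrt(n)  # unitary DFT
    # Only |X(f)|^2 enters, so the phase of the patch is irrelevant.
    energy = np.sum(filter_power * np.abs(spectrum) ** 2)
    log_det_precision = np.sum(np.log(filter_power))
    return 0.5 * log_det_precision - 0.5 * energy - 0.5 * n * np.log(2 * np.pi)
```

Under this assumption the quadratic term is small only when the filter power is low wherever the patch power is high, which matches the "dark inside, bright outside" appearance: natural image patches carry most of their energy at low frequencies. A circular shift of the patch (for example with `np.roll`) changes only its Fourier phase and leaves the value unchanged.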

sample_1_16x16 sample_0_16x16 sample_2_16x16 sample_3_16x16

Here are 24 components trained on 8×8 patches. For both sets of filters and patches, there are significant overlaps in the power spectra of different filters.

sample_1_8x8 sample_0_8x8 sample_2_8x8

I also trained 32 components on 28×28 MNIST digits (using the 60K training digits with no data augmentation or any other preprocessing). These results (on the 10K test digits) are also interesting: there’s definitely a lot of redundancy in the components (e.g. for the digit 1), and it would be cool to “reallocate” the model capacity to other regions of the spectrum:

sample_0_28x28 sample_1_28x28 sample_2_28x28 sample_3_28x28


Here’s MNIST trained on 128 components:

sample_5_28x28_128c sample_6_28x28_128c sample_7_28x28_128c sample_8_28x28_128c sample_9_28x28_128c sample_1_28x28_128c sample_2_28x28_128c sample_3_28x28_128c sample_4_28x28_128c sample_0_28x28_128c sample_10_28x28_128c sample_11_28x28_128c sample_12_28x28_128c

Next, I trained a full (brute-force) GMM on MNIST 28×28 images, with 32 components for 10 EM iterations. Here are the top 10 samples for each component (some of the components have no samples assigned to them in the test set). It definitely seems to be better than the convolutional model:

mnist_space_2_32c mnist_space_1_32c mnist_space_0_32c

  1. Does training on different subsets of the data (e.g. different classes) give different enough models, which can be used to discriminate classes?
  2. How can we reduce overlap between the filters so that there is more discriminative power? Also, can the filters be made more narrow-band (and is this necessary)?
  3. Generate samples from these models to see what they look like.
  4. How can we reconstruct the filter phases, and is this even important? If we take an input image and just filter it with these zero-phase filters, then the output phase is the same as the input phase, so there is no loss of phase information.
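The claim in the last item is easy to check numerically. The sketch below assumes a zero-phase filter means a real, strictly positive frequency response; the names `zero_phase_filter` and `unit_phase` are hypothetical. Wherever the response is exactly zero, the output phase is undefined.

```python
import numpy as np

def zero_phase_filter(image, magnitude):
    """Filter an image with a real, non-negative frequency response.

    magnitude should be symmetric under f -> -f so that the output is real;
    np.real only discards round-off in the imaginary part.
    """
    return np.real(np.fft.ifft2(np.fft.fft2(image) * magnitude))

def unit_phase(image):
    """Unit-modulus phase factors exp(i * phase) of the image spectrum."""
    spectrum = np.fft.fft2(image)
    return spectrum / np.abs(spectrum)
```

Comparing `unit_phase(image)` with `unit_phase(zero_phase_filter(image, magnitude))` shows they agree up to round-off, while the magnitudes differ. Comparing phase factors rather than angles avoids spurious ±π mismatches at real-valued frequency bins.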

Let’s re-train a 32 component convolutional model and look at the top samples per component along with the filter:

sample_2_layer0_28x28_32c sample_1_layer0_28x28_32c sample_0_layer0_28x28_32c sample_3_layer0_28x28_32c

Now we take the outputs of this first layer and train on the 32 log-probability components, i.e. a dataset of size 60000 x 32. The log probs are pre-processed by setting the mean to 0 and the standard deviation to 1 for each sample. Then we train a full-covariance GMM with 32 components (with zero-mean components) on this second layer. Finally, for each second-layer component, the samples that respond most strongly to it are shown below:
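A minimal sketch of this second-layer step is given below. It is my own reconstruction rather than the code behind the figures: the random soft initialization, the ridge term `reg`, and the function names are all assumptions. The ridge matters because, after per-sample standardization, every row sums to zero, so the data lie in a subspace and the raw covariances are singular.

```python
import numpy as np

def standardize_rows(log_probs):
    """Give each sample (row) zero mean and unit standard deviation."""
    mu = log_probs.mean(axis=1, keepdims=True)
    sd = log_probs.std(axis=1, keepdims=True)
    return (log_probs - mu) / sd

def zero_mean_gmm_em(X, n_components, n_iter=20, reg=1e-3, seed=0):
    """EM for a mixture of zero-mean, full-covariance Gaussians."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    resp = rng.dirichlet(np.ones(n_components), size=n)  # random soft init
    for _ in range(n_iter):
        # M-step: update weights and covariances; means stay fixed at zero.
        nk = resp.sum(axis=0) + 1e-12
        weights = nk / n
        covs = np.einsum('nk,ni,nj->kij', resp, X, X) / nk[:, None, None]
        covs += reg * np.eye(d)  # assumed ridge to keep covariances invertible
        # E-step: log of weight times Gaussian density, per sample and component.
        log_joint = np.empty((n, n_components))
        for k in range(n_components):
            chol = np.linalg.cholesky(covs[k])
            z = np.linalg.solve(chol, X.T)
            maha = np.sum(z ** 2, axis=0)
            log_det = 2.0 * np.sum(np.log(np.diag(chol)))
            log_joint[:, k] = np.log(weights[k]) - 0.5 * (
                maha + log_det + d * np.log(2 * np.pi))
        log_norm = np.logaddexp.reduce(log_joint, axis=1, keepdims=True)
        resp = np.exp(log_joint - log_norm)
    return weights, covs, resp
```

With `first_layer_log_probs` of shape (60000, 32), something like `zero_mean_gmm_em(standardize_rows(first_layer_log_probs), 32)` would return the mixing weights, covariances and responsibilities; the top samples per component can then be read off the columns of the responsibilities.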

sample_2_layer1_28x28_32c sample_1_layer1_28x28_32c sample_0_layer1_28x28_32c sample_3_layer1_28x28_32c

Same as above, but with non-zero GMM means and without setting the sample means to 0 or the std to 1:

sample_2_layer0_28x28_32c_layer1_32c_nozm sample_1_layer0_28x28_32c_layer1_32c_nozm sample_0_layer0_28x28_32c_layer1_32c_nozm sample_3_layer0_28x28_32c_layer1_32c_nozm

We trained a first-layer convolutional model with 32 components on MNIST data with filter size 28×28. The digits assigned to each component, along with the filter power spectrum, are given below:


Then we generated a training set as follows: for each digit in the training set, we computed the normalized probability vector from $p(x|c)$, where $c$ ranges over the GMM components. This gives a dataset of size 60000 x 32, where each row is non-negative and adds up to 1. We then trained a second GMM with 17 components on this input. The second layer was trained with deterministic EM iterations on the data from the first layer. The resulting top digits for each component are given by:
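For reference, here is one way the normalized probability vectors could be computed from first-layer log likelihoods. It assumes the normalization is a plain division by the row sum of $p(x|c)$ (no mixing weights), done in the log domain for numerical stability; the function name is hypothetical.

```python
import numpy as np

def normalized_probabilities(log_lik):
    """Turn an (N, K) array of log p(x_n | c) into rows that are
    non-negative and sum to 1.

    Subtracting the row maximum before exponentiating avoids underflow,
    since log likelihoods of 28x28 images are typically very negative.
    """
    shifted = log_lik - log_lik.max(axis=1, keepdims=True)
    p = np.exp(shifted)
    return p / p.sum(axis=1, keepdims=True)
```

Applied to the 60000 x 32 matrix of first-layer log likelihoods, this yields the 60000 x 32 input for the second layer.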


Then we repeated the above process to train a third layer with 10 components. Actually, we started the training with 16 components, but 6 of them were pruned during the EM iterations. The resulting top digits for each component are as follows:


It’s interesting that the distribution seems much more “concentrated”. On the other hand, we see that some digits span two components, while other digits tend to be subsumed. Do we expect that a different training mechanism would better fit these distributions? Can we also do top-down training, which would take the output from the top layer and use it to refine the EM training of the lower layers?


