
Convolutional Neural Networks: Analysis and Practice

Author: Adrian · September 26, 2025

Introduction

When I first started learning about neural networks, I felt lost. I searched through a lot of online material, and many articles promised quick results like "understand convolutional neural networks in one article" or "build your own neural network framework in three minutes." Reading too many of those cost me long study hours without real progress.

After exploring on my own, I gradually understood how convolutional neural networks, or CNNs, actually work. CNNs are widely applied to tasks such as object segmentation, style transfer, and automatic colorization, but fundamentally a CNN is a feature extractor. Those applications are built on top of the features it extracts.

This article skips the biological neuron analogy and starts directly from simple examples.

A Simple Example: Classifying X and O

Suppose we want a computer to recognize whether an image contains the letter X or the letter O. Humans can tell at a glance, but a computer needs examples. We attach a label to each image, for example Label = X, to tell the computer that the image represents an X.

Not all Xs look the same, so a computer that has only memorized exact training examples may fail to recognize a new X it has not seen before. This is the familiar machine learning problem of poor generalization: the model overfits the specific examples instead of learning what makes an X an X.

To handle variations, the CNN must learn how to extract the defining features of images that represent X.

Images and Local Matching

Images are stored as pixel values. Comparing two images pixel-by-pixel is inefficient and brittle, so we use local matching methods commonly called patch matching.

Compare two images of an X: despite pixel-level differences, they share local structure. Both contain the same diagonal strokes and the same crossing point, even if those parts sit at slightly different positions. Instead of matching all pixels globally, we can match locally.

For example, to locate a face in a photo, you could tell the CNN what eyes, nose, and mouth look like, then search the image for those three features. Likewise, from a prototypical X image we can extract three features.

These features are also called filters or convolution kernels in CNNs, typically 3x3 or 5x5 in size. This is where the term convolution appears in the name convolutional neural network.

Note: the operation called convolution in CNNs is not the mathematical convolution from signal processing. The sliding-window operation used in most CNN implementations is actually cross-correlation, which omits the kernel flip of true convolution.
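Concretely, for an image I and a kernel K, cross-correlation computes the sum over (m, n) of I(i+m, j+n) * K(m, n) at each position (i, j), while true convolution flips the kernel first and computes the sum of I(i-m, j-n) * K(m, n). Because CNN kernels are learned rather than hand-specified, the flip makes no practical difference: the network can simply learn the flipped kernel.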

Computing a Local Response

The basic local operation is element-wise multiplication followed by aggregation. Take a filter and align it with a same-size window in the image, multiply corresponding elements, and record the results. For a 3x3 filter and a 3x3 window, multiply each of the nine corresponding pairs and collect the nine products.

Next, aggregate those values. In the simple example described here, we take the mean of the nine products. The aggregated value is placed into a new image at the corresponding location. Repeating this process over all window positions produces a feature map.
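Here is a minimal sketch of that single-window computation in Python with NumPy. The specific filter values and the 1/-1 pixel encoding are illustrative assumptions, not something fixed by the method:

    import numpy as np

    # A hypothetical 3x3 filter (one diagonal stroke of an X) and a
    # 3x3 image window; pixels are encoded as 1 (ink) and -1 (blank).
    filt = np.array([[ 1, -1, -1],
                     [-1,  1, -1],
                     [-1, -1,  1]])
    window = np.array([[ 1, -1, -1],
                       [-1,  1, -1],
                       [-1, -1,  1]])

    products = filt * window    # multiply corresponding elements
    response = products.mean()  # aggregate: mean of the nine products
    print(response)             # 1.0 -> perfect match at this location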

Why does the window move? The window, or receptive field, slides across the image. The stride determines how far the window moves each step. For stride = 1, the window moves by one pixel. For stride = 2, it moves by two pixels. After reaching the right edge, the window returns to the left and moves down to the next row.
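A useful sanity check on window counts: for an N x N image, an F x F filter, and stride S, the number of window positions per row (without padding) is (N - F) / S + 1. For example, sliding a 3x3 filter over a 9x9 image at stride 1 gives (9 - 3) / 1 + 1 = 7 positions per row, so the resulting feature map is 7x7. (The 9x9 size here is just an illustrative choice.)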

After computing responses for all positions, the resulting feature map represents how well each image location matches that filter. Values closer to 1 indicate strong positive match, values closer to -1 indicate strong negative match, and values near 0 indicate little or no correlation. Each filter produces one feature map, so using three filters on an X image yields three feature maps.
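Putting the sliding window and the per-window response together, a minimal sketch of the whole matching step might look like this. Real CNN layers sum rather than average, learn the filter values, and usually add padding; the averaging below follows the simple example in this article:

    import numpy as np

    def feature_map(image, filt, stride=1):
        # Slide the filter over the image and record each window's
        # mean product, as described above. No padding is applied.
        n, f = image.shape[0], filt.shape[0]
        out = (n - f) // stride + 1  # window positions per row/column
        fmap = np.zeros((out, out))
        for i in range(out):
            for j in range(out):
                window = image[i*stride:i*stride+f, j*stride:j*stride+f]
                fmap[i, j] = (window * filt).mean()
        return fmap

With three filters, the three feature maps follow directly, for example as maps = [feature_map(img, f) for f in filters], where img and filters stand in for your own data.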

Nonlinear Activation Layer

The convolution layer produces a set of linear responses. A nonlinear activation layer applies a nonlinear function to those responses. A commonly used activation is the Rectified Linear Unit, or ReLU, defined as:

f(x) = max(0, x)

ReLU keeps values greater than or equal to zero and sets negative values to zero. Since values close to 1 indicate a good match and values close to -1 indicate a mismatch, applying ReLU discards negative responses and helps concentrate on positively correlated features.
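In code, ReLU is a one-liner; the sample feature-map values below are made up for illustration:

    import numpy as np

    def relu(fmap):
        # Keep non-negative responses, zero out negative (mismatch) ones.
        return np.maximum(0, fmap)

    fmap = np.array([[ 0.77, -0.11],
                     [-0.33,  1.00]])
    print(relu(fmap))
    # [[0.77 0.  ]
    #  [0.   1.  ]]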

Pooling Layer

Even after convolution, feature maps can still be large, especially when training on tens of thousands of images. Pooling reduces the spatial size of feature maps. The two common pooling types are max pooling and average pooling. Max pooling selects the maximum value within each pooling window, while average pooling takes the average.

For example, with a 2x2 pooling window, max pooling selects the largest value in each 2x2 block and uses that to build a downsampled feature map. The window then slides according to the pooling stride. Pooling reduces data size while preserving the strongest local responses, which helps make detection less sensitive to precise location within the window.
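A minimal max-pooling sketch, assuming a square feature map whose side fits the window and stride (edge handling varies across implementations):

    import numpy as np

    def max_pool(fmap, size=2, stride=2):
        # Take the maximum of each size x size block, moving by `stride`.
        n = fmap.shape[0]
        out = (n - size) // stride + 1
        pooled = np.zeros((out, out))
        for i in range(out):
            for j in range(out):
                block = fmap[i*stride:i*stride+size,
                             j*stride:j*stride+size]
                pooled[i, j] = block.max()
        return pooled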

Basic CNN Configuration

The basic CNN building blocks are convolution layers, ReLU activation layers, and pooling layers. These layers are often stacked: the output of one layer becomes the input to the next. More layers can be added to form deeper networks for more complex tasks.
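Composing the sketches above, one conv-ReLU-pool stage of such a network could be written as follows. This reuses the feature_map, relu, and max_pool functions defined earlier and only illustrates the stacking idea:

    def forward(image, filters):
        maps = [feature_map(image, f) for f in filters]  # convolution layer
        maps = [relu(m) for m in maps]                   # activation layer
        maps = [max_pool(m) for m in maps]               # pooling layer
        return maps  # these maps feed the next stage, or the classifier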

The final fully connected layers, along with training and optimization methods, will be discussed in a follow-up article.