
GPT and Neural Networks: Underlying Mechanisms

Author: Adrian | September 26, 2025

Introduction

As programmers, we often examine the underlying principles of the tools and middleware we use. This article explains the underlying mechanisms of AI models to help readers, especially those without an AI background, better understand and apply large models.

1. Relationship Between GPT and Neural Networks

GPT is widely known. When interacting with it, we usually focus on our input and GPT's output, without knowing how the output was produced. It can appear as a black box.

GPT is a natural language processing (NLP) model based on neural networks. Large amounts of data are fed into a neural network to train the model until its outputs meet expectations. A trained model can accept user input and provide answers based on the key information in the input. To understand how GPT "thinks," it helps to start with neural networks.

2. What Is a Neural Network

Biology tells us that the human nervous system consists of billions of interconnected neurons. Artificial neural networks mimic this structure: they consist of multiple layers of artificial neurons, each of which receives inputs, performs a simple computation, and passes its result to the neurons in the next layer.

3. How Neural Networks Process Data

We have now seen what a neural network is and what its basic structure looks like. How do neurons compute on the input data? And before that, how is data fed into a neural network in the first place? The following sections explain this using images and text as examples.

How image data is fed

An image is composed of pixels. Each pixel has color information, typically represented in the RGB model using three channels: red, green, and blue. Each channel's intensity is usually an integer between 0 and 255. To store an image, a computer keeps three matrices corresponding to the red, green, and blue intensities. For a 256 by 256 image, three 256x256 matrices represent the image. These matrices can be flattened into a single vector. For example, a 256x256x3 image becomes a vector of dimension 256 * 256 * 3 = 196608. In machine learning, each component of this vector is called a feature; this 196608-dimensional vector is a feature vector that the neural network accepts as input for prediction.
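As a minimal sketch of this flattening step (assuming NumPy and a synthetic random 256x256 RGB array in place of a real image file):

import numpy as np

# Synthetic 256x256 RGB image: three channels of integers in [0, 255].
# In practice this array would come from an image-loading library.
image = np.random.randint(0, 256, size=(256, 256, 3), dtype=np.uint8)

# Flatten the three channel matrices into a single feature vector.
feature_vector = image.astype(np.float32).flatten()

print(feature_vector.shape)  # (196608,) = 256 * 256 * 3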

How text data is fed

Text consists of a sequence of characters and is first tokenized into meaningful words. A vocabulary is built from all observed words or a subset of frequent words. Each vocabulary entry is assigned a unique index, converting text into a discrete symbol sequence. Before input to a neural network, the symbol sequence is typically converted into dense vector representations.

Tokenization: ["how", "does", "neural", "network", "work"]
Build vocabulary: {"how": 0, "does": 1, "neural": 2, "network": 3, "work": 4}
Serialized text data: ["how", "does", "neural", "network", "work"] --> [0, 1, 2, 3, 4]
Vectorization (one-hot encoding is used here as an example):
[[1, 0, 0, 0, 0]
 [0, 1, 0, 0, 0]
 [0, 0, 1, 0, 0]
 [0, 0, 0, 1, 0]
 [0, 0, 0, 0, 1]]
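
The same steps can be sketched in a few lines of Python (the whitespace tokenizer and one-hot encoding here are deliberate simplifications; production NLP systems use subword tokenizers and learned embeddings):

text = "how does neural network work"

# Tokenization: split on whitespace (a deliberately simple tokenizer).
tokens = text.split()

# Build vocabulary: assign each unique token an index in order of first appearance.
vocab = {token: idx for idx, token in enumerate(dict.fromkeys(tokens))}

# Serialize: map each token to its vocabulary index.
indices = [vocab[token] for token in tokens]

# Vectorize: one-hot encode each index.
one_hot = [[1 if i == idx else 0 for i in range(len(vocab))] for idx in indices]

print(indices)   # [0, 1, 2, 3, 4]
for row in one_hot:
    print(row)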

The vector sequence is then used as input for training or prediction. Having covered input formats, we next consider how neural networks make predictions.

How Neural Networks Make Predictions

First, clarify the difference between training and prediction. Training adjusts model parameters using labeled data so the model learns input-output relationships. Prediction uses a trained model to infer outputs for new inputs.

Prediction in neural networks is based on linear transformations of the input feature vector x. The basic form uses a weight vector w and a bias b to compute a score z = w · x + b, that is, the dot product of the weights and the features plus the bias. Here x is the feature vector, w represents the importance of each input feature, and b is a bias term that shifts the prediction.

For example, if an input has n features x1 through xn, then z = w1*x1 + w2*x2 + ... + wn*xn + b. This form is the basis of logistic regression for binary classification. The linear result z is usually passed through a nonlinear activation function, such as the sigmoid function, which maps z into the range (0, 1) so it can be interpreted as a probability. Thresholding at 0.5 then separates the positive and negative classes.
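
A minimal sketch of a single neuron's prediction along these lines (the weights, bias, and inputs below are illustrative, not trained values):

import math

def sigmoid(z):
    # Map the linear score z into (0, 1) so it can be read as a probability.
    return 1.0 / (1.0 + math.exp(-z))

def predict(x, w, b):
    # Linear score: weighted sum of the features plus the bias.
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return sigmoid(z)

# Toy example: three features with hand-picked weights and bias.
x = [0.5, 1.2, -0.3]
w = [0.8, -0.4, 0.2]
b = 0.1

p = predict(x, w, b)
print(p, "positive" if p >= 0.5 else "negative")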

Activation functions like sigmoid also introduce nonlinearity, enabling neural networks to model complex relationships. Without activation functions the network can only represent linear transformations. With nonlinear activations and sufficient depth, neural networks can approximate complex functions.

How Neural Networks Learn

After producing a prediction, the network evaluates its accuracy using a loss function, which measures the difference between predicted values and true labels. A common loss for binary classification is the log loss. Learning aims to adjust model parameters to minimize the loss.

Gradient descent is used to iteratively update weights w and bias b to reduce the loss. The step size is controlled by the learning rate: if too small, convergence is slow; if too large, updates may overshoot the optimum. Neural network computation involves two main steps: forward propagation, which computes neuron outputs from inputs; and backpropagation, which computes gradients of the loss with respect to parameters and updates parameters from the output layer back toward the input layer.
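
As a minimal sketch of this loop for the logistic-regression case above (the dataset, learning rate, and epoch count are illustrative; for a sigmoid output with log loss, the gradient of the loss with respect to z works out to p - y):

import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Toy dataset: two-feature inputs with binary labels.
data = [([0.5, 1.2], 1), ([-0.4, 0.3], 0), ([1.1, -0.7], 1), ([-0.9, -0.2], 0)]

w = [0.0, 0.0]
b = 0.0
learning_rate = 0.1

for epoch in range(100):
    for x, y in data:
        # Forward propagation: compute the predicted probability.
        z = sum(wi * xi for wi, xi in zip(w, x)) + b
        p = sigmoid(z)
        # Backpropagation: for log loss with a sigmoid output,
        # the gradient of the loss with respect to z is (p - y).
        error = p - y
        w = [wi - learning_rate * error * xi for wi, xi in zip(w, x)]
        b = b - learning_rate * error

print(w, b)  # parameters after training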

4. Summary

Neural network training is an iterative parameter optimization process that minimizes prediction loss. After sufficient training, a model learns meaningful feature representations and weight assignments that allow accurate predictions on unseen data. Trained neural networks are applied across tasks: convolutional neural networks for image classification, recurrent networks for sequence modeling, and multilayer perceptrons for recommendation systems, among others.

This article provides a high-level overview of how neural networks work. Feedback or corrections are welcome.