Introduction
I should make clear up front that this article is not intended as a review of large language models. Clearly, 2023 was an extraordinary year for artificial intelligence, so there is little need to rehash that. This piece is more of a personal account from a programmer.
Since the appearance of ChatGPT and the availability of locally run large models, I have been applying this technology broadly. My goal is not only to improve coding efficiency but also to avoid wasting time on parts of programming that do not require much effort. I do not want to spend hours hunting through dry documentation, learning complex APIs that are often unnecessary, or writing throwaway code that I will discard a few hours later. Especially now that search results are so noisy, finding the few useful resources can be arduous.
At the same time, I am not new to programming. I can write code without assistance and often do. Over time I started relying more on large language models to write higher-level code, particularly Python, and less so for C. Through this experience I learned when these models help and when they slow me down. They resemble an encyclopedia or a large set of online tutorials: extremely useful to those who are willing, able, and disciplined, but more limited for others. I worry that, at least initially, they will mostly benefit those who already have advantages. But let’s proceed step by step.
Omniscient or Parroting?
One concern in the machine learning wave is that some AI experts struggle to accept the limits of their knowledge. Humans invented neural networks and, more importantly, the algorithms that automatically optimize neural network parameters. As hardware improved, it became possible to train larger models and exploit statistical patterns in data to find model designs that work well. Nevertheless, neural networks remain complex and opaque.
Some cautious scientists underestimated the new, hard-to-explain capabilities of large language models, thinking they are just slightly advanced Markov chains that can at best regurgitate variations on what they saw in the training set. Increasing evidence suggests that view is probably wrong. At the same time, many laypeople overestimate these models and attribute powers to them that they do not have. In reality, large language models can at most interpolate within the space represented by their training data. Even so, that capability is impressive. If a model could perfectly interpolate across all the code it has seen, it could replace a very large fraction of programmers. The truth is less optimistic. Models can write programs they have not literally seen before, blending ideas from their training data, but their abilities remain limited, especially where fine-grained reasoning is required. Still, these models represent one of the most significant achievements in AI to date.
Ignorant but Knowledgeable
Large language models typically perform shallow reasoning and sometimes invent facts, yet they hold massive amounts of knowledge. In programming and other domains with high-quality data, they are like partners with vast knowledge but limited understanding. Pair programming with a partner who mostly proposes nonsensical ideas is frustrating; pair programming with one who reliably follows instructions and answers the questions you ask is a different experience. Existing models cannot lead us beyond known paths, but they can take a programmer from knowing nothing about a topic to knowing enough to proceed independently. Twenty or thirty years ago such an assistant would have mattered less: you needed a couple of languages, the classic algorithms, and a handful of libraries. Today, with the proliferation of frameworks and libraries, a "well-informed but limited" assistant is genuinely valuable.
For example, my early machine learning experiments used Keras for at least a year before moving to PyTorch. I knew concepts like embeddings and residual networks but did not want to dive deeply into PyTorch documentation. With a large language model, writing Torch-based Python became easy: I only needed a clear description of the model I wanted and to ask the right questions.
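To give a sense of the kind of description involved, here is a minimal sketch of the sort of model I mean: a token embedding followed by a residual MLP block. The architecture and names are hypothetical, chosen for illustration rather than taken from a real project.

# Hypothetical example: a tiny network with an embedding and a residual block.
import torch
import torch.nn as nn

class TinyResidualNet(nn.Module):
    def __init__(self, vocab_size=1000, dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.out = nn.Linear(dim, vocab_size)

    def forward(self, tokens):       # tokens: (batch, seq) of int64 ids
        x = self.embed(tokens)       # (batch, seq, dim)
        x = x + self.mlp(x)          # residual connection
        return self.out(x)           # logits: (batch, seq, vocab_size)

logits = TinyResidualNet()(torch.randint(0, 1000, (2, 16)))
print(logits.shape)  # torch.Size([2, 16, 1000])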
Use Cases
I'm not talking about trivial questions like "How does class X perform operation Y?" More complex tasks are where modern models excel. A few years ago these capabilities seemed magical. I once asked GPT-4 to inspect a PyTorch model implementation and some data batches, and to write code that reshapes tensors to match the network input and presents them in a particular way. GPT-4 produced the code; I then tested tensor sizes and data structures in a Python REPL.
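The specifics of that project are not important, but the flavor of the task was roughly the following sketch; the shapes, layout, and normalization here are assumptions made for illustration.

# Sketch: batches arrive as (batch, height, width, channels) uint8 arrays,
# while the network expects (batch, channels, height, width) floats in [0, 1].
import numpy as np
import torch

batch = np.random.randint(0, 256, size=(8, 224, 224, 3), dtype=np.uint8)  # fake batch
x = torch.from_numpy(batch).permute(0, 3, 1, 2).float() / 255.0           # NHWC -> NCHW
print(x.shape, x.dtype)  # torch.Size([8, 3, 224, 224]) torch.float32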
Another example: I needed a Bluetooth Low Energy client for an ESP32-based device. Cross-platform BLE APIs are often poor; the practical solution was to write the code in Objective-C using macOS native APIs. That presented two problems: learning Objective-C's complex BLE APIs and recalling the details of Objective-C programming itself, such as event loops and memory management. The resulting code, while not elegant, worked and was written in a very short time. It was mostly assembled by pasting into ChatGPT the functionality I wanted to implement and iterating until the behavior was correct. The model pointed out bugs and suggested fixes along the way. Although most of the code was not written purely by the model, it significantly accelerated development. Without ChatGPT I could have completed the task, but the model made me willing to attempt it in the first place. The project also produced a useful side effect: I adapted linenoise, a line-editing library I use, to work in a multiplexed environment.
Here is another case focused more on interpreting a model than writing code. I found a convolutional neural network in ONNX format without detailed documentation. ONNX made it possible to identify the model inputs and outputs and their names, but I did not know the required input image format and size. I pasted the ONNX metadata into ChatGPT and summarized what I knew. ChatGPT inferred how inputs were organized and that outputs were likely normalized bounding boxes and scores indicating potential defects. After a few minutes I had a Python script to run inference and code to convert images into input tensors. When I showed the raw output logits for test images, ChatGPT interpreted them in context, deducing details like whether boxes were centered or referenced by top-left corners.
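The resulting workflow was along the lines of the following sketch. The file names, the 320x320 input size, and the normalization are assumptions for illustration; the real values came from the ONNX metadata.

# Sketch: load an ONNX model, inspect its declared inputs, and run inference
# on a preprocessed image. Names and sizes are hypothetical.
import numpy as np
import onnxruntime as ort
from PIL import Image

sess = ort.InferenceSession("defect_detector.onnx")
inp = sess.get_inputs()[0]
print(inp.name, inp.shape)                        # what the network expects

img = Image.open("sample.jpg").convert("RGB").resize((320, 320))
x = np.asarray(img, dtype=np.float32) / 255.0     # HWC in [0, 1]
x = x.transpose(2, 0, 1)[None, ...]               # -> NCHW with batch dim

outputs = sess.run(None, {inp.name: x})
print([o.shape for o in outputs])                 # e.g. normalized boxes and scores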
One-off Programs
Many of my interactions are similar: I need to quickly learn something or produce disposable programs. When I must verify model outputs that might be nonsense, I use the model to accelerate learning but still validate results. In other cases I let the model write the entire program when it is a throwaway task.
Example: a simple demonstration program to visualize loss curves for small neural networks. I showed GPT-4 the CSV format produced by a PyTorch training run and asked for a script that, when given multiple CSV files on the command line, would avoid plotting both training and validation curves for the same experiment and instead compare validation curves across experiments. GPT-4 produced a working script in thirty seconds.
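I am not reproducing the exact script here, but a plausible reconstruction, assuming the CSV has columns such as epoch, train_loss, and valid_loss, looks like this:

# Sketch: plot only the validation curve from each CSV, one curve per experiment,
# so different runs can be compared on a single chart. Column names are assumptions.
import sys
import pandas as pd
import matplotlib.pyplot as plt

for path in sys.argv[1:]:
    df = pd.read_csv(path)
    plt.plot(df["epoch"], df["valid_loss"], label=path)  # skip train_loss on purpose

plt.xlabel("epoch")
plt.ylabel("validation loss")
plt.legend()
plt.show()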
Here is another example of a disposable script, this one grouping reservation data from a CSV export to compute average nightly rates per listing, broken down by year and month:

import pandas as pd

pd.set_option('display.max_rows', None)
df = pd.read_csv('listings.csv')
reservations = df[df['Type'] == 'Reservation']
reservations['Start Date'] = pd.to_datetime(reservations['Start Date'])
reservations['Year'] = reservations['Start Date'].dt.year
reservations['Month'] = reservations['Start Date'].dt.month
reservations['Nightly Rate'] = (reservations['Amount'] - reservations['Cleaning Fee']) / reservations['Nights']
all_listings = reservations['Listing'].unique()
all_years = reservations['Year'].unique()
all_months = range(1, 13)
index = pd.MultiIndex.from_product([all_listings, all_years, all_months], names=['Listing', 'Year', 'Month'])
all_data = pd.DataFrame(index=index).reset_index()
merged_data = pd.merge(all_data, reservations, on=['Listing', 'Year', 'Month'], how='left')
average_nightly_rates = merged_data.groupby(['Listing', 'Year', 'Month'])['Nightly Rate'].mean().fillna(0)
Grouping scattered, messy data correctly requires simple logical reasoning. This is not something a model solves purely by echoing language patterns it saw during training. Although GPT-4 likely observed similar programs during training, it almost certainly did not see a program matching my exact CSV format and grouping requirements. To me, this indicates that large language models can perform limited innovation and reasoning within the space defined by training programs. Writing such one-off scripts myself is usually a poor use of time; having a model handle them improves my productivity and lets me focus on more important work.
When Things Go Awry: Systems Programming Challenges
While I have had success using large language models for higher-level coding, in C and systems programming I typically use them as a more convenient documentation tool. As a systems programmer, I find models only helpful when they can perform nontrivial reasoning, and in that regard they often fall short.
For example, I asked a model to outline and implement a compact, efficient C Bloom filter, with emphasis on the hashing functions. Here is one such model-generated implementation:
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define N 100000  // Number of elements
#define P 0.05    // False positive probability
#define M (int)(-(N * log(P)) / (log(2) * log(2)))  // Size of bit array
#define K (int)(M / N * log(2))                     // Number of hash functions

unsigned char bitArray[M / 8 + 1];

void setBit(int n) {
    bitArray[n / 8] |= 1 << (n % 8);
}

int getBit(int n) {
    return bitArray[n / 8] & (1 << (n % 8));
}

// Basic hash functions (for demonstration purposes)
int hash1(char *str) {
    int hash = 5381;
    int c;
    while ((c = *str++))
        hash = ((hash << 5) + hash) + c;
    return abs(hash % M);
}

int hash2(char *str) {
    unsigned long hash = 0;
    int c;
    while ((c = *str++))
        hash = c + (hash << 6) + (hash << 16) - hash;
    return abs(hash % M);
}

void add(char *str) {
    setBit(hash1(str));
    setBit(hash2(str));
    // Add more hash functions if needed
}

int is_in(char *str) {
    return getBit(hash1(str)) && getBit(hash2(str));
    // Check more hash functions if used
}

int main() {
    char *testStr = "testString";
    add(testStr);
    if (is_in(testStr)) {
        printf("%s is probably in the set.\n", testStr);
    } else {
        printf("%s is definitely not in the set.\n", testStr);
    }
    return 0;
}
The produced Bloom filter was not great. Bloom filters are common, but implementing a correct, efficient version requires higher-level abstraction: you need a reliable way to derive multiple independent hashes from the same input while ensuring their mutual independence. When I instructed GPT-4 to improve the hash function so that it produced multiple distinct outputs from the same input, it proposed a more reasonable approach:
unsigned int hash(const char* str, unsigned int hash_id) {
    unsigned int hash = 5381;
    int c;
    while ((c = *str++)) {
        hash = ((hash << 5) + hash) + c + hash_id;
    }
    return hash % BF_SIZE;
}
If GPT-4 had arrived at this idea on its own, it would have written the Bloom filter differently, perhaps using a single hash function to set multiple bits at once. GPT-4 can produce suitable hashing strategies, but during the overall Bloom filter implementation it did not show adequate reasoning and instead produced two similar hash functions, an outcome reminiscent of the many low-quality articles written on the subject. The model's limited reasoning ability, combined with the mixed quality of its sources, explains such results.
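To make the point concrete, here is a minimal sketch of one standard way to derive several bit positions from a single key, the double-hashing trick. It is written in Python only because the idea is language-agnostic, and it is my own illustration rather than code produced by any of the models discussed; the constants are assumptions.

# Double hashing: two base hashes of the key generate k bit positions.
import hashlib

NUM_BITS = 100_000   # size of the bit array (assumed for the example)
NUM_HASHES = 7       # number of derived positions

def bit_positions(key: bytes, k: int = NUM_HASHES, m: int = NUM_BITS):
    digest = hashlib.sha256(key).digest()
    h1 = int.from_bytes(digest[:8], "little")
    h2 = int.from_bytes(digest[8:16], "little") | 1  # force h2 odd so strides vary
    return [(h1 + i * h2) % m for i in range(k)]

print(bit_positions(b"testString"))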
This is not an isolated case. My experience with LLMs in algorithms or systems programming is often poor, even though their reasoning in Python can be better. That said, GPT-4 can decompile functions it generated and accurately explain their purpose in other sessions, so claiming it is useless for systems programming is wrong. Model size matters: larger models generally perform better for reasoning tasks.
For example, Mixtral, Mistral's mixture-of-experts model, responded to the same hashing question with:
unsigned int hash(const char* str, unsigned int hash_id) {
    unsigned int hash = 5381;
    int c;
    while ((c = *str++)) {
        hash = ((hash << 5) + hash) + c;
    }
    return hash + hash_id;
}
Appending hash_id to the final hash is a poor strategy. Mixtral is an excellent model for many applications, but for tasks that require strong reasoning, bigger often seems better.
Here is an excerpt from a local conversation with a large model I ran in llama.cpp. Because of RAM limits I quantized the model to 4 bits. Even so, this 34-billion-parameter model demonstrated strong reasoning for this problem. My question was essentially: I have a hash function that should produce N different hashes for the same input, but when I change hash_id the distribution is poor. Is my salting strategy inadequate, and how should I improve it?
unsigned int hash(const char* str, unsigned int hash_id) {
    unsigned int hash = 5381;
    int c;
    while ((c = *str++)) {
        hash = ((hash << 5) + hash) + c;
    }
    return hash + hash_id;
}
The model suggested that simply adding hash_id at the end leads to uneven distribution and recommended mixing hash_id with bit operations. For example, it recommended using XOR during the accumulation and once more at the end to better mix hash_id. The model combined additions and XORs effectively and identified the real issue based on the clues I provided. Such solutions are not accessible through books or simple web searches alone.
However, from my experience over recent months, for systems programming a seasoned developer will often not get satisfactory solutions from LLMs. A real-world example: my recent project ggufflib involved writing a library to read and write GGUF files, the format llama.cpp uses for quantized models. To understand how the quantization encodings work, I initially tried ChatGPT, but in the end I reverse-engineered the llama.cpp code because it was faster. A model that effectively helps systems programmers should be able to reconstruct documentation of a data format given the data structure declarations and the decoding functions. Although the relevant llama.cpp functions are short enough to fit into GPT-4's context, the model's output was not helpful in this case. I fell back on the traditional approach: pencil and paper, reading the code carefully to see how the bits are extracted.
To illustrate, here is a structure from the llama.cpp implementation:
// 6-bit quantization
// weight is represented as x = a * q
// 16 blocks of 16 elements each
// Effectively 6.5625 bits per weight
typedef struct {
    uint8_t ql[QK_K/2];      // quants, lower 4 bits
    uint8_t qh[QK_K/4];      // quants, upper 2 bits
    int8_t  scales[QK_K/16]; // scales, quantized with 8 bits
    ggml_fp16_t d;           // super-block scale
} block_q6_K;
And here is the dequantization function used by llama.cpp:
void dequantize_row_q6_K(const block_q6_K * restrict x, float * restrict y, int k) {
    assert(k % QK_K == 0);
    const int nb = k / QK_K;

    for (int i = 0; i < nb; i++) {
        const float d = GGML_FP16_TO_FP32(x[i].d);

        const uint8_t * restrict ql = x[i].ql;
        const uint8_t * restrict qh = x[i].qh;
        const int8_t  * restrict sc = x[i].scales;

        for (int n = 0; n < QK_K; n += 128) {
            for (int l = 0; l < 32; ++l) {
                int is = l/16;
                const int8_t q1 = (int8_t)((ql[l +  0] & 0xF) | (((qh[l] >> 0) & 3) << 4)) - 32;
                const int8_t q2 = (int8_t)((ql[l + 32] & 0xF) | (((qh[l] >> 2) & 3) << 4)) - 32;
                const int8_t q3 = (int8_t)((ql[l +  0] >>  4) | (((qh[l] >> 4) & 3) << 4)) - 32;
                const int8_t q4 = (int8_t)((ql[l + 32] >>  4) | (((qh[l] >> 6) & 3) << 4)) - 32;
                y[l +  0] = d * sc[is + 0] * q1;
                y[l + 32] = d * sc[is + 2] * q2;
                y[l + 64] = d * sc[is + 4] * q3;
                y[l + 96] = d * sc[is + 6] * q4;
            }
            y  += 128;
            ql += 64;
            qh += 32;
            sc += 8;
        }
    }
}
I asked the model to generate a simplified function demonstrating how the data is stored. The generated function had multiple issues: incorrect indexing, wrong sign-extension when converting 6-bit values to 8-bit, and similar errors. In the end I wrote my own decoder:
if (tensor->type == GGUF_TYPE_Q6_K) {
    uint8_t *block = (uint8_t*)tensor->weights_data;
    uint64_t i = 0; // i-th weight to dequantize.
    while (i < tensor->num_weights) {
        float super_scale = from_half(*((uint16_t*)(block+128+64+16)));
        uint8_t *L = block;
        uint8_t *H = block+128;
        int8_t *scales = (int8_t*)block+128+64;
        for (int cluster = 0; cluster < 2; cluster++) {
            for (uint64_t j = 0; j < 128; j++) {
                f[i] = (super_scale * scales[j/16]) *
                       ((int8_t)
                        ((((L[j%64] >> (j/64*4)) & 0xF) |
                         (((H[j%32] >> (j/32*2)) & 3) << 4)))-32);
                i++;
                if (i == tensor->num_weights) return f;
            }
            L += 64;
            H += 32;
            scales += 8;
        }
        block += 128+64+16+2; // Go to the next block.
    }
}
From the code above I removed the long comment documenting the exact bit layout used by llama.cpp's Q6_K encoding; that comment was the real contribution of the exercise. If a model could reliably produce such documentation from the code, it would be very helpful. I believe this is a solvable problem with appropriate model improvements and context handling.
Reevaluating Programming Work
One fact is evident: much modern programming is minor variations on the same themes. This work does not require deep reasoning, and large language models perform very well on it, though they are still constrained by context length. Programmers should consider whether spending time on such tasks is worthwhile. Although these jobs pay, if LLMs can handle parts of them, they may not be the best career focus over the next five to ten years.
Do models truly perform reasoning, or is it an illusion? Sometimes their output looks like reasoning, but that could be a semantic illusion created by symbol manipulation. Those who study these models closely will recognize that they do more than simple token repetition: pretraining forces them to predict next tokens in a way that builds an abstract internal model. That model may be fragile and incomplete, but empirical evidence suggests it exists. In domains where mathematical certainty is unclear and experts disagree, trusting your judgment seems prudent.
Why Not Use LLMs for Programming?
Asking the right questions of a large language model is a critical skill. The less you practice it, the less effective you will be at using AI to improve your work. Clear communication matters whether you interact with models or with humans. Poor communication is a serious obstacle, and many otherwise competent engineers communicate poorly. Search engines are also less helpful than they used to be, so using LLMs as a compressed form of documentation is often worthwhile. I will continue to use them extensively. I have never enjoyed digging into the obscure details of a communication protocol or the convoluted internals of a library written by someone trying to impress; that has always felt like "useless knowledge" to me. With LLM assistance, I can avoid many of those frustrations and be more productive day to day.