Saturday, August 03, 2013

What are C and C++ pointers? A simple explanation.

The question constantly comes up. It is a sticking point and everyone has a different answer for new C and C++ developers. There was even a stop-motion animation created once upon a time to explain them (Pointer Fun with Binky). They are just one of the reasons that a lot of developers are cutting their teeth, so to speak, on other programming languages (but there are other reasons as well).

C/C++ pointers.

I've given a lot of thought over the years on the best method to explain C/C++ pointers in a way that can be more universally understood. Even if the programming language you are using doesn't have pointers, understanding pointers helps gain an understanding of how the underlying hardware operates behind the scenes to execute code. In other words, they are still relevant to you. A lot of what I'm going to say involves gross oversimplifications of several concepts, but the goal is to understand how pointers work as succinctly as possible.

First, we need a grid:

That grid represents RAM. A generous 64 bytes of RAM. Each byte is called an "address" - kind of like a street address but using only numbers. 64 bytes is more than plenty to demonstrate why pointers exist and are necessary. Let's say our computer program gets the first 32 bytes of RAM to store data. We'll refer to this as the "heap". The other 32 bytes represents the "stack". The heap stores data while the stack stores variables. Let me draw a dividing line to make it easier:

For the purposes of this demonstration, let's use a really simple C program (the same concepts apply to C++'s new/delete):
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
  char *MyStr, *MyStr2;

  MyStr = malloc(9);
  MyStr[0] = 'P';
  MyStr[1] = 'o';
  MyStr[2] = 'i';
  MyStr[3] = 'n';
  MyStr[4] = 't';
  (*(MyStr + 5)) = 'e';
  MyStr[6] = 'e';
  MyStr[7] = '\n';
  MyStr[8] = '\0';
  printf(MyStr);
  MyStr2 = MyStr + 4;
  printf(MyStr2);
  free(MyStr);

  return 0;
}
You may want to: Run the program

Let's assume this is a 32-bit computer. So 32 bits / 8 bits per byte = 4 bytes per pointer. So, in this example, 4 bytes are requires to store a pointer. In the case of the above program, for the sake of this example, it will put the variables (MyStr and MyStr2) on the aforementioned stack. A stack is generally small and used for temporary, rapid access storage purposes (~1 megabyte per thread). The heap is much larger (e.g. could be gigabytes).

Anyway, the stack is manipulated using "hardware registers" that can hold the size of the bit level of a computer (e.g. 32-bits - anyone noticing a theme yet?). Hold on though, what is a "hardware register"?

CPUs are built to perform operations very fast. Once you enter the world of hardware, you start thinking in terms of transistors, NAND gates, and the whole mess involved there to just make technology work. I'm not going to go into that level of detail. What you need to know is that because things get complicated at the hardware level, a CPU needs to operate in such a way that offers the most amount of functionality with the least amount of silicon in the fastest possible fashion. To do this, CPU engineers expose "registers" to end users (i.e. assembler programmers). There are only a handful of registers in a CPU and all computer programs ultimately rely on them. Assuming a 32-bit Intel-compatible CPU, let's see a few of the common hardware registers:

eax, ebx, ecx, edx - generic multipurpose integer storage, 32-bits
esi, edi - index registers, 32-bits, for performing fast memory moves
ebp - stack pointer top for the current function, 32-bits
esp - stack pointer bottom for the current function, 32-bits

As I said, just a handful are available to computer programs, which makes it pretty impressive that software works at all. On 64-bit Intel processors, there are 64-bit registers such as rax, rbx, rbp, rsp, etc. Basically, the 'e' becomes an 'r', but the concepts of how a program works stays the same and the number of available 64-bit registers is still fairly limited. 32-bit is slightly simpler to work with and understand.

So our program starts up, the 'ebp' and 'esp' registers are initialized to the value of 32 by the OS, because that is the location of the start of our stack. The first line of code that matters, char *MyStr, *MyStr2;, increases the value of 'esp' by 8 (2 pointers * 4 bytes per pointer = 8 bytes):

ebp = 32
esp = 40

We now have something that looks roughly like:

MyStr is really the result of the equation "ebp + 0" (32 + 0 = 32). MyStr is human-readable but it eventually has to end up as executable machine code in order for it to be executed by the computer. This is what a compiler does. It translates MyStr references into 'ebp + 0' and maintains that knowledge during the compilation process.

Moving along. In this example, the OS is protecting the first 32 bytes of RAM. Attempting to access that memory without getting permission first will cause the program to crash. Let's represent this important restriction in red:

The next line of code is a call to malloc(). It takes an integer as input, makes a request to the OS to allocate that much space for the program, and then, assuming all goes well, returns a value representing the address of the start of the allocated memory. I'm going to simply ignore how a function call like malloc() is made. It involves pushing and popping registers (ebp, eax, etc.) onto and off the stack but is also not really relevant to understanding pointers. Continuing along, let's say the OS allocates the requested 9 bytes starting at address 20 as a result of the malloc() call:

At this point, let's assume that the value 20, the address returned by malloc(), is stored in the register 'eax'. The next thing that happens is to store that value into memory:

mov [ebp + 0], eax

Which results in:

Remember that 'ebp + 0' is really the variable MyStr. As can be seen, the value 20 has been stored. Note that the image contains a decimal representation rather than the traditional hexadecimal representation. In hex, it would appear as "14 00 00 00" (1 * 16 + 4 = 20). I figure that just spelling it out is easier to digest. The number is probably backwards from what you might expect. I did mention earlier that this is an Intel CPU. Intel uses the "little endian" format to store values in RAM (mostly due to historical reasons). There are other popular CPUs that use "big endian" format, which would store the number, in hex, as "00 00 00 14". The takeaway is that computer technology is supposed to be fast, not necessarily human-readable.

The next few lines "dereference the pointer". The assembly code might look like:

mov eax, [ebp + 0]
mov [eax + 0], 'P'
mov [eax + 1], 'o'
mov [eax + 2], 'i'
mov [eax + 3], 'n'
mov [eax + 4], 't'
mov [eax + 5], 'e'
mov [eax + 6], 'e'
mov [eax + 7], '\n'
mov [eax + 8], '\0'

The register 'eax' gets the value of 20 from the 4 bytes of memory stored at [ebp + 0]. Next, 'P' gets stored at address 20, 'o' gets stored at address 21, and so on.

Which results in:

When someone is asking "How do pointers work?" they really are asking, "How does the computer and my program know where my data really is?" The value stored in MyStr in this example is an address to another location in RAM. The program knows where MyStr is stored because of the value stored in the 'ebp' hardware register. The lack of understanding how the hardware works is why the question is so frequently asked.

I broke out the first 'e' being assigned (i.e. (*(MyStr + 5)) = 'e';) as being different from the rest of the code to demonstrate exactly how so-called 'arrays' really work since most new C/C++ programmers learn about arrays early on. Note that the assembler output is identical (i.e. mov [eax + 5], 'e'). All an array access does is add one value to another to get a new address (20 + 5 = 25) and then dereferences it. MyStr[5] is a shortcut to the long-form of (*(MyStr + 5)). Dereferencing is a fancy way of saying "access the value at this location in memory" and the operation either retrieves it from memory into a register or stores a value into memory either from a register or a hardcoded value. In this case, the code is storing the zero-terminated string "Pointee\n" by storing each character individually.

The next line calls printf() (dangerously - don't do this) and outputs the zero-terminated string that was stored earlier. For this example, what is passed to printf() is the value 20, not the actual string. printf() then dereferences, starting at 20, and outputs characters until it runs into ASCII character 0 (hence the name zero-terminated string).

Take some time to digest everything so far. It is a lot to take in. Compare the source code to the assembly code to what happens on the grid. Everything you need to know about pointers is there. What follows is just more of the same.

The next line of code assigns MyStr2 the value stored in MyStr + 4:

mov eax, [ebp + 0]
add eax, 4
mov [ebp + 4], eax

Which looks like:

The next line makes a printf() call using the value of 24 instead of 20.

The next line of code calls free(). The value 20 is passed to free(), which, in turn, releases the memory to the OS so that it can be used for other purposes.

If you look carefully, you can still see the value that was stored and the address 20 is still stored at 'ebp + 0' and 24 is still stored at 'ebp + 4'. And, depending on the compiler library that manages memory, that memory may still be readable, but you should never attempt to access such memory because random crash bugs will crop up. Memory management behind the scenes gets weird and therefore some crash bugs become very difficult to track down.

And that, my friends, is how pointers work.

Disclaimer (for nerds): I realized, after I had generated all the images and most of this post, that I had messed up how stacks really work on Intel hardware. Stacks grow down, not up, which means everything should have been 'ebp - 0' and 'ebp - 4', but I contest any naysayers that being 100% accurate would have resulted in more confusion. The whole little endian vs. big endian thing was actually a tossup, but, if someone is going to be working with pointers, they'll eventually care about that sort of thing, whereas knowing how a stack grows is a lot less important. And I was too lazy to regenerate all of those images.

No comments:

Post a Comment