Introduction to hexadecimal, binary and base-n numerical representation systems

This text will help you understand binary-based numbering systems (like binary and hexadecimal) and how to convert numbers between them and the decimal system.

If you just want to see the algorithms, read the second-to-last section for natural numbers and the last one if you also want to consider the whole range of integer numbers.

For the rest of the document, the first, introductory section covers the concepts used to talk about numeric representations and a bit of the history behind them; it can help you better understand the math we will introduce later, but you can skip it if you want. The second one introduces how the different binary representations of numbers work and how to represent numbers in them. The third one covers some mathematical concepts behind different base representations. The fourth one introduces the algorithms to convert numbers in the natural (ℕ) domain between bases. And finally, the fifth one explains how to work with numbers in the integer (ℤ) domain, including negative numbers.

Introduction

In modern society, we are so used to numbers that most of us perceive them as a self-evident part of the world. For example, if you see 2026 you may guess it refers to the current year, and 42 is the answer to the ultimate question of life, the universe and everything.

But the same number can be represented in different ways. For example, 42 can be written in many other ways, among them:

  • In Tally notation (unary): 𝍸𝍸𝍸𝍸𝍸𝍸𝍸𝍸𝍪
  • In Egyptian (base 10): 𓎉𓏻
  • In Chinese Xiǎoxiě (base 10): 四十二
  • In Indus valley Script (believed to be base 10): 𓏻𓎇𓎇
  • In Aegean (base 10): 𐄓𐄈
  • In Old Italic (base 10): 𐌢𐌢𐌢𐌢𐌠𐌠
  • In Roman (base 10): XLII
  • In Mayan (base 20): 𝋢𝋢
  • In binary (base 2): 101010
  • In octal (base 8): 52
  • In hexadecimal (base 16): 2a
  • In duotrigesimal [0-9a-v] (base 32): 1a
  • In tetrasexagesimal [0-9a-zA-Z\-_] (base 64): G
  • And even with the corresponding ASCII character (base 128): *

In fact, depending on which culture you grew up in you may convey the same amount in different ways. How did we end up with this situation?

Number of different digits (base)

It is believed that during prehistory humans used their fingers (and maybe other body parts, like the Oksapmin) to keep track of quantities. This system works reasonably well for small amounts but becomes unfeasible for larger ones. To be able to reason about larger amounts, humans started grouping items together and reasoning about groups of items instead. For example, a person could make groups of 5 animals, count how many groups there were, and then count how many animals were left over after forming the groups. With groups in place it was easy to have groups of groups, and so on, to reason about even larger numbers.

To keep track of larger amounts, humans started making small marks (representing fingers) on objects like sticks and bones. These tally marks representing fingers (and whole hands) eventually evolved into the ancient written numeral systems. Because of this use of fingers and hands, the majority of ancient numerical systems have 4, 5, or 10 different digits, although exceptions exist, like Hawaiian with 12 numbers, the Oksapmin using 27 body parts, or Sumerian, which combined groups of 10 with groups of 6 for a total of 60 different digits.

In computer science and mathematics the number of different digits is usually called the base of the system.

Conveying larger numbers (sign vs place-value)

Most of those archaic systems used sign-value notation, where different symbols represented different group sizes and the values of the symbols were added together. For example, the Egyptians, with one of the oldest known written numerical systems, would write 42 as 𓎉𓏻: four 𓎆 symbols each representing 10 and two 𓏺 symbols each representing 1. In some systems certain numbers were subtracted from a larger group: in Roman numerals 42 would be written as XLII, where the 10 represented by X is subtracted from the 50 represented by L to obtain 40, to which the 1s represented by each of the two Is are added. Chinese numerals also had a multiplicative notation, where a numerical multiplier was attached to the digits to state the final number. For example, for 42 Chinese Xiǎoxiě would write 四十二, meaning four times ten plus two (times one). These systems were fairly popular because they made simple arithmetic like addition and subtraction as easy as grouping together symbols of the same kind (occasionally replacing a group of smaller symbols by a single symbol of the next larger group when adding, or performing the opposite substitution when subtracting).

Unfortunately, all of these systems (and the majority of the early systems used by humans) needed to add new symbols whenever they had to represent larger amounts. To address this, Babylonian scholars improved on the Sumerian system they had inherited and created a place-value (or positional) notation, where the position of a symbol within the written number denoted the size of the group it referred to. This allowed them to write numbers as large as their writing media could fit, instead of being limited by the number of different symbols available. The Babylonian system had up to 60 different digits, made by combining up to nine 𒐕 symbols for the units of the digit with up to five 𒌋 symbols for its tens. To indicate that there were no items at a specific position, most Babylonian scribes would simply leave an empty space, although in some cases different placeholders appear in its place: a common one was 𒑊, although there is a tablet where the scribe used 𒌍 instead. For example, the prime number 3607 would be written as 𒐕 𒐛 (i.e. one times 3600, an empty space indicating 0 times 60, and then a 7). Unfortunately, those Babylonian scholars also omitted the final “0”s at the end of a number, since they could usually be inferred from context, so 𒐕 𒐛 could also mean 216420, 12985200, or 3607 multiplied by any other power of 60.

Chinese mathematicians also had a similar notation, called Suànchóu, based on counting rods. To distinguish units from tens they would place the rods vertically or horizontally. When reaching five they would replace the five rods by a symbol similar to a T: 𝍥 (inverted) for the units and 𝍮 (not inverted) for the tens. They would therefore have a total of up to 100 digits (using an empty space when there were no items in a group). Thus 42 would be written as 𝍬𝍡 and 1337 as 𝍩𝍢𝍫𝍦 (1 thousand, 3 hundreds, 3 tens and 7 units). As in the Babylonian system, since the final zeroes were represented as empty spaces, they could also be missed when writing.

This notation was also invented independently in Mesoamerica and used in Mayan texts. Mayans wrote vertically, putting the largest numbers at the top. They used up to four instances of the symbol 𝋡 for the units and up to four instances of the symbol 𝋥 for the fives. The Mayans also had an oval-like symbol (represented in different ways) which they used when a digit had no items: 𝋠, although like the Babylonians they used other placeholders too. Hence, the prime number 407 would be written as: 𝋡𝋠𝋧

meaning 1 times 400, 0 times 20 and 7 units. Unlike the Babylonian and Chinese counting rod systems, the Mayan system did write the placeholder 𝋠 at the end of their numbers, in order to indicate which power of twenty the amounts had to be multiplied by.

In computer science and mathematics notations are nowadays always place-valued (also called positional), but as you can see, that was not always the case.

Which digit comes first (endianness)

In purely sign-value, addition-based numeric systems like the Egyptian one, the order in which the digits were presented did not matter. Hence 42 could be presented as any combination of four 𓎆 (a symbol similar to ^) and two 𓏺 (a symbol similar to I). Thus both ^^^^II and II^^^^ represent 42, but so would, for example, ^II^^^. This is because each instance of a symbol contributes the value it represents.

Nevertheless, numbers were usually represented in the same order they were spoken. Etruscans, and later Romans, used to represent certain numbers by removing a smaller number from a larger one, and therefore used the position of the smaller number to denote whether it should be added or subtracted. See for example the Etruscan numbers 𐌠𐌠𐌠𐌢𐌢 (17) and 𐌢𐌢𐌠𐌠𐌠 (23), where the 𐌠s on the left subtract one each from the 𐌢𐌢 but add one each when placed on the right. Multiplicative systems like the Chinese Xiǎoxiě had a similar property, in that placing a number before or after the multiplier digit would change whether it was multiplied or not.

Similarly to the last two systems, place-value notations strongly depend on the order in which symbols are placed. For example, even with the different forms the Chinese Suànchóu system used for alternating positions, the number 𝍠𝍪𝍢 could mean 123 if read from left to right and 321 if read from right to left.

Usually we present the most meaningful digit first when talking, and therefore do the same in writing. But even then, opinions on what is most meaningful change between cultures. For example, certain cultures (most European ones among them) consider the day the most meaningful part of a date, writing dates as the number of the day, then the number of the month, and finally the number of the year. Others (for example China, Korea or Japan) consider the year the most meaningful and write it first, followed by the month and the day. And then there are some (for example the United States) which consider the month the most meaningful and follow it with the day and then the year.

For computers, the ordering of digits plays a similar role. Although they usually work with words (groups of bits), each of which represents a digit, the order in which those words have to be interpreted also matters. In this case, what matters is which digit comes first and, given that computers number each of their memory positions, we usually consider the positions with the smallest addresses to come before those with larger ones.

Hence we use the word endianness to refer to which end of memory each digit is placed at: the beginning or the end (which is where the word endian comes from). When the smaller digits come first, we talk about little endian, while when the larger digits come first we call it big endian. For example, in decimal, a 4 followed by a 2 would convey forty-two in big endian but twenty-four in little endian.
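As an illustration, Python's int.to_bytes lets us inspect the byte order a value is laid out in; this sketch shows the same 16-bit value written both ways:

```python
# Inspecting byte order with Python's int.to_bytes.
value = 0x1234  # two bytes: 0x12 (larger) and 0x34 (smaller)

big = value.to_bytes(2, byteorder="big")        # larger byte first
little = value.to_bytes(2, byteorder="little")  # smaller byte first

print(big.hex())     # 1234
print(little.hex())  # 3412
```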

The terms little and big endian originated from Jonathan Swift’s book Gulliver’s Travels. In the book, the people of Lilliput had religious quarrels over how to interpret the rule that “boiled eggs should be broken on the convenient side”. The faction of the Big-Endians broke their boiled eggs at the larger end; meanwhile, the Little-Endians broke them at the smaller one. According to the book, those quarrels caused “six rebellions… wherein one Emperor lost his life, and another his crown”.

As in the books, in the computer science world there are similar arguments about which way is the best one to place the digits in computers. And, as is the case with eggs, what is the best way depends on many different factors each having different algorithmic and hardware trade-offs.

Historically, most business computers used big-endian representation, which is the reason many network protocols also use it. The PDP-11 instead used little endian for two-“digit” numbers and a mix of little and big endian for four-“digit” ones. Finally, little-endian architectures became more popular with the Intel 8008 and the many architectures that followed it. In fact, the device on which you are reading this text is most likely little endian, although most of the devices on the Internet this text travelled through on its way to you are instead big endian.

Almost all computer architectures have a preference for one endianness or the other (for example, RISC-V, which is used in this course, was originally designed to be little-endian), although many of them, including RISC-V, can be configured to use either.

The modern decimal system

The numerical system we use today is, like the two above, positional, but uses base 10. The positional scheme may have originated from the Chinese Suànchóu (units: 𝍠 𝍡 𝍢 𝍣 𝍤 𝍥 𝍦 𝍧 𝍨, tens: 𝍩 𝍪 𝍫 𝍬 𝍭 𝍮 𝍯 𝍰 𝍱, and zero being an empty space), while the digits may derive from the originally sign-valued Brahmi numerical system (𑁒𑁓𑁔𑁕𑁖𑁗𑁘𑁙𑁚). Both may have been combined by Hindu mathematicians (𑁦𑁧𑁨𑁩𑁪𑁫𑁬𑁭𑁮𑁯) between the 1st and 4th century. The system then spread to Arab mathematicians (०۱۲۳۴۵۶۷۸۹) in the 9th century, who expanded it with the decimal point we use for decimal fractions. From there it spread to Europe during the Middle Ages and, with the spread of printing, was eventually standardized as the digits we are used to seeing (0123456789).

To read more about positional systems and the history behind our numeration system, Boyer’s Zero: The Symbol, the Concept, the Number is a good place to start.

Base-2 numerical systems

Many different cultures made use of binary-like systems before they were formalized in mathematics, including the Egyptians, Chinese, Etruscans, Greeks, Hindus and Yoruba (in Africa). Similarly, many mathematicians used them in some way or another, including Ramon Llull, Francis Bacon, John Napier, Thomas Harriot and Juan Caramuel y Lobkowitz. Nevertheless, it was through Gottfried Leibniz’s works that the binary arithmetic we use nowadays became popular.

Currently, most modern computers operate using base 2: that is, each value is represented as either a 0 or a 1. This means that numbers also need to be represented using strings of those two digits.

BCD

One simple way to map our base-10 numbers (digits 0 to 9) was to represent each digit using 4 bits (covering the values 0 to 9 and discarding the remaining 6 possible values). This is known as binary-coded decimal (BCD) and was a popular, although inefficient, way of representing and operating on numbers, for example on the Busicom calculators which inspired the Intel 4004. In this system, 42 would be represented in big endian (most significant digit first) as 0100 0010 (the first group of four bits representing the digit 4 and the second one the digit 2).
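The BCD encoding described above can be sketched in a few lines of Python (the helper name to_bcd is ours, not a standard API):

```python
def to_bcd(n):
    """Encode a non-negative decimal number as big-endian BCD,
    one 4-bit group per decimal digit."""
    return " ".join(format(int(digit), "04b") for digit in str(n))

print(to_bcd(42))  # 0100 0010
```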

Unfortunately, BCD representation did not make full use of the resources that computers had. Thus numbers were usually converted from our base-10 representation into strings of 0s and 1s that made more efficient use of the expensive memory available on early base-2 computers. For example, 42 can be represented in 6 bits instead of 8 as 101010.

Binary

Compared to BCD, binary is a more efficient way to represent numbers using only two different digits. In binary each individual digit (or bit) is either a one or a zero, usually with the bit conveying the largest magnitude placed to the left (first when reading from left to right). For example, for the number 42 the digits 101010 represent the magnitudes 32, 8 and 2, which added together give 42.

C offered no way to write numbers in binary until the 0b prefix was standardized in C23, although other languages, like Python, had one earlier. In Python, to prevent confusion with normal decimal numbers, the digits are prefixed by 0b. Thus in Python 42 in binary is written as 0b101010.

Octal

Although strings of single bits work really well for computers, humans are better suited to writing down values and thinking about them using a larger set of possible digits. Octal uses 8 of the 10 digits we use nowadays (01234567). Octal representation of data on computers became popular with the advent of 36-bit computers in the 1950s, which tried to provide the same precision as the 10-digit electro-mechanical calculators popular before them. These systems also used 6 bits to represent single alphanumeric characters. Consequently, being able to split words into groups of three bits made a lot of sense, making octal fairly popular at the time.

In octal we group the digits of the binary string in groups of three. For example, for 42 (101010) we would split the bits into 101 (5) and 010 (2), thus becoming 52. When the number of bits is not a multiple of 3, we add extra zeros to the most significant group. For example, with the decimal number 1337 (10100111001): we would split it as 10 100 111 001, then add an extra 0 making it 010 100 111 001, and finally convert each group of three bits into the digit it represents, thus making 2471.
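The grouping described above can be sketched in Python (binary_to_octal is an illustrative helper, not a library function):

```python
def binary_to_octal(bits):
    """Convert a binary digit string to octal by grouping bits in threes,
    padding the most significant group with zeros."""
    bits = bits.zfill((len(bits) + 2) // 3 * 3)  # pad to a multiple of 3
    return "".join(str(int(bits[i:i + 3], 2)) for i in range(0, len(bits), 3))

print(binary_to_octal("101010"))       # 52
print(binary_to_octal("10100111001"))  # 2471
```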

To convert octal numbers back to binary, all that needs to be done is to write each digit using the three binary digits it represents. For example, the number 1337 in octal would be converted back to binary as 001 (1) 011 (3) 011 (3) 111 (7), thus becoming 001011011111 (735 in decimal).

To distinguish octal numbers from decimal ones, octal numbers carry a marker denoting them as such. In C this is done by prepending a 0 (so 01337 would represent 735 and 052 would be 42). To avoid confusion with decimal numbers purposefully prefixed with a 0, other languages use other notations; for example Python uses 0o, making those numbers 0o1337 and 0o52 respectively.

Hexadecimal

Hexadecimal was first used as a word by the Lund University computer scientist Carl-Erik Fröberg; nevertheless, the concept existed under different names before that. As computers using word sizes that were multiples of 8 bits, instead of 6, became more popular, so did hexadecimal representation. Unlike octal, hexadecimal representations needed extra symbols to represent the numbers 10 to 15. The symbols varied over the years and systems, although the ones we use in computers nowadays (0123456789ABCDEF), that is digits 0 to 9 followed by letters A to F, first appeared in 1966 with the publication of the Fortran IV manual.

Other than using 6 more digits (A being 10, B being 11, C being 12, D being 13, E being 14 and F being 15), hexadecimal systems work almost the same as octal but with groups of 4 bits.

So, if we wanted to convert 42 (101010) to hexadecimal we would make groups of 4 bits: 10 1010, then add extra zeros to the most significant group: 0010 1010, and then choose the appropriate digit depending on the amount each group represents: 0010 being 2 and 1010 being A (10). Thus we get 2A. Similarly, for 1337 (10100111001), we would split the groups: 101 0011 1001, add zeroes: 0101 0011 1001, and convert each into the digit it represents: 539.
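The same grouping, this time four bits at a time, can be sketched in Python (binary_to_hex is an illustrative helper):

```python
def binary_to_hex(bits):
    """Convert a binary digit string to hexadecimal by grouping bits in fours,
    padding the most significant group with zeros."""
    bits = bits.zfill((len(bits) + 3) // 4 * 4)  # pad to a multiple of 4
    return "".join("0123456789abcdef"[int(bits[i:i + 4], 2)]
                   for i in range(0, len(bits), 4))

print(binary_to_hex("101010"))       # 2a
print(binary_to_hex("10100111001"))  # 539
```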

To convert from hexadecimal to binary we do the same process in reverse. Hence 1337 would be 0001 0011 0011 0111 making the final number 0001001100110111 (4919).

In C and most other programming languages hexadecimal numbers are written prefixed by 0x to distinguish them from decimal numbers. Hence 4919 would be written as 0x1337 and 42 as 0x2a.

Mathematical concepts behind single base positional systems

In positional systems with a single base (like decimal, binary, octal and hexadecimal), all numbers present a pretty clear mathematical structure. Given the base b (10, 2, 8 and 16 respectively) and the different digits dᵢ with i from 0 to n, the number written down as dₙdₙ₋₁…d₁d₀ can be calculated as ∑dᵢbⁱ for i between 0 and n. Let’s see this in action:

Decimal 42 would become 4·10¹+2·10⁰=40+2=42. In binary, 0b101010 would be 1·2⁵+0·2⁴+1·2³+0·2²+1·2¹+0·2⁰= 1·32+0·16+1·8+0·4+1·2+0·1=32+8+2=42. With octal, 0o52 would be 5·8¹+2·8⁰=40+2=42. And with hexadecimal, 0x2a (2 with the digit a being 10) would be 2·16¹+10·16⁰=32+10=42.
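These worked examples can be checked directly in Python; positional_value is an illustrative helper that takes the digits most significant first:

```python
def positional_value(digits, base):
    """Compute the sum of d_i * base**i, with the digits given
    most significant first."""
    n = len(digits) - 1
    return sum(d * base ** (n - i) for i, d in enumerate(digits))

print(positional_value([4, 2], 10))             # 42
print(positional_value([1, 0, 1, 0, 1, 0], 2))  # 42
print(positional_value([5, 2], 8))              # 42
print(positional_value([2, 10], 16))            # 42
```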

Another important property to remember is that, in a base b with a number of digits n, each number small enough to fit can be represented in only a single way and, similarly, each representation in that base denotes only one amount. This means there is only one way to represent forty-two in decimal (42), binary (101010), octal (52) and hexadecimal (2a). You can pad the number to the left with 0s, or remove such padding: those leading zeros convey no meaning and are therefore not counted.

Algorithms for base conversion

With all of this in place we have only one question left to answer: how can we convert numbers to and from any arbitrary base?

In the prior section we already presented a simple algorithm to convert numbers from any base using powers, and it is easy to see how, using integer division by bⁱ and the remainder modulo b, we can also obtain the different digits representing a number in base b. Unfortunately, these algorithms are very inefficient. So, can we write something faster?

To convert dₙdₙ₋₁…d₁d₀ from base b into the one we use for our arithmetic, we can perform the following process:

  1. Start with an accumulated value of 0
  2. For each i from n down to 0:
    1. Multiply the accumulated value by b
    2. Convert dᵢ into our arithmetic base and add it to the accumulated value
  3. The final number will be in the accumulated value
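The steps above (one multiply and one add per digit) can be sketched in Python; from_base and its digit alphabet are our own illustrative choices:

```python
DIGITS = "0123456789abcdefghijklmnopqrstuvwxyz"

def from_base(text, base):
    """Convert a digit string in the given base to an int,
    processing digits from most to least significant."""
    value = 0
    for ch in text:
        value = value * base + DIGITS.index(ch)  # shift left one digit, add d_i
    return value

print(from_base("101010", 2))  # 42
print(from_base("52", 8))      # 42
print(from_base("2a", 16))     # 42
```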

Similarly, to convert a number N in our arithmetic into dₙdₙ₋₁…d₁d₀ in base b we can do as follows:

  1. Make i equal to 0
  2. While N is greater than 0:
    1. Make dᵢ equal to the remainder of the division of N by the base b
    2. Make N equal to the floor of the division of N by the base b
    3. Make i equal to i plus 1
  3. Return the string of digits dₙdₙ₋₁…d₁d₀

For a computer, we can rewrite this second algorithm more efficiently as follows:

  1. Create a list L of digits
  2. While N is greater than 0:
    1. Append the remainder of the division of N by the base b to L
    2. Make N equal to the floor of the division of N by the base b
  3. Return the items in L in reverse order
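The list-based variant can be sketched as follows (to_base is an illustrative helper; this version handles bases up to 16):

```python
def to_base(n, base):
    """Convert a natural number to its digit string in the given base
    by collecting remainders and reversing them at the end."""
    if n == 0:
        return "0"
    digits = []
    while n > 0:
        digits.append("0123456789abcdef"[n % base])  # least significant first
        n //= base
    return "".join(reversed(digits))

print(to_base(42, 2))     # 101010
print(to_base(1337, 8))   # 2471
print(to_base(1337, 16))  # 539
```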

Negative numbers

When writing programs, we can usually represent negative numbers by prepending a minus sign, for example -42 or -0x2a.

How these numbers are represented inside the computer varies from architecture to architecture and depends on the data type. For example, negative numbers could be represented by setting the most significant bit to 1 if negative or 0 if positive, in a simple sign-value approach. Using 8 bits, 42 in binary would be 00101010 and -42 would be 10101010.

Other approaches exist to represent negative numbers. For example, one's complement also uses the leftmost bit as 0 for positive numbers and 1 for negative numbers but, in addition, negative numbers have all their bits flipped. Thus 42 would still be 00101010 but -42 would become 11010101. In this representation the absolute magnitude of the original number is recovered by flipping all of the bits when the most significant bit is 1.

A final approach is called two's complement. It is similar to one's complement but also prevents the number 0 from being representable both as 0 (all bits zero) and -0 (all bits one). In two's complement, positive numbers are left as is but negative numbers have all their bits flipped and then the number 1 added to them. Thus 42 would still be 00101010 but -42 would become 11010101+1=11010110. Obtaining the absolute magnitude works almost the same way: if the most significant bit is zero we leave the number as is, but if it is one, we subtract one from it and then flip all of its bits. Notice that, because there is one more number in one direction than in the other, the smallest possible negative number (for example, for eight bits, 10000000 or -128) would become itself after applying the algorithm. This causes undefined behavior in C, where the unary - operator over a signed number returns another signed number.
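The two's-complement encoding can be illustrated with a short Python sketch; twos_complement is our own helper, which masks the number to the requested width (for negative inputs this matches the flip-and-add-one rule):

```python
def twos_complement(n, bits=8):
    """Return the two's-complement bit pattern of n at the given width."""
    return format(n & ((1 << bits) - 1), f"0{bits}b")

print(twos_complement(42))    # 00101010
print(twos_complement(-42))   # 11010110
print(twos_complement(-128))  # 10000000 (its own negation at 8 bits)
```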

For numbers in the integer domain, the majority of modern computer architectures use two's complement representation internally; this includes RISC-V. Numbers in the floating-point domain usually use more complex representations, so we will not cover them here.
