Bits, Bytes & Code Points

By Rathish Poovadan posted Thu August 02, 2018 08:52 PM

  
While learning Python, you may have come across terms like bytes, encoding and decoding. If you were overwhelmed by those terms, below is an explanation in (hopefully) layman's terms. This is the first in a multi-part series that I plan to write on this subject.

Bits, Bytes & Code Points

You might have heard, while learning about computers, that they understand just two characters - 0 & 1. If you have ever wondered how we are still able to type in all the numbers, letters, accented characters or even those wonderful emojis these days, this is the article you need to read.

To understand this in depth, I have to take you down memory lane - literally, the computer memory. Zooming past the tera-, mega- and kilobytes, let's focus on the most basic part of memory - a single byte. Without getting too technical, let me put forward the idea that the electronic component (hardware) for memory consists of tiny flip-flop switches that can be turned on or off (just like a household switch that turns the lights on or off). For ease of communication, we represent the on state as 1 and the off state as 0.

Consider each such switch to be a 'bit'. Eight such bits grouped together are called a 'byte'. Historically, while designing the architecture for IBM's 7030 project (and later models), the designers made a table of all the common characters (letters, numbers, symbols etc.) in use on the typewriters of the day, along with some other symbols. They then assigned each one a unique combination of binary digits (digits that can take only two values - 0 & 1). They found that a combination of eight bits could represent all the characters in use then, with room left over for future additions. More importantly, 8 bits allows two 4-bit groups (nibbles) to be used for calculations. The use of 8 bits for a byte continues even today!

The table on the left is an actual screenshot from the design manual of the IBM 7030. Remember that every bit can represent 2 values (1 or 0). Two bits can represent 4 values (00, 01, 10, 11), three bits can represent 8 values (2 to the power of 3), and so on. In that way, eight bits can represent 2 to the power of 8, or 256, different values. The idea is simply to assign every human-readable character an index number which can be represented in binary form. So, on an IBM 7030 machine, if we type the letter 'a', which is assigned 44 (decimal), the system stores it as 00101100.
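
To see this character-to-number mapping on a modern machine, Python can show it directly. Note that Python works with today's ASCII / Unicode values (where 'a' is 97) rather than the IBM 7030 table above, so the numbers differ - a quick sketch:

```python
# Map a character to its numeric code and back, then show the 8-bit binary pattern.
# Note: Python uses ASCII / Unicode values ('a' -> 97), not the IBM 7030 table.
ch = 'a'
code = ord(ch)              # character -> number: 97
print(code)                 # 97
print(format(code, '08b'))  # 01100001 - the eight bits the machine actually stores
print(chr(code))            # number -> character: back to 'a'
```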

What we just did above is called 'encoding'. Remember this term, as we'll be using it frequently. Meanwhile, the American National Standards Institute (then known as the American Standards Association, or ASA) started working on a new teleprinter / telegraph code called ASCII.
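
In Python, this text-to-bytes step is exactly what str.encode() does, and decode() reverses it - a minimal sketch using the ASCII table:

```python
# Encoding turns text (characters) into bytes (numbers) using a chosen table.
data = 'abc'.encode('ascii')
print(data)                  # b'abc'
print(list(data))            # [97, 98, 99] - the numeric codes actually stored
print(data.decode('ascii'))  # 'abc' - decoding maps the numbers back to text
```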

The Era of Incompatibility

Soon after computers gained prominence for general computing, the need for computers to handle more human-readable characters increased. The ASCII codes mainly catered to American English users. ASCII was one of the most famous character mapping systems - all the common characters, including lowercase and uppercase English letters, numbers, punctuation marks and other special characters, were packed into it. Overall, it had unique codes for 128 characters. One drawback, of course, was that it didn't represent accented Latin or other European characters.
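
That 128-character limit is easy to see in Python today - ASCII happily encodes plain English text but refuses anything outside its table:

```python
# ASCII has only 128 code values (0-127), so accented characters cannot be encoded.
print('Hello!'.encode('ascii'))   # b'Hello!' - works fine
try:
    'café'.encode('ascii')
except UnicodeEncodeError as err:
    print(err)                    # 'ascii' codec can't encode character '\xe9' ...
```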

The International Organization for Standardization (ISO) later extended ASCII to include additional characters. This was possible because there were only 128 ASCII codes, which can be represented by 7 bits (2 to the power of 7), while machines were using memory with 8 bits to a byte. That means each byte of memory can represent one of up to 256 characters (2 ^ 8 = 256). With ASCII taking only 128 of those, there was space left for 128 more characters. The ISO came up with the ISO 8859 family of character sets, wherein, along with the standard ASCII characters, they provided variations that accommodated Latin, Cyrillic, Arabic, Hebrew etc.
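
Those extra 128 slots are where characters like 'é' live in ISO 8859-1, which Python calls 'latin-1' - a small sketch:

```python
# ISO 8859-1 ('latin-1') keeps ASCII in 0-127 and puts accented characters in 128-255.
data = 'café'.encode('latin-1')
print(data)        # b'caf\xe9' - still one byte per character
print(list(data))  # [99, 97, 102, 233] - 233 sits in the upper half of the byte
```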

At around the same time many other standards emerged to encode characters, like CP437, Windows-1252 and Mac OS Roman for English / European variants. Similarly, encoding systems for other parts of the world emerged as well, e.g. ISCII for Indian text, JIS X 0208 for Japanese, GB 2312 for Chinese, KS X 1001 for Korean etc.

The fundamental issue with so many encoding systems is that if you encode text in one format, a system will not show it properly unless the right encoding is selected. This generally was not a problem in the early days. However, after the advent of the Internet and the World Wide Web, it became a real problem, and there was no way a single document could handle multiple encodings simultaneously. Literally, it was like carrying a bunch of keys - you need to know which key works for which lock!
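
Python ships with many of these legacy codecs, so you can see what using the wrong key looks like - the very same byte decodes to completely different characters depending on the encoding you pick:

```python
# One byte, three 'locks': the value 0xE9 means something different in each encoding.
raw = b'\xe9'
print(raw.decode('latin-1'))     # é  (ISO 8859-1, Western European)
print(raw.decode('iso-8859-5'))  # щ  (ISO 8859-5, Cyrillic)
print(raw.decode('cp437'))       # Θ  (the old IBM PC code page)
```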

The United Era!

Although the Unicode Consortium was incorporated in 1991, the work began as early as 1988, when a few of the founding members joined hands to create what became Unicode, or in their words a "unique, unified and universal encoding". I will shortly be publishing an article exclusively on the workings of Unicode. Before that, however, a brief good-to-know intro will be apt. The Unicode way of representing and presenting text is divided into two parts. One, assigning a number (technically a code point, as in 'U+1F603' for a smiley) to each character ever used by any human being (okay, I agree that was too dramatic!). And two, facilitating the encoding of that character into one or more bytes so that the computer can understand (transform) it the way it understands best - 0s & 1s!
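
Python 3 strings are built on exactly this model, so you can poke at both halves directly - a short sketch using the very smiley mentioned above:

```python
# Part one: every character gets a code point.
smiley = '\U0001F603'          # the smiley assigned code point U+1F603
print(hex(ord(smiley)))        # 0x1f603
print(chr(0x1F603))            # 😃

# Part two: an encoding (here UTF-8) turns that code point into bytes.
print(smiley.encode('utf-8'))  # b'\xf0\x9f\x98\x83' - four bytes for one character
```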

