Strings, Encoding and Unicode

Since the advent of personal computers and desktop publishing, more and more people typeset fonts on a screen. This practice was once confined to a professional category. Nowadays typesetting is ubiquitous. It is so widespread that writing and typesetting are almost considered the same thing. The differences between the two realms are important and a visual designer should be aware of them.

A broad notion of Writing (do not mistake Calligraphy for Writing) could be considered the visual arrangement of language on a support to convey meaning. It is indeed a very broad definition. Somehow, even performative arts fit it. Writing implies the use of any tool on any kind of surface, as long as it arranges visual language. What about Braille notation? That is Writing too, but it is a bit out of the scope of this manual.

The definition of Typography is narrower. Typography is the arrangement of visual language using elements especially designed for reproduction and reuse: types. These types can be made of lead, film, bytes. It doesn’t matter as long as they are prefabricated. So, Typography is definitely a subset of Writing. Calligraphy or Lettering are other subsets of Writing and they share with Typography the fact that they all deal with letter shapes, but in very different ways.

The Invention of Movable Type

Movable type technology made its first appearances in Asia around 1000 AD, hundreds of years before Europe. Around 1450, Johannes Gutenberg, a goldsmith based in Mainz, introduced the metal movable-type printing presses in Europe starting the Printing Revolution.

Think about it: if you ask yourself what actually movable type's pioneers invented, you will not easily find an “object” to imagine. Presses, paper, ink, metal stamps, writing, books: they were already there. Can we name more precisely their invention? Of course: they made the continuous surface of Writing discrete, which means consisting of separated elements.

We could state that making something discrete is defining an atom or a set of rules which will reduce substantially the possibilities allowed by the system: from the widest infinite set of options to a narrower infinite set of options. It means setting one or more thresholds that will define unambiguously which information will be discarded or not. Take for example the scale plans in architecture: the 1 to something value defines the threshold in representation which sets the amount of information and detail conveyed by the image.

How could a loss of detail and options be a good thing? Well, the idea is to discard what we don’t need to achieve a certain goal, in order to make our process more efficient. Fewer options mean quicker transformations, fewer data to transfer, quicker testing, and so on.

In different words, movable type reduces the inexhaustible possibilities of Writing to speed up the arrangement and reproduction of lines of text. The loss in composition freedom was substantial, but the gain in speed and ease of production was huge.

If we focus on Europe, where languages are mostly written using the Latin alphabet, the reduction meant that Gutenberg had to identify and select the most occurring components of Writing – in his context the letters of the Latin alphabet – and make them fit into a box according to a number of restrictions. These restrictions were supposed to make these boxes interchangeable. Everything else outside this selection of elements started to be treated as an illustration, carved in wooden blocks and kept separate from “text”.

Ok, but… the letters were already separate things before movable type, right? A is A, B is B, and so on. That’s true, but when one writes, say, by hand, the surface is not sliced, it is continuous. If we compare Writing to Music, this is probably easier to understand. Typography, like any visual arrangement, consists of as much white as black. It’s not black on white, it’s black AND white. It’s as much ink as paper. As many active pixels as inactive ones. As much fill as void. Now, transpose this to Music. What would be a note without the absence of sound surrounding it? The absence of sound contributes to the construction of the meaning as well as the presence of sound. What would be a typographic composition without the white surrounding the black? Just some messy spots of ink.

Movable type allows to break the continuum of Writing into rectangles of regular height. Because of production constraints, the graphism (the black shape) needs to be entirely contained into a box. The easiest and more logic way to do so was to cut between each black shape. This means that if the graphism is untouched, the void (white) needs to be sliced in different pieces and distributed across the boxes.

This re-organization of the surface of Writing is condensed and represented in a tool that Gutenberg actually invented: the adjustable mould. This tool allowed him to cast many little lead boxes with the same height but different width. Here follows a synthetic scheme of the process:

Digital Typesetting, from Solid to Virtual Boxes

Digital typesetting freed Typography from the physical limits of metal type. Even if it transported type from a three-dimensional to a two-dimensional one, it shares with its old ancestor the same strain for modularity and some customs.

For example, given that metal type consisted of physical elements used as bricks forming a wall of prefabricated letters, it made sense to keep track of the height of the box instead of the height of the glyphs represented in the box. This custom is still in use nowadays even if the boxes are completely virtual. Take this aspect into account when comparing two different fonts. Often, even if you use the same body size, the resulting letters will not be of the same height.

Characters and Mapping

Before exploring the realm of typesetting in Python, we need to learn how Python treats text data. Computers in general need to encode any kind of information down into binary notation. Text is not an exception. The way this encoding issue was originally solved is based on the alphabet and is therefore Western-centric. Each text element such as letters, digits, symbols, punctuation, white spaces and control elements as new line or delete was assigned to a number. The weird thing is that each component of a standard typographer case received a value. For example, A and a got a different value even if they represent the same letter in a different drawing structure. This is not a big deal with Latin, but this approach had serious consequences when mapping Hangul, Arabic and many non-alphabetical scripts.

The process of mapping a character to a number is called encoding. Morse Code is a very old kind of encoding. One of the oldest and most widespread standard for text encoding for the computer era is the (in)famous ASCII (American Standard Code for Information Interchange). Its origin goes way back to telegraph communication.

Because of its heritage and its special relation with the English language, ASCII characters are represented using a 7 bit integer value, meaning that the maximum index available in the mapping can be described with a binary number made of 7 digits.

	`2⁶`	`2⁵`	`2⁴`	`2³`	`2²`	`2¹`	`2⁰`
`0`	`0`	`0`	`0`	`0`	`0`	`0`	`0`
`1`	`0`	`0`	`0`	`0`	`0`	`0`	`1`
`2`	`0`	`0`	`0`	`0`	`0`	`1`	`0`
`[...]`
`125`	`1`	`1`	`1`	`1`	`1`	`0`	`1`
`126`	`1`	`1`	`1`	`1`	`1`	`1`	`0`
`127`	`1`	`1`	`1`	`1`	`1`	`1`	`1`

	`2⁷`	`2⁶`	`2⁵`	`2⁴`	`2³`	`2²`	`2¹`	`2⁰`
`0`	`0`	`0`	`0`	`0`	`0`	`0`	`0`	`0`
`1`	`0`	`0`	`0`	`0`	`0`	`0`	`0`	`1`
`2`	`0`	`0`	`0`	`0`	`0`	`0`	`1`	`0`
`[...]`
`253`	`1`	`1`	`1`	`1`	`1`	`1`	`0`	`1`
`254`	`1`	`1`	`1`	`1`	`1`	`1`	`1`	`0`
`255`	`1`	`1`	`1`	`1`	`1`	`1`	`1`	`1`

utf-8 hex	character	decimal
`U+002A`	`*`	`42`
`U+0036`	`6`	`54`
`U+2030`	`‰`	`8240`

Strings, Encoding and Unicode

The Invention of Movable Type

Digital Typesetting, from Solid to Virtual Boxes

Characters and Mapping

Text Data in Python

Accessing

Slicing

Checking inclusion

Test equality

Natural order

Concatenation