Strings, Encoding and Unicode
Since the advent of personal computers and desktop publishing, more and more people typeset fonts on a screen. This practice was once confined to a professional category. Nowadays typesetting is ubiquitous. It is so widespread that writing and typesetting are almost considered the same thing. The differences between the two realms are important and a visual designer should be aware of them.
A broad notion of Writing (do not mistake Calligraphy for Writing) could be considered the visual arrangement of language on a support to convey meaning. It is indeed a very broad definition. Somehow, even performative arts fit it. Writing implies the use of any tool on any kind of surface, as long as it arranges visual language. What about Braille notation? That is Writing too, but it is a bit out of the scope of this manual.
The definition of Typography is narrower. Typography is the arrangement of visual language using elements especially designed for reproduction and reuse: types. These types can be made of lead, film, bytes. It doesn’t matter as long as they are prefabricated. So, Typography is definitely a subset of Writing. Calligraphy or Lettering are other subsets of Writing and they share with Typography the fact that they all deal with letter shapes, but in very different ways.
Gutenberg and Movable Type
Think about this: if you ask yourself what Gutenberg actually invented, you will not easily find an “object” to imagine. Presses, paper, ink, metal stamps, the Latin alphabet, books: they were already there. Can we name his invention more precisely? Of course: Gutenberg made the continuous surface of Writing discrete, that is, consisting of separate elements.
We could state that making something discrete is defining an atom or a set of rules which will reduce substantially the possibilities allowed by the system: from the widest infinite set of options to a narrower infinite set of options. It means setting one or more thresholds which will define unambiguously which information will be discarded or not. Take for example the scale plans in architecture: the 1 to something value defines the threshold in representation which sets the amount of information and detail conveyed by the image.
How could a loss of detail and options be a good thing? Well, the idea is to discard what we don’t need to achieve a certain goal, in order to make our process more efficient. Fewer options mean quicker transformations, less data to transfer, quicker testing and so on.
In different words, Gutenberg reduced the inexhaustible possibilities of Writing to speed up the arrangement and reproduction of lines of text. The loss in composition freedom was substantial, but the gain in speed and ease of production was huge.
In the case of Gutenberg, this reduction meant that he had to identify and select the most frequently occurring components of Writing – in his context the letters of the Latin alphabet – and make them fit into a box according to a number of restrictions. These restrictions were supposed to make these boxes interchangeable. Everything else outside this selection of elements started to be treated as illustration, carved in wooden blocks and kept separate from “text”.
Ok, but… the letters were already separate things before Gutenberg’s invention, right? A is A, B is B and so on. That’s true, but when one writes by hand, say, the surface is not sliced: it is continuous. If we compare Writing to Music, this is probably easier to understand. Typography, like any visual arrangement, consists of as much white as black. It’s not black on white, it’s black and white. It’s as much ink as paper. As many active pixels as inactive ones. As much fill as void. Now, transpose this to Music. What would a note be without the absence of sound surrounding it? The absence of sound contributes to the construction of meaning as much as the presence of sound. What would a typographic composition be without the white surrounding the black? Just some messy spots of ink.
What Gutenberg did was to break the continuum of Writing into rectangles of regular height. Because of production constraints, Gutenberg also wanted to have the graphism (the black shape) entirely contained in a box. The easiest and most logical way to do so was to cut between each black shape. This means that if the graphism is left untouched, the void (white) needs to be sliced into different pieces and distributed across the boxes.
This re-organization of the surface of Writing is condensed and represented in a tool that Gutenberg actually invented: the adjustable mould. This tool allowed him to cast many little lead boxes with the same height but different width. Here follows a synthetic scheme of the process:
Digital Typesetting, from Solid to Virtual Boxes
Digital typesetting freed Typography from the physical limits of metal type. Even if it moved type from a three-dimensional space to a two-dimensional one, it shares with its ancestor the same striving for modularity and some customs.
For example, given that metal type consisted of physical elements used as bricks forming a wall of prefabricated letters, it made sense to keep track of the height of the box instead of the height of the glyphs represented in the box. This custom is still in use nowadays even if the boxes are completely virtual. Take this aspect into account when comparing two different fonts. Often, even if you use the same body size, the resulting letters will not be of the same height.
Characters and Mapping
Before exploring the realm of typesetting in Python, we need to learn how Python treats text data. Computers in general need to encode any kind of information down into binary notation. Text is no exception. The way this encoding issue was originally solved is based on the alphabet and is therefore Western-centric. Each text element, such as letters, digits, symbols, punctuation, white spaces and control elements like new line or delete, was assigned to a number. The weird thing is that each component of a standard typographer's case received a value. For example, A and a got different values even if they represent the same letter in a different drawing structure. This is not a big deal with Latin, but this approach had serious consequences when mapping Hangul, Arabic and many non-alphabetical scripts.
The process of mapping a character to a number is called encoding. Morse Code is a very old kind of encoding. One of the oldest and most widespread standards for text encoding in the computer era is the (in)famous ASCII (American Standard Code for Information Interchange). Its origin goes way back to telegraph communication.
Because of its heritage and its special relation with the English language, ASCII characters are represented using a 7 bit integer value, meaning that the maximum index available in the mapping can be described with a binary number made of 7 digits.
decimal | 2⁶ | 2⁵ | 2⁴ | 2³ | 2² | 2¹ | 2⁰ |
---|---|---|---|---|---|---|---|
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
2 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
[...] | | | | | | | |
125 | 1 | 1 | 1 | 1 | 1 | 0 | 1 |
126 | 1 | 1 | 1 | 1 | 1 | 1 | 0 |
127 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
2⁶+2⁵+2⁴+2³+2²+2¹+2⁰ = 127
If we include the 0, the available indexes sum up to 128 options.
When 8-bit bytes became the standard in computing, each character started to be represented with an 8-digit binary number (1 byte). This opened up the mapping to 256 options.
decimal | 2⁷ | 2⁶ | 2⁵ | 2⁴ | 2³ | 2² | 2¹ | 2⁰ |
---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
2 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
[...] | | | | | | | | |
253 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 |
254 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 |
255 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
2⁷+2⁶+2⁵+2⁴+2³+2²+2¹+2⁰ = 255
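The arithmetic behind the two tables above can be checked with a few lines of Python. This is only a sketch: `bit_pattern` is a hypothetical helper name chosen for this example, built on the standard `format()` function.

```python
def bit_pattern(value, width):
    """Return value in binary notation, zero-padded to width digits."""
    return format(value, f'0{width}b')

for width in (7, 8):
    slots = 2 ** width      # number of available indexes, 0 included
    highest = slots - 1     # the value with every bit set to 1
    print(width, slots, highest, bit_pattern(highest, width))
# 7 128 127 1111111
# 8 256 255 11111111

# The sum of the powers of two in the 7-bit table
seven_bit_sum = sum(2 ** power for power in range(7))   # 127
```

Counting the 0, a mapping of `width` bits always offers `2 ** width` positions, which is why the highest index is one less than that.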
This extra space in the mapping (positions 128 to 255) was implemented in different ways across operating systems and vendors, such as Apple (Mac OS Roman), IBM and Microsoft (code pages and ISO 8859 variants), and across countries. Exchanging information across different computers became more difficult, until the Unicode Consortium managed to create a new, wider, OS-independent mapping standard. The Unicode Consortium finally began to look beyond English-speaking countries. The first 256 positions of the new mapping were then aligned with ISO Latin 1 (ISO 8859-1), as you see in the following diagram.
Unicode was originally designed so that each character could be represented with two bytes, opening the mapping to 65,536 options. The mapping has since been extended beyond two bytes: most emojis, for example, are stored in the Supplementary Multilingual Plane, which starts just after U+FFFF. We will sometimes refer to the Unicode mapping using the abbreviation UTF-8. Strictly speaking, UTF-8 is not the mapping itself but a clever way to store Unicode characters, using from one to four bytes according to their position in the mapping.
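To see how UTF-8 saves data according to the position of a character, we can count the bytes it needs for a few sample characters. The snippet is a minimal sketch; `utf8_length` is a name invented for this example, while `str.encode()` and `ord()` are standard Python.

```python
def utf8_length(character):
    """Return how many bytes UTF-8 needs to store one character."""
    return len(character.encode('utf-8'))

for character in ('A', 'À', '‰', '😀'):
    # ord() gives the position in the Unicode mapping (see further below)
    print(f'U+{ord(character):04X}', utf8_length(character))
# U+0041 1
# U+00C0 2
# U+2030 3
# U+1F600 4
```

Characters in the old ASCII range cost a single byte, while an emoji beyond U+FFFF costs four.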
Text Data in Python
After this lengthy preamble, it is time to ask: how do we work with text data using Python? In the specific scenario of this manual, we will use the standard Python string object to create, store and manipulate text. Then we will use DrawBot to typeset the text onto a PDF canvas.
The Python string, like tuples and lists, is a sequence: a collection of values where order is significant. Tuples and lists are generic containers: they can store any kind of data. A string is especially designed for representing immutable sequences of text characters. A schematic representation of a string could be
Sequence indexing starts from 0, meaning that a string of n elements has its elements indexed from 0 to n-1 inclusive.
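A short sketch can make both properties visible: the indexes of a string and its immutability. The variable names are only illustrative.

```python
word = 'hello'

n = len(word)                    # 5 elements
indexes = list(range(n))         # [0, 1, 2, 3, 4], from 0 to n-1
pairs = list(enumerate(word))    # [(0, 'h'), (1, 'e'), (2, 'l'), (3, 'l'), (4, 'o')]

# A string is immutable: assigning to a position raises a TypeError
try:
    word[0] = 'j'
except TypeError as error:
    print('cannot modify a string in place:', error)

# To "change" a string we build a new one instead
changed = word.replace('h', 'j')   # 'jello', while word is still 'hello'
```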
A string can be created using single ' (U+0027) or double " (U+0022) straight quotes. Having two options is convenient when you need to insert a delimiter character into the string:
"don't look at me!"
'he said "how dare you?" with an angry tone!'
Otherwise, an escape backslash character can be used to inform the interpreter that the character following the backslash is not the end of the sequence but part of it:
'don\'t look at me!'
Other commonly escaped characters are the newline (\n) and the tab (\t). In addition, Python supports triple-quoted string declarations for strings which embed newline characters naturally. These are used especially for code documentation purposes.
aMultiLineString = """this
is
an ugly
multiline
string"""
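As a quick check, a triple-quoted string and a single-line string with explicit \n escapes hold exactly the same characters. The following sketch compares the two; the variable names are illustrative.

```python
multi_line = """this
is
an ugly
multiline
string"""

escaped = 'this\nis\nan ugly\nmultiline\nstring'

same_text = multi_line == escaped    # True: the newlines are stored as \n
lines = multi_line.splitlines()      # ['this', 'is', 'an ugly', 'multiline', 'string']

# An escape sequence counts as one character, not two
tabbed = 'name\tvalue'
tabbed_length = len(tabbed)          # 10
```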
As said before, each character is mapped to a numerical value according to the Unicode standard. Python provides the built-in function ord() to get the position of a character within the standard mapping.
ord('À') # 192
Conversely, Python makes it possible to access a mapping position using an integer value through the chr() function. Given that each number is assigned to one and only one character, using 192 as input value we will obtain:
chr(192) # 'À'
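Since ord() and chr() are the inverse of each other, a round trip always gives back the starting character. A small sketch, with illustrative variable names:

```python
for character in 'Aa À':
    position = ord(character)
    # chr() undoes ord(): we get the same character back
    assert chr(position) == character
    print(repr(character), position)
# 'A' 65
# 'a' 97
# ' ' 32
# 'À' 192

# Uppercase and lowercase occupy different positions in the mapping
case_gap = ord('a') - ord('A')   # 32

# ord() accepts exactly one character
try:
    ord('ab')
except TypeError as error:
    print('ord() needs a single character:', error)
```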
References to Unicode characters are rarely made using their base 10 value. Most commonly used characters fit in two bytes, so their position can be written conveniently as a four-digit hexadecimal number prefixed by U+, as:
Unicode code point | character | decimal |
---|---|---|
U+002A | * | 42 |
U+0036 | 6 | 54 |
U+2030 | ‰ | 8240 |
This notation can also be included in a Python string declaration using the \u escape sequence, followed by exactly four hexadecimal digits:
'\u002A' # *
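Moving between the decimal value, the hexadecimal notation and the character itself takes only standard functions. In this sketch `code_point_label` is a hypothetical helper written for the example, not a Python built-in.

```python
def code_point_label(character):
    """Format the position of a character in the U+XXXX notation."""
    return f'U+{ord(character):04X}'

label = code_point_label('‰')        # 'U+2030'
decimal = int('2030', 16)            # 8240, reading the digits as base 16
hexadecimal = hex(8240)              # '0x2030'

# Three ways to write the same character
same = '\u2030' == chr(0x2030) == '‰'    # True

# Beyond U+FFFF four digits are not enough: \U takes eight of them
emoji = '\U0001F600'
emoji_label = code_point_label(emoji)    # 'U+1F600'
```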
Strings, like the other basic data types, have a built-in constructor function named str(). This function allows the conversion of a different type of object into a string. It accepts almost any kind of object, but it has a limited range of options.
str(2018) # '2018'
str(3.14) # '3.14'
str(False) # 'False'
str(None) # 'None'
Far more detailed conversion methods are explained in Transform Strings.
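One common reason to call str() is that Python does not convert other types to text implicitly when concatenating. A minimal sketch, with illustrative names:

```python
year = 2018

# Mixing a string and an integer with + raises a TypeError
try:
    caption = 'Year: ' + year
except TypeError as error:
    print('explicit conversion needed:', error)

# Converting first makes the concatenation work
caption = 'Year: ' + str(year)    # 'Year: 2018'

# str() also works on containers, using their standard representation
as_text = str([1, 2, 3])          # '[1, 2, 3]'
```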
As mentioned before, strings are sequences. Python provides a specific syntax to perform some basic sequence operations like accessing, slicing, checking inclusion, test for equivalence and concatenation.
Accessing
It is possible to access the content of a sequence using an index from 0 to the sequence length minus 1.
'hello'[0] #'h'
'hello'[1] #'e'
'hello'[2] #'l'
'hello'[3] #'l'
'hello'[4] #'o'
Python also makes it possible to use a negative index. This can be read as sequence length minus the absolute value of the index. It’s like starting from the end. For example:
'hello'[-1] #'o', 5-1 = 4
'hello'[-2] #'l', 5-2 = 3
'hello'[-3] #'l', 5-3 = 2
'hello'[-4] #'e', 5-4 = 1
'hello'[-5] #'h', 5-5 = 0
Take into account that providing an index equal to or higher than the sequence length, or lower than minus the sequence length, will cause an IndexError at runtime.
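If an index may fall outside the valid range, the error can be caught instead of stopping the program. The helper below, `char_at`, is a hypothetical function written for this example.

```python
def char_at(text, index, default=''):
    """Return text[index], or default when the index is out of range."""
    try:
        return text[index]
    except IndexError:
        return default

word = 'hello'
last = char_at(word, 4)           # 'o', the last valid positive index
too_far = char_at(word, 5)        # '', 5 is equal to the length
first = char_at(word, -5)         # 'h', the lowest valid negative index
too_far_back = char_at(word, -6)  # '', one step beyond the beginning
```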
Slicing
You can create a sub-sequence, as a new object, using an extension of the accessing syntax. It works like this:
sequence[start:end:step]
Remember that end is the index of the first value not to be included in the subsequence.
'abcdefghijklmnopqrstuvwxyz'[4:24:2] # egikmoqsuw
The arguments can be omitted in order to make them implicit, in the following way:
sequence[start:end:] #or sequence[start:end]
sequence[start::] #or sequence[start:]
sequence[:end:] #or sequence[:end]
sequence[::step]
As for the accessing syntax, start and end can have a negative value.
'abcdefghijklmnopqrstuvwxyz'[:-2:2] # acegikmoqsuw
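A few more combinations of start, end and step, sketched on the same alphabet string. The variable names are illustrative.

```python
alphabet = 'abcdefghijklmnopqrstuvwxyz'

first_five = alphabet[:5]        # 'abcde'
last_three = alphabet[-3:]       # 'xyz'
every_fifth = alphabet[::5]      # 'afkpuz'

# A negative step walks the sequence backwards
backwards = alphabet[:3][::-1]   # 'cba'

# Unlike accessing, slicing does not raise an IndexError:
# an end beyond the length is simply clipped
clipped = alphabet[20:100]       # 'uvwxyz'
```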
Checking inclusion
value in sequence
value not in sequence
These operators form a boolean expression that verifies whether a sequence contains a value or not. For example:
'l' in 'hello' # True
'm' in 'hello' # False
'l' not in 'hello' # False
'm' not in 'hello' # True
With strings, this expression also accepts sub-sequences, as in:
'll' in 'hello' # True
'hl' in 'hello' # False, even though both letters are in the second string, they don’t follow the same order
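Matching a sub-sequence is a peculiarity of strings: other sequences, such as lists, only compare single elements. The sketch below also uses the standard string methods find() and count() when the position or the number of occurrences matters.

```python
word = 'hello'

in_string = 'll' in word          # True: 'll' is a substring of 'hello'
in_list = 'll' in list(word)      # False: no single element equals 'll'

position = word.find('ll')        # 2, the index where the match starts
missing = word.find('hl')         # -1, find() signals "not found" this way
occurrences = word.count('l')     # 2
```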
Test equality
If you need to compare the content of two different strings, you should use the equality operators.
"hello" == "hello"
# True
"world" != "World"
# True
Beware of identity operators. Python may cache short strings, so two strings with the same content may or may not be the same object; their identity is therefore not a reliable test. So, always prefer 'Meooow' != 'meooow' to 'Meooow' is not 'meooow'.
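The difference between equality and identity can be observed by building the same text in two ways. This is a sketch: the result of the identity check depends on the interpreter and should be read as an implementation detail, so no value is claimed for it here.

```python
first = 'meooow'
second = ''.join(['meo', 'oow'])   # same content, assembled at runtime

same_content = first == second     # True: equality compares the characters
same_object = first is second      # depends on the interpreter's caching

print(same_content, same_object)
```

Only `same_content` answers the question "do these two strings hold the same text?".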
Natural order
The following operators can be helpful to determine how two strings relate according to a lexicographical order.
>, >=, <, <=
Why not alphabetical? We could consider “lexicographical” as a wider notion of alphabetical, where the sorting references a set of characters wider than the Latin alphabet. Think for a moment: what would happen if you asked the interpreter which one comes first between some Latin and some Cyrillic characters? The reference character set is the Unicode mapping. Let’s consider:
'habit' < 'hat'
In order to solve such an expression, Python converts each character into its integer Unicode representation, and then evaluates their order starting from the left:
'habit' # 104 97 98 105 116
'hat' # 104 97 116
The first two character representations are equal, but the third one (b vs. t) is not, meaning that "habit" is less than "hat". In fact the expression evaluates to True. This matches our common notion of alphabetical order. Then consider the following:
'Mango' > 'blueberry'
We would also expect this expression to evaluate to True, because 'b' comes before 'm' in the Latin alphabet. Well, not really, because the uppercase 'M' comes before the lowercase 'b' in the Unicode mapping. I know it doesn’t make sense at first sight, but if capital and lowercase characters occupy two different spots in the encoding, this behaviour is somehow unavoidable. Let’s check the numbers:
'Mango' # 77 97 110 103 111
'blueberry' # 98 108 117 101 98 101 114 114 121
Which means that
'Mango' > 'blueberry' # False
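When an order closer to the alphabetical one is needed, one possible approach is to compare the strings after neutralising the case. The sketch below uses the standard sorted() function with str.casefold as key; the list of fruits is made up for the example.

```python
fruits = ['Mango', 'blueberry', 'apple', 'Cherry']

# Plain lexicographical order: every uppercase letter comes first
by_code_point = sorted(fruits)
# ['Cherry', 'Mango', 'apple', 'blueberry']

# Comparing case-folded copies restores the expected order
ignoring_case = sorted(fruits, key=str.casefold)
# ['apple', 'blueberry', 'Cherry', 'Mango']

# The numbers Python actually compares
mango_points = [ord(character) for character in 'Mango']
# [77, 97, 110, 103, 111]
```

Note that this only addresses the case issue; proper sorting for a given language involves more rules than a single key function can express.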
Concatenation
Python allows the concatenation of sequences using the + and * operators. For example
'abcdef' + 'ghijkl' # abcdefghijkl
'abcdef' * 3 # abcdefabcdefabcdef
Note that a sequence can be multiplied only by an integer value, no floating points allowed. Note also that repeated concatenation is not memory efficient, because each + creates a new string. If you need to concatenate intensively, consider using the string .join() method.
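A short sketch of the .join() method, together with the integer-only rule for multiplication. The variable names are illustrative.

```python
pieces = ['ab', 'cd', 'ef']

# The string on which join() is called is placed between the pieces
joined = ''.join(pieces)      # 'abcdef'
spaced = ' '.join(pieces)     # 'ab cd ef'

# join() only accepts strings, so other types need str() first
numbers = ', '.join(str(number) for number in range(1, 4))   # '1, 2, 3'

# Multiplying by a float raises a TypeError
try:
    'ab' * 1.5
except TypeError as error:
    print('only integers allowed:', error)
```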