Variable Waterfall Poster #1
The aim of this two-part tutorial is to guide you through writing a program able to typeset a "waterfall" poster using random words from a predefined list. A "waterfall" is a common configuration in type specimens, usually presenting various weights of a family or sizes of a font. Here we will try to create a slightly more generic program, able to present instances from any kind of variable font axis, not just weight. The user decides what to show: width, slant, optical size, you name it. It all depends on the font you choose to typeset with. The first part of the tutorial builds the basic structure using the neutral form of a variable font; the second part will make the poster "variable". The font in question is Skia, originally a GX font by Matthew Carter – now available in OTVar format – and preinstalled on any Apple computer. Skia has width and weight axes. Try it yourself.
We need some words to typeset this poster. Any large list of words will work; if you don't have any, you can start with the .txt files from Nina Stössinger's word-o-mat RoboFont extension. (Or, you could explore the resources of the Python NLTK.) These files present one word per line and have a credits section at the top of the file, delimited by five asterisks (*****). Here is a short sample:
Edited and abbreviated version of wordlist originally compiled by Hermit Dave from public/free subtitle sources, used with permission. Thanks!
Source: http://invokeit.wordpress.com/frequency-word-lists/
April 2014: Edited by Nina Stössinger, for use in word-o-mat Robofont extension. I have truncated the list and removed frequency counts.
May 2015: Edited by Roberto Arista, thanks to 'Morph-it!'
Source: http://dev.sslmit.unibo.it/linguistics/morph-it.php
Licensed under Creative Commons – Attribution / ShareAlike 3.0 license
http://creativecommons.org/licenses/by-sa/3.0/
*****
non
che
di
la
il
un
è
per
in
...
The first bit of code we will need is a function to load these words into memory. The function interface should at least accept the location on disk of a .txt file.
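A minimal first sketch could look like this (the UTF-8 encoding is an assumption about the word lists; the credits filtering comes next):

```python
# A minimal first version of loadWords(); credits filtering comes later
def loadWords(txtPath):
    with open(txtPath, encoding='utf-8') as txtFile:
        # a set comprehension: unique words, no specific order
        words = {eachLine.rstrip() for eachLine in txtFile}
    return words
```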
The with context manager ensures that the .txt file is automatically closed as soon as the interpreter leaves the code block. In addition, you have probably noticed that we are not using a list comprehension to pack the words into a collection; we are using a set comprehension instead. Why not use a list? A list would work too, but I have a rule of thumb for choosing between lists and sets: if you need unique elements and you don't need them to stay in a specific order, use a set. You are stating your intentions more clearly – good for semantics – and you will probably gain a lot of speed – good for performance. rstrip() is necessary to clean the '\n' character at the end of each line.
Now it's time to filter out the credits. Considering that the length of the credits is not constant across these files, we should use the five asterisks (*****) as a flag. We can loop through all the file lines, switch a boolean value to True as soon as we encounter the asterisks, and start collecting the words into a set.
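Here is a possible version of the loader, matching the description below:

```python
# Filtering the credits with a boolean switch. The switch is flipped at
# the end of the loop body, so the asterisks line itself is never added
def loadWords(txtPath):
    words = set()
    creditsAreOver = False
    with open(txtPath, encoding='utf-8') as txtFile:
        for eachLine in (line.rstrip() for line in txtFile):
            if creditsAreOver:
                words.add(eachLine)
            if eachLine == '*****':
                creditsAreOver = True
    return words
```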
The rstrip() string method is now invoked from a generator expression, directly as the iterable of the for loop. Avoid using lists if you need to access the data just once; they can noticeably slow down the program if the list is large enough. Generator expressions instead yield one element at a time, without loading the entire collection into memory. Note that we activate the asterisks switch at the end of the loop body, to filter the asterisks out of the set.
At this point, we can easily add a minChars filter to skim off words too short for the poster.
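For example (the default value of 4 is an arbitrary placeholder):

```python
# Adding a minChars filter to skip words that are too short
def loadWords(txtPath, minChars=4):
    words = set()
    creditsAreOver = False
    with open(txtPath, encoding='utf-8') as txtFile:
        for eachLine in (line.rstrip() for line in txtFile):
            if creditsAreOver and len(eachLine) >= minChars:
                words.add(eachLine)
            if eachLine == '*****':
                creditsAreOver = True
    return words
```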
To get a tightly packed poster, we should consider transforming each string to uppercase. But I don't think it is a good idea to bury the conversion inside the loadWords() function; we should be able to address this behaviour from the function interface. Python functions are first-class objects, just like any class – int(), bool() – so we can pass a conversion function directly as an argument of the loadWords() interface. We just need to wrap the string .upper() method into an uppercase(txt) function.
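Something along these lines (the conversionFunc keyword name is just a placeholder):

```python
# Passing the conversion function through the interface
def uppercase(txt):
    return txt.upper()

def loadWords(txtPath, minChars=4, conversionFunc=uppercase):
    words = set()
    creditsAreOver = False
    with open(txtPath, encoding='utf-8') as txtFile:
        for eachLine in (line.rstrip() for line in txtFile):
            if creditsAreOver and len(eachLine) >= minChars:
                words.add(conversionFunc(eachLine))
            if eachLine == '*****':
                creditsAreOver = True
    return words
```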
Once we have loaded the words into memory, we need to evaluate the width they develop with the font we want to proof. That's another bit of logic, so we should wrap this code into another function, calcWordsWidth(). The function interface needs a collection of strings – they don't necessarily need to be words – and the PostScript name of a font installed on the computer.
We are going to use the font() and textSize() functions from the DrawBot API. If you are programming from an environment other than the DrawBot application, remember to import the module using import drawBot or – as I prefer – import drawBot as dB.
Text width is directly proportional to point size, which means that we don't need to calculate the width of each word at different point sizes; we only need to do it for pointSize = 1, then we can scale the measure as needed. (Well, the geeks out there will probably point out that this is not necessarily true for fonts with an optical size axis. In that case, it is possible to extend the approach used in part two.)
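You can quickly verify the proportionality yourself ('Skia-Regular' is assumed here to be the PostScript name of Skia's default instance):

```python
import drawBot as dB

dB.font('Skia-Regular')
dB.fontSize(1)
widthAtOne, _ = dB.textSize('WATERFALL')
dB.fontSize(72)
widthAtSeventyTwo, _ = dB.textSize('WATERFALL')
# the two printed values should match
print(widthAtSeventyTwo, widthAtOne * 72)
```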
Bottom line: the calcWordsWidth() function loops through the words set and collects the width of each word typeset with a given font at pointSize = 1. The words are then grouped by width, using a very convenient defaultdict(set).
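A first sketch of the function:

```python
from collections import defaultdict
import drawBot as dB

def calcWordsWidth(words, fontName):
    width_2_words = defaultdict(set)
    dB.font(fontName)
    dB.fontSize(1)
    for eachWord in words:
        # width of the word typeset at pointSize = 1
        wordWidth, wordHeight = dB.textSize(eachWord)
        width_2_words[wordWidth].add(eachWord)
    return width_2_words
```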
Standard Python floating point numbers are way too detailed for the scale we are working at. This results in very small groups: according to the calculations in the previous script, we have an average of ~2.5 words per group. This means that we would likely obtain the same word more than once in the same poster, which is definitely not optimal. The risk is to have very little variability in exchange for a detail we don't really need. To avoid this we should make our calculation a bit less precise, using the round() built-in function. (Or, collect the widths at pointSize = 1000 and truncate the result to int.)
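The grouping with rounded keys could look like this (precision=2 is an assumption, matching the JSON sample further down):

```python
def calcWordsWidth(words, fontName, precision=2):
    width_2_words = defaultdict(set)
    dB.font(fontName)
    dB.fontSize(1)
    for eachWord in words:
        wordWidth, wordHeight = dB.textSize(eachWord)
        # rounding the key makes the groups larger
        width_2_words[round(wordWidth, precision)].add(eachWord)
    return width_2_words
```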
So, now it's time for a visual proof: let's write the drawPoster() function. How? We can use matrix transformations to move the reference point of each line in the poster; each line will be occupied by a word coming from the group with the smallest difference from the poster net width (we can't expect to find a group of words with the EXACT measure we need). This bit of logic can be wrapped into a findNearestGroupOfWords() function. (Never used a lambda function? The Python Sorting HOW TO could be an interesting read.)
The findNearestGroupOfWords() function will need access to the width_2_words dictionary, the available width on the page, and the point size. We should also find a way to randomly choose a word without repeating it more than once on the page (if a sufficient number of words is available). Here follows a possibility.
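A sketch of a possible implementation (the exact signature is an assumption):

```python
from random import shuffle

def findNearestGroupOfWords(width_2_words, netWidth, pointSize):
    # widths were collected at pointSize = 1, so scale the target down
    targetWidth = netWidth / pointSize
    nearestWidth = min(width_2_words.keys(),
                       key=lambda eachWidth: abs(eachWidth - targetWidth))
    # list() gives us order and keeps the source collection untouched
    wordsPool = list(width_2_words[nearestWidth])
    shuffle(wordsPool)
    return wordsPool
```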
findNearestGroupOfWords() creates a copy – not a reference – of the words contained in width_2_words. This copy is created through a list() constructor (so, not really a copy). With one move we gain the ability to address the order of the collection – moving from set() to list() – and we are also sure that we are not going to mess with the source. The list is shuffled each time the function is invoked and the words are picked with the list.pop() method. If the amount of words is less than the number of lines we need, findNearestGroupOfWords() is invoked again, protecting us from an IndexError. Let's extend this code to include some typesetting.
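A possible drawPoster() sketch; the page size, margin, and number of lines are placeholder values:

```python
def drawPoster(words, fontName, posterWidth=595, posterHeight=842,
               margin=40, linesAmount=12):
    dB.newPage(posterWidth, posterHeight)
    pointSize = (posterHeight - margin * 2) / linesAmount
    netWidth = posterWidth - margin * 2

    # note: recalculated at every invocation, see below
    width_2_words = calcWordsWidth(words, fontName)

    dB.font(fontName)
    dB.fontSize(pointSize)
    # move the reference point to the first line
    dB.translate(margin, posterHeight - margin - pointSize)
    wordsPool = findNearestGroupOfWords(width_2_words, netWidth, pointSize)
    for eachLine in range(linesAmount):
        try:
            eachWord = wordsPool.pop()
        except IndexError:
            # the pool ran out of words: refill it
            wordsPool = findNearestGroupOfWords(width_2_words, netWidth, pointSize)
            eachWord = wordsPool.pop()
        dB.text(eachWord, (0, 0))
        dB.translate(0, -pointSize)
```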
Here follow three iterations of the previous code (background color and composition made with another script)
If you try to generate a few posters (say 10), you will notice that the process is rather slow; it takes at least a few seconds. The reason is easily explained: the width_2_words dictionary is recalculated each time the drawPoster() function is invoked. This is totally unnecessary if the values influencing the calculation remain constant. The best solution is to save the data on disk, implementing a simple cache system.
The JSON file type works fairly well for this kind of operation. It is text-based, so we can see what's inside, and Python comes with a batteries-included module to read and write valid JSON. Think of JSON as a lightweight database solution. In fact, it is used all across the web to share data between servers and clients.
So, our program should perform the calculation only if a cache file is not on disk, and we can easily check this condition using the .exists() method of a Path() object from the pathlib module. Remember to save the JSON using a name reporting all the variables that can influence the calculation, which right now are language and fontName. We don't want to use SanFrancisco instead of Skia, or load English words instead of Italian words, and so on. That's very easy using an f-string.
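The check could look like this (the cache file naming scheme is just one possibility):

```python
import json
from pathlib import Path

language = 'italian'
fontName = 'Skia-Regular'
cachePath = Path(f'width2words_{language}_{fontName}.json')

if cachePath.exists():
    with open(cachePath, encoding='utf-8') as jsonFile:
        jsonData = json.load(jsonFile)
else:
    words = loadWords(Path(f'{language}.txt'))
    width_2_words = calcWordsWidth(words, fontName)
    with open(cachePath, mode='w', encoding='utf-8') as jsonFile:
        # sets are not JSON serializable, so convert them to lists;
        # float keys are converted to strings automatically
        json.dump({kk: list(vv) for kk, vv in width_2_words.items()},
                  jsonFile, indent=4)
```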
Here's an example of what the JSON file looks like:
"2.42": [
"SIRIA",
"FIGLI"
],
"7.18": [
"VENTILAZIONE",
"SPIEGHEREBBE",
"AVANGUARDIA",
"FANTASCIENZA",
"PEGGIORANDO",
"AUMENTANDO"
],
"2.57": [
"FALLI",
"FATTI",
"TETTI",
"PATTI"
],
Very similar to a Python dictionary, right? Consider that JSON cannot store floating point numbers as keys or sets as values, so these will be converted respectively to strings and lists when writing with the json.dump() function. We'll have to convert them back when we load the file from disk. It's just a one-liner with a dictionary comprehension.
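For example:

```python
# converting the keys back to floats and the values back to sets
width_2_words = {float(eachKey): set(eachValue)
                 for eachKey, eachValue in jsonData.items()}
```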
The performance gain is significant. Here are the results of a quick comparison made with the time.time() function from the Python standard library. With the list of English words, the gain is 400 to 1! 🚀
-----------
calculation italian.txt: 2.490s
load json italian.txt: 0.013s
-----------
calculation english.txt: 32.169s
load json english.txt: 0.079s
-----------
The entire script now looks like this
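Here is one possible assembly of the sketches above (the file names, the 'Skia-Regular' PostScript name, and the layout values are placeholders to adapt):

```python
import json
from collections import defaultdict
from pathlib import Path
from random import shuffle
import drawBot as dB


def uppercase(txt):
    return txt.upper()

def loadWords(txtPath, minChars=4, conversionFunc=uppercase):
    words = set()
    creditsAreOver = False
    with open(txtPath, encoding='utf-8') as txtFile:
        for eachLine in (line.rstrip() for line in txtFile):
            if creditsAreOver and len(eachLine) >= minChars:
                words.add(conversionFunc(eachLine))
            if eachLine == '*****':
                creditsAreOver = True
    return words

def calcWordsWidth(words, fontName, precision=2):
    width_2_words = defaultdict(set)
    dB.font(fontName)
    dB.fontSize(1)
    for eachWord in words:
        wordWidth, wordHeight = dB.textSize(eachWord)
        width_2_words[round(wordWidth, precision)].add(eachWord)
    return width_2_words

def findNearestGroupOfWords(width_2_words, netWidth, pointSize):
    targetWidth = netWidth / pointSize
    nearestWidth = min(width_2_words.keys(),
                       key=lambda eachWidth: abs(eachWidth - targetWidth))
    wordsPool = list(width_2_words[nearestWidth])
    shuffle(wordsPool)
    return wordsPool

# drawPoster() now receives the cached width_2_words
# instead of recalculating it at every invocation
def drawPoster(width_2_words, fontName, posterWidth=595, posterHeight=842,
               margin=40, linesAmount=12):
    dB.newPage(posterWidth, posterHeight)
    pointSize = (posterHeight - margin * 2) / linesAmount
    netWidth = posterWidth - margin * 2
    dB.font(fontName)
    dB.fontSize(pointSize)
    dB.translate(margin, posterHeight - margin - pointSize)
    wordsPool = findNearestGroupOfWords(width_2_words, netWidth, pointSize)
    for eachLine in range(linesAmount):
        try:
            eachWord = wordsPool.pop()
        except IndexError:
            wordsPool = findNearestGroupOfWords(width_2_words, netWidth, pointSize)
            eachWord = wordsPool.pop()
        dB.text(eachWord, (0, 0))
        dB.translate(0, -pointSize)


language = 'italian'
fontName = 'Skia-Regular'
cachePath = Path(f'width2words_{language}_{fontName}.json')

if cachePath.exists():
    with open(cachePath, encoding='utf-8') as jsonFile:
        width_2_words = {float(kk): set(vv)
                         for kk, vv in json.load(jsonFile).items()}
else:
    words = loadWords(Path(f'{language}.txt'))
    width_2_words = calcWordsWidth(words, fontName)
    with open(cachePath, mode='w', encoding='utf-8') as jsonFile:
        json.dump({kk: list(vv) for kk, vv in width_2_words.items()},
                  jsonFile, indent=4)

drawPoster(width_2_words, fontName)
dB.saveImage('waterfallPoster.pdf')
```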
Great! We have laid out all the basics for our variable waterfall poster. In the next article, we will expand the functionalities of this program to take full advantage of variable font technology. If you want to explore a different set of tools, here you can find another script that achieves a similar result using fontTools instead of DrawBot's textSize().