Books

So at this point, I had a device that can set its own time, read data off an SD card, and display stuff on a scrolling LED display. Now, I just had to find a way to get 19,000 some books onto it.

The Project Gutenberg page makes it relatively easy to browse their selection and search for particular books, but since I was looking to indiscriminately download all of their books, I needed something better.

In 2010, they released a DVD of 29,500 titles that is available via BitTorrent. After downloading the 7.7GB .iso file (and getting my U/D ratio over 10 to pay my dues), I took a look inside. The file contains an HTML file that is more or less an offline version of their website. Instead of linking to files online, browsing through it links to local copies of zip files that contain the desired content.

Browsing these files by hand is a nightmare, however, as they are organized in a folder structure that (I hope) was auto-generated and not really meant to be accessed directly by people.

What I did know is that the content that I wanted was all inside .zip files, so I started with an Automator script to gather and group together all of the .zip files.

This didn’t really work though because Automator is terrible, and many of the files were missed. I used Spotlight to search for the remaining files and manually copied and pasted them.

This left me with a metric butt-ton of .zip files. The filenames don’t really say anything about the contents, so the next step was to unzip them. I ordinarily would have used an unzip command in the terminal with an asterisk input to signify that I wanted to unzip all of the files in the folder, but that apparently stops working when the number of files gets ridiculous. Instead, I had to use this:

find . -name '*.zip' -exec unzip {} \;

Which runs the unzip command on each file individually.

As you can imagine, this took a while, and it was especially annoying since many text files included book cover images that all had the same name.
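For comparison, here is a minimal Python sketch of the same bulk unzip. It is not the command used above. The function name `unzip_all` and the one-folder-per-archive layout are assumptions made for illustration. Extracting each archive into its own folder sidesteps the identically named cover images overwriting each other.

```python
import zipfile
from pathlib import Path


def unzip_all(root, dest):
    """Extract every .zip found under `root` into its own folder under `dest`.

    Hypothetical helper: each archive gets a folder named after the zip file,
    so two archives that both contain e.g. "cover.jpg" do not collide.
    Archives that share a file name (in different folders) would still share
    a target folder.
    """
    root, dest = Path(root), Path(dest)
    count = 0
    # sorted() collects the list up front, before anything is extracted
    for archive in sorted(root.rglob("*.zip")):
        target = dest / archive.stem
        target.mkdir(parents=True, exist_ok=True)
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(target)
        count += 1
    return count
```

Calling `unzip_all("gutenberg_dvd", "unzipped")` (both paths made up here) would return the number of archives it extracted.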

With my butt-load of zips now a butt-load of files and folders, I set about organizing. Project Gutenberg doesn’t only contain books in txt format. There are several other filetypes such as images, music files, and HTML books. Music folders end in “-m” and html end in “-h”. This made it fairly easy to sort them out.

A few more Spotlight searches and Automator scripts later, and I had a folder containing 21,972 files.

I now had to start sorting and organizing them. I wrote a gigantic Python script that does this automatically (albeit very inefficiently). I’ll describe the steps below.

Book Processing

Outliers

Very few books were weird and had to be removed or fixed. 18251-mac.txt caused issues with Python which is fine because it was a duplicate of 18251-8.txt. Books whose filenames started with 10681 were parts of a thesaurus which I chose to remove. 12670-8.txt has no carriage return at the end of its first line.

There were a few other problems, but the moral of the story is that it’s difficult to get 20,000+ files formatted perfectly, and it was a pain to discover the outliers when a piece of code suddenly broke halfway through processing 20,000 titles.

Remove UTF-8 files

Some files have multiple copies with some ending in “-0”, some ending in “-8”, and some with no suffix at all. The “-0” indicates UTF-8 encoding. UTF-8 is a text encoding scheme that allows for a huge number of special characters (few of which my LED display supports). The “-8” files are in ISO-8859-1 (Latin-1) format, which is an 8-bit superset of ASCII that adds accented Western European characters. That made the duplicates a little strange, since the files with no suffix are plain ASCII and the “-8” copies looked the same to me.

The only difference I could detect between the ISO-8859-1 files and the ASCII files was a line near the top of the file describing its format.

While the preference is that all files be in ASCII, I wasn’t about to remove all of the files that were only available in UTF-8. My script simply removed duplicates with preference toward ISO-8859-1.
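The duplicate removal can be sketched roughly like this. The text only says that ISO-8859-1 was preferred, so ranking the no-suffix file ahead of the “-0” file is an assumption, as are the names `pick_preferred`, `SUFFIX_RANK`, and `NAME`.

```python
import re
from collections import defaultdict

# Lower rank wins. "-8" (ISO-8859-1) first is from the text above;
# plain ASCII before "-0" (UTF-8) is an assumption.
SUFFIX_RANK = {"-8": 0, "": 1, "-0": 2}
NAME = re.compile(r"^(\d+)(-8|-0)?\.txt$")


def pick_preferred(filenames):
    """Return one filename per book number, chosen by encoding preference.

    Names that don't look like "<number>[-8|-0].txt" are ignored here.
    """
    by_book = defaultdict(list)
    for name in filenames:
        match = NAME.match(name)
        if match:
            suffix = match.group(2) or ""
            by_book[match.group(1)].append((SUFFIX_RANK[suffix], name))
    return sorted(min(copies)[1] for copies in by_book.values())
```

A book that only exists as “-0” survives, since it has no competing copy.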

Non-English

Pretty early on, I noticed that some files were causing trouble with my scripts. A closer look showed the problem: they weren’t in English.

The next step was to remove non-English files. Like the file format, each file had a line near the top describing its language. All I had to do was search through the file for the first instance of “anguage” (capitalization of this word varied, so I accepted both) and see if the word “English” or “english” followed it. If no results turned up, the file was removed.
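A minimal sketch of that check follows. The function name `is_english` is made up, and limiting the search for “English” to the rest of the same line is an assumption about what “followed it” means.

```python
def is_english(text):
    """Return True if the first "anguage" marker is followed by English/english.

    Searching for "anguage" matches both "Language" and "language".
    Only the remainder of that line is inspected (an assumption).
    """
    index = text.find("anguage")
    if index == -1:
        return False  # no language line at all: treat as non-English
    rest_of_line = text[index:].split("\n", 1)[0]
    return "English" in rest_of_line or "english" in rest_of_line
```

Files for which this returns False would be the ones to remove.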

Remove headers and footers

Every file contains a header with information such as the title, author, date added to Project Gutenberg, and file format as well as a footer containing the Project Gutenberg license. According to the license, Project Gutenberg has a copyright on their compilation, but if all references to Project Gutenberg are removed from the files then they are just out of copyright files which anyone is free to use for whatever purpose.

So the next step was to remove any reference to PG from the files.

The first line of every book contained the book’s title usually in a format like:

Project Gutenberg's Alice's Adventures in Wonderland, by Lewis Carroll

I wanted my clock to display the book’s title, but not the phrase “Project Gutenberg’s”. The formatting of this line varied. Some books had it written as above, while a majority said “The Project Gutenberg EBook of” with a few variations on capitalization of EBook and inclusion of the word “of”. A small number (30 or so) had completely different formatting.

With all of the title variations accounted for, I set about cleaning up the files.

First, I took all of the text after “Project Gutenberg’s” or whatever and pasted it into the first line of a new file.

Next, I had to locate the end of the header section. The header always ended with something like:

*** START OF THIS PROJECT GUTENBERG EBOOK THROUGH THE LOOKING-GLASS ***

So I just looked for three consecutive asterisks and copied all of the text below that into the new file until I hit the footer. The footer always started with something like this:

*** END OF THIS PROJECT GUTENBERG EBOOK THROUGH THE LOOKING-GLASS ***

Though there were a few other variations that I had to account for.
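The header and footer stripping could look something like the sketch below. It only handles the two title formats and the marker lines quoted above. The regular expression and the names `TITLE_PREFIX` and `strip_gutenberg` are illustrative assumptions, and the 30 or so oddly formatted books would need extra cases.

```python
import re

# Matches "Project Gutenberg's " and "The Project Gutenberg EBook of "
# (any capitalization, "of" optional). Other variants are not covered.
TITLE_PREFIX = re.compile(
    r"^(?:The )?Project Gutenberg(?:'s|’s| e-?book)?(?: of)?[\s,:]*",
    re.IGNORECASE,
)


def strip_gutenberg(text):
    """Return (title, body) with the header and footer removed."""
    lines = text.splitlines()
    title = TITLE_PREFIX.sub("", lines[0]).strip()

    # The header ends at the first line containing three asterisks.
    start = next((i for i, line in enumerate(lines) if "***" in line), None)
    if start is None:
        raise ValueError("no '***' header marker found")

    # The footer starts at the next "*** END ..." style line, if any.
    end = len(lines)
    for i in range(start + 1, len(lines)):
        if lines[i].lstrip().startswith("***") and "END" in lines[i].upper():
            end = i
            break

    body = "\n".join(lines[start + 1:end]).strip("\n")
    return title, body
```

The title would then be written as the first line of the new file, followed by the body.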

Reformat paragraphs

As with my prototype, I wanted the Gutenberg Clock to display text one paragraph at a time. This was a problem as the books are formatted into fixed-width columns where carriage returns show up in the middle of sentences. While this is expected for lines of poems, I had to do something about the normal block text.

This was pretty easy as specially formatted lines tend to have leading spaces which are trivial to detect. If lines start with a space, they are copied rote. If they don’t, they have their carriage return chopped off and are joined with the line below.
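Here is a rough sketch of that joining rule. Treating a blank line as the end of a paragraph and joining wrapped lines with a single space are assumptions, and `join_paragraphs` is a made-up name.

```python
def join_paragraphs(text):
    """Turn hard-wrapped text into one line per paragraph.

    Lines starting with whitespace (poems, tables) are copied unchanged.
    Other lines are joined with the lines below them until a blank line
    or an indented line is reached.
    """
    out = []
    current = []

    def flush():
        if current:
            out.append(" ".join(current))
            current.clear()

    for line in text.splitlines():
        if not line.strip():
            flush()                  # blank line: paragraph is finished
        elif line[0] in " \t":
            flush()
            out.append(line)         # specially formatted line: keep as-is
        else:
            current.append(line.strip())
    flush()
    return out
```

Each item in the returned list is one chunk of text to send to the display.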

Chop up long paragraphs

Once all of the split-up paragraphs were joined into single lines, I noticed that some of them had too many characters to send to the display at one time. Curious about what kind of book would have a single paragraph with over 6,000 characters, I went digging and found a book that is basically one enormous block of numbers.

(Fun fact, the clock will scroll these numbers for approximately 36 hours)

The solution was to find the first space character that appeared within the last 5,000 characters of a line and split it there. Then repeat with the remainder of the line. If there were any long lines that had no spaces (Chinese text that escaped my previous check), the files were removed. The aforementioned book didn’t have a problem since moving the paragraph onto a single line added spaces where the line breaks originally were.
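One way to read that splitting rule is sketched below: cut at the last space that still keeps the chunk within the limit, then repeat on the remainder. That reading, the name `split_long`, and returning `None` to signal "remove this file" are all assumptions.

```python
def split_long(line, limit=5000):
    """Split `line` into chunks of at most `limit` characters at spaces.

    Returns a list of chunks, or None if an over-long stretch has no space
    to split on (the case where the whole file was thrown out).
    """
    chunks = []
    while len(line) > limit:
        # A space at index `limit` is fine: the chunk before it is `limit` long.
        cut = line.rfind(" ", 0, limit + 1)
        if cut == -1:
            return None
        chunks.append(line[:cut])
        line = line[cut + 1:]   # drop the space we split on
    chunks.append(line)
    return chunks
```

Lines already under the limit come back as a single-item list.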

Unidecode

Many of the books were still in UTF-8 which worked for the most part until a weird special character came up. My display can’t even show accent characters, so I had to find some way to remove them. I started by creating a list of allowed characters, but then I had the problem of what to do with disallowed characters. I couldn’t just remove them. Changing á to a would be nice, but then I’d have to write code for every single type of accent.

Fortunately, I stumbled across the Python package “Unidecode”, which intelligently converts Unicode into standard ASCII, replacing accented characters with their unaccented counterparts and even replacing some Greek or Chinese symbols with their phonetic spellings.
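Using the package takes only a couple of lines. The sketch below calls the real `unidecode()` function if Unidecode is installed. The fallback branch is not Unidecode: it is a standard-library stand-in added here so the example runs anywhere, and it only strips accents (anything else non-ASCII is dropped). The wrapper name `to_display_ascii` is made up.

```python
import unicodedata

try:
    from unidecode import unidecode  # third-party: pip install Unidecode
except ImportError:
    def unidecode(text):
        # Stand-in only: decompose accented letters and drop the accents.
        # Unlike the real package, this does no transliteration.
        decomposed = unicodedata.normalize("NFKD", text)
        return decomposed.encode("ascii", "ignore").decode("ascii")


def to_display_ascii(text):
    """Return an ASCII-only version of `text` for the LED display."""
    return unidecode(text)


print(to_display_ascii("café déjà vu"))
```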

Remove leading and trailing whitespace

The display can only show 14 characters at a time, so in many of the books, giant sections of whitespace at the beginning or end of a line (such as in the lines of a poem) caused the display to go completely blank.

I wrote a script which stripped all of the whitespace from both ends of every line.

Calendar

With all of that done, it was time to start working out exactly which book was going to play when. The plan was to include a single file that contained the filename of each book (now renumbered 0.txt to 19832.txt) and when it was to start being read.

Because I wanted to give the clock the ability to pick up where it left off, it was important to get this right. I wanted to make it so that if two clocks were turned on at different times, they would be reading their books more or less in sync.

As a review, the display has a local buffer of text. It will scroll this buffer continuously on repeat. As soon as it begins receiving new text over serial, it blanks the display and displays the new text as soon as it has been completely received. This doesn’t exactly lend itself to displaying continuously scrolling text for decades, but I was able to get it to work by very carefully timing when I send the next block of text over serial.

The goal is to transmit new text before the old text repeats. This doesn’t have to be super precise though. I added a bunch of blank spaces on to the end of every line to give me a larger window during which the previous line won’t repeat if the next line’s transmission is delayed.

Simply displaying the text with no repeats or glitches was one thing, but this time around, I had to predict exactly how long it would take to display. There were a number of factors contributing to this:

Scroll time

With the prototype I established that it takes about 137ms to display a single character. This was revised to a more accurate 136.39ms this time around. Each line also has 16 characters worth of blanking time added to the end so the visible text can be scrolled off the display before the next set of characters are transmitted.

Transmission time

I discovered late in the project that it takes a non-trivial amount of time to transmit text to the display’s buffer at 9600 baud. Some lines can easily be thousands of characters long, and with 10 bits per character (character + serial start and stop bit), serial address info, and the blanking characters on the end, it can take up to 4-5 seconds to transmit some text.

My solution was to add a delay that was somewhat proportional to the amount of text transmitted. The clock counts time in 136.39ms increments, and I worked out that it takes roughly one of those increments to transmit 130 characters of text. Dividing the length of the transmitted string by 130 gives me my incremental delay.

I also added a bit of a delay for “fixed costs” that I discovered. It always seems to take at least 120ms or so to process the incoming text regardless of how long or short it is.
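Putting the numbers above together gives a rough per-line time estimate. How the pieces were actually combined in the firmware isn't spelled out, so the formula below is an assumption, as are the function names. The first function just checks that 130 characters per tick is about what 9600 baud with 10 bits per character works out to.

```python
TICK_MS = 136.39        # time to scroll one character
BLANKING_CHARS = 16     # blanks appended to every line
CHARS_PER_TICK = 130    # characters transmitted per tick (rounded)
FIXED_COST_MS = 120     # processing overhead per transmission
BAUD = 9600
BITS_PER_CHAR = 10      # 8 data bits + start bit + stop bit


def chars_per_tick_from_baud():
    """Characters that fit through the serial link in one 136.39ms tick."""
    return BAUD / BITS_PER_CHAR * TICK_MS / 1000


def line_duration_ms(text):
    """Estimated time to transmit and fully scroll one line (assumed formula)."""
    transmitted = len(text) + BLANKING_CHARS
    scroll_ticks = transmitted
    transmit_ticks = transmitted // CHARS_PER_TICK   # whole ticks only (assumed)
    return (scroll_ticks + transmit_ticks) * TICK_MS + FIXED_COST_MS


print(round(chars_per_tick_from_baud(), 1))
print(round(line_duration_ms("x" * 5000) / 1000, 1), "seconds for 5,000 characters")
```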

It’s a bit of a bummer that to the reader, this display can be completely blank for several seconds between paragraphs, but it’s the only way I could get it to work without creating a custom display from scratch.

File access time

As I said before, the SD card is fairly slow. Reading a single line from a file is quick enough, but there are some delays between subsequent files that I had to account for. I fixed this by adding a 20 character delay (2.7278 seconds) to the end of every book.

When it reaches the end of a book, the firmware just dumps a bunch of blanks to the display to keep it dark until the next book is ready.
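With a duration for every book, building the calendar comes down to a running sum. This sketch assumes each book's length is already known in 136.39ms ticks and that the 20-character end-of-book delay is simply added on top. `build_calendar` and its parameters are illustrative names.

```python
from datetime import datetime, timedelta

TICK_S = 0.13639          # one character of scroll time, in seconds
END_OF_BOOK_TICKS = 20    # file-access padding after every book


def build_calendar(book_ticks, epoch=datetime(2000, 1, 1)):
    """Return the start time of each book, given each book's length in ticks."""
    starts = []
    elapsed_ticks = 0
    for ticks in book_ticks:
        starts.append(epoch + timedelta(seconds=elapsed_ticks * TICK_S))
        elapsed_ticks += ticks + END_OF_BOOK_TICKS
    return starts


for number, start in enumerate(build_calendar([80, 1000, 500])):
    print(f"{number}.txt starts at {start}")
```

Counting in whole ticks and converting to a time only at the end keeps rounding error from piling up over thousands of books.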

How long is forever?

With my book calendar done, I worked out that the first book, which was “read” starting at midnight on January 1, 2000, will not be repeated until 8pm on July 13th 2032. I was originally hoping that the books would stretch the entire century without repeating, but it fell short. Instead, they each appear three times.

This places the beginning of the last book to be read at 8:30pm on August 6th 2097. I was hoping to have it roll over into the next century, but unfortunately the GPS data only provides the last two digits of the year. In a sort of Y2.1K bug, there will be no way to determine which years are leap years as they don’t appear on matching years in subsequent centuries.
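These dates can be sanity-checked with a few lines of Python. The only inputs are the two dates quoted above (rounded to the hour, so the result is approximate). The last two lines show the leap year problem: 2000 was a leap year, 2100 won't be, and a two-digit "00" can't tell them apart.

```python
import calendar
from datetime import datetime

EPOCH = datetime(2000, 1, 1)
FIRST_REPEAT = datetime(2032, 7, 13, 20)   # "8pm on July 13th 2032"

cycle = FIRST_REPEAT - EPOCH               # one full pass through the library
end_of_third_pass = EPOCH + 3 * cycle

print("One pass takes", cycle.days, "days")
print("Third pass ends around", end_of_third_pass)

print("2000 leap year?", calendar.isleap(2000))
print("2100 leap year?", calendar.isleap(2100))
```

Three passes end around midday on August 7, 2097, which fits with the last book starting the evening before.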

But you know what? We probably also won’t have the same GPS satellites by then. The capacitors in the clock will have dried up. The non-volatile memory of the display will have worn out. Most importantly, I’ll be dead.

Firmware

Writing firmware for this clock was quite a task. I’ve never done anything that deals with quite so many pointers and arrays in C. In an effort to improve my C skills, I purchased The C Programming Language and started working through it. My code still isn’t too amazing, but it’s cleaner than it’s been.

While this firmware has to interface with a lot of different things, it has a pretty small set of tasks. Every time it boots up, it does the following:

Signal to user that it’s looking for satellites. Instruct them to place antenna near window.
Configure GPS to report time/date and GPS lock data
Wait for GPS to report GPS fix
Retrieve current date/time
Shut off GPS
Tell user it has the time and is “Searching library”
Open log file and figure out which book should be currently displayed based on date/time
Tell user it’s “Running…”
Estimate how far through the book you should be based on start time of book and current time
Start displaying book from that location
At end of book, start on next book
At end of book 19,832, start at book 0
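The "which book, and how far in" steps boil down to a lookup in the sorted list of start times. The firmware itself is written in C, so the Python below is only an illustration of the idea. `locate`, `start_s`, and `cycle_s` are made-up names, and times are plain seconds since the start of the calendar.

```python
from bisect import bisect_right


def locate(now_s, start_s, cycle_s):
    """Return (book index, seconds into that book) for the time `now_s`.

    start_s: sorted start offsets of each book in seconds, beginning with 0.
    cycle_s: total length of one pass through every book, in seconds.
    """
    offset = now_s % cycle_s                   # wrap around after the last book
    book = bisect_right(start_s, offset) - 1   # last book that has already started
    return book, offset - start_s[book]


starts = [0, 100, 250]                  # three tiny pretend books
print(locate(120, starts, 400))         # 20 seconds into book 1
print(locate(410, starts, 400))         # second pass, 10 seconds into book 0
```

Seconds into the book still have to be turned into a position in the text, which is where the per-line timing estimates come in.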

I imagine there will be quite a bit of clock drift and whatnot contributing to books not starting right on time, but seeing how this is an art piece and not an atomic clock, I’m not too concerned. I’ve determined that any clock drift will be due to the inaccuracy of the onboard crystal and not due to a problem with the firmware which is as good as I can hope. I could have it continue to update the clock using the GPS antenna, but that’s an awful lot of work for almost no benefit. Besides, if users want to set out the antenna during boot up and then tuck it back behind the clock during use, they can do that.

User Experience

There isn’t a whole lot to discuss here. To use the clock, the user just has to put the antenna near a window, plug the clock into the wall, and wait. The display will take a few minutes to find satellites (time spent doing this depends on when it was last powered off) before starting to display books. While it’s looking, it even instructs the user to place the antenna near a window.

Because I borrowed so much of the mechanical interface from the off-the-shelf LED marquee, this thing actually looks pretty consumer-ready. The only problem right now is the port for the GPS antenna. I had to remove the serial connection (which was RJ-45), and the cable doesn’t exactly fill the hole very well.

If I could, I would have mounted an SMA connection in this hole and allowed the user to remove/replace the antenna as needed. As is, the cable is attached to the PCB through a small amount of strain relief (tape), which isn’t as durable as I’d like.

Marketing

Because this is one of my more polished projects to date, I wanted to produce a promotional video with an equal amount of polish. I figured that a library was the best place to shoot something like this, and so contacted the Seattle Public Library to see if I could get permission to film.

They asked for a $75 donation per hour of film time, so I thought I'd look around for more low-key options. There's a used book store in my neighborhood which had a look that I think better matched my idea, and the owner of the shop was super nice and let me film for free!

The clock is wall-powered, and it would be a pain to have to keep running extension cords and hiding cables, so I cooked up a battery pack for the shoot. The clock needs 5V DC, so I figured that one of those fancy battery backup USB phone chargers would do the trick. I quickly hacked up the DC barrel jack from the power supply of the original LED marquee (the one I fried), soldered it to a USB cable, and picked up one of these:

[Photo: IMG_1199, http://ch00ftech.com/wp-content/uploads/2014/08/IMG_1199.jpg]

When I tried connecting the clock to the battery using my new cable, the battery shut off immediately. I tried a few combinations of powering on the battery before plugging in or vice-versa, and I triple checked that I had the polarity on my cable right. The cable worked when plugged into a wall-powered phone charger, just not with this battery. I took some measurements and determined that the average current draw was around 100-200mA, so it wasn't like I was overdriving the battery, which can easily source an amp.

What I discovered was that my device had too high of an inrush current. When connected to the USB supply, all of the clock's internal bypass caps charge up immediately and suck an enormous amount of current from the battery. The USB spec actually has requirements regarding how much inrush current you're allowed to have, and without taking any specific measurements, I figured that the several thousands of microfarads of capacitance on the board were drawing too much.

The solution was to build an inline inrush current limiter. This is a simple circuit that can be made with a FET and a few passives.

[Schematic: inrush, http://ch00ftech.com/wp-content/uploads/2014/08/inrush.png]

The idea is that when power is connected to the USB side, the FET is off. R1 and C1 provide a low pass filter that slowly raises the gate voltage of Q1, which slowly turns on. As Q1 turns on, the voltage across CLOCK_5V and CLOCK_GND slowly rises. This ensures that the clock turns on without drawing a bunch of current all at once. R2 simply provides a path for C1 to discharge when it's unplugged. Otherwise, the FET will stay on when the circuit is unpowered, preventing it from slowly turning on the clock in the future (this part isn't 100% necessary since the cap will internally discharge slowly, but it could take a few minutes).

The circuit was too simple to justify etching a PCB, so I just did it on perf board. You might recognize this flake of perf board from my trip to China (http://ch00ftech.com/2012/09/10/china-is-awesome/).

[Photo: inrush board, http://ch00ftech.com/wp-content/uploads/2014/08/inrush-board.jpg]

A little heat shrink and electrical tape, and voila.

[Photo: inrush cable, http://ch00ftech.com/wp-content/uploads/2014/08/inrush-cable.jpg]

Worked on the first try.

I shot the video on the SLR that I use for all of my videos, this time opting to use the 24p option to simulate a film look. I also invested in a camera slider to help me get some smooth tracking shots.

Conclusion

So there you have it. This is my first attempt to create something that I would consider more Art than Engineering, and I'm very happy with the result. I'm curious to see what the general response to it is.

There are still a few bugs with the display where I'll occasionally see text get halfway across before disappearing. My running theory is that timing bugs sometimes delay serial messages and cause the text to repeat before getting a new paragraph. I speculate that it's mostly due to delays inherent in the display, which after all is not designed for this use, and there's probably nothing I can do to fix it. If I wanted to make this into a real consumer device, I'd have to build the display from scratch, but that's pretty much true anyway.

Unlike the QR clock, I don't think I'll be racing to push these out to production. Sure, I could re-design the display from scratch, but the BOM cost from the GPS, display circuit, and enclosure would place the asking price of this thing well over $500.