on making and using ebooks



by now i've put a fair amount of time into hobbyist book digitization and dedrming ebooks. this is my accumulated wisdom: reasons, tools, tricks. updated 2021-10-16.

obligatory note on piracy: it's illegal, which is to say wrong! an offense in the eyes of God, by facilitating it you will bring about the dissolution of orderly society. meditate long and hard on the tears you will bring to me if you use my techniques for evil, and more than that, the tears you will bring to “Mr. Routledge,” the publisher man in a raggedy suit whose loving care produced the most nondescript academic book covers in history. any references to the illicit distribution of ideas in what follows are literary embellishment.

origin of a habit

i once had a prefigurative vision of myself in old age, ravaged by tsundoku, lost in stacks of books which would consume all my funds while mostly never being read. naturally i resolved to avert this dread fate. tsundoku arises from a simple cause, i reasoned: when one feels the impulse of interest in a book, one obtains it with the purest intention to read. but mere possession brings with it enough satisfaction that the impulse to read slackens prematurely, and one listlessly discards the volume in favor of the next prize. the solution i devised was to make possession immaterial to me: switch to reading pirated ebooks rather than paper copies. not only would downloading an epub provide a less addictive pleasure than holding an object in my hands, but if i could just get over the learning curve, i'd save a sizable sum.

also, there was the problem of my decaying focus: by age thirteen i realized it was increasingly difficult for me to keep my eyes on a text for very long without my mind wandering. i'm still not sure whether to think of that as adhd or a vision problem or my being a victim of internet-induced mass attention death. i took to audiobooks to compensate for the issue, since the narrator's insistent voice refused me any chance for distraction. when no audiobook exists, i can use text-to-speech on an ebook for a similar effect, reading the text at the same time as i listen.

as in any fable of fate defied, the result of my efforts was whole new inconveniences. so used to immediately receiving a pdf of any book i wanted, i no longer know how to accept when something is not available. if i can't find something from the usual sources online, i'm overcome with anger, and will go to almost any length to obtain it. if i can get it from the library, i digitize it, and if i don't know the language, i learn to translate it (can't do the latter as often, of course).

my digitization workflow

several times now i've had someone ask me how to digitize a book properly. my method relies on basic command-line navigation skills, and the tools all run on linux, which honestly means my elaborate answers here are probably not useful to the askers. i haven't timed myself at work, but vaguely speaking it takes a couple hours to scan a couple books, and a couple more hours later to turn the scans into a useful format.

  1. to start with, i use a flatbed scanner—there are fancier setups that aim to be easier on a book's spine, but in my experience the strain is negligible and easily justified by the digital immortality that the book will gain by the ordeal.

  2. scan the pages in grayscale mode (not B/W), except where color is necessary. output should be tiff format, whatever high resolution option you have. you need 300dpi+ to get good ocr results, according to conventional wisdom, and erring on the higher side shouldn't hurt since you'll compress it in time for the pdf output. the actual scanning is tedious but tiring in the same way driving is; it's not stimulating but you need to pay a little attention to make sure you're getting every spread and they're not coming out wrong. the scan software should have a preview of your results, but i can't recommend a specific tool because i just use what runs on the scanner workstations at my library and then copy it to a usb key to take home. i pretend i'm back in boarding school and the book is a younger boy who hasn't been paying his dues. dunk his head in the toilet, hold it down, pull up for air, repeat. you get into the rhythm of it, and i fend off boredom by listening on my earbuds to whatever i digitized last week.

  3. post-process the tiffs with scantailor advanced. you can compile it yourself, or i think get a build from a ppa on ubuntu or from nixpkgs (haven't double-checked this). it's pretty straightforward to use and halfway automates most of the steps: fixing image orientation, splitting spreads into pages, deskewing pages, selecting the content region of the page, positioning the content on the output pages, and generating the final output. the main things you have to intervene in by hand are content selection, positioning, identifying regions you don't want converted to black and white, and maybe manually cleaning up stray marks or annotations. the virtue of letting scantailor convert grayscale to b/w instead of the scanner itself is that it can tell undesirable shadows from desirable text and gives you precise control over the line between black and white.

  4. you need to write a few metadata files. i don't put a lot of detail into them. create a metadata.yaml for the epub version in the project directory like this:

    title:
    - type: main
      text: "Title"
    creator:
    - role: author
      text: "Author"

    in the scantailor output directory (i make this a subfolder of the book project directory), make an equivalent file called metadata for the pdf:

    Author: "Author"
    Title: "Title"

    you'll also need a dummy bookmarks file to generate the pdf. you can make it a proper table of contents later, for now just do this:

    "Cover" 1

  5. cd into your scantailor output directory. to generate markdown and pdf output, you'll need tesseract for ocr (probably tesseract-ocr in your distro repos), hocr-combine from hocr-tools (install python from distro repos, then pip install hocr-tools), pandoc (probably in distro repos), imagemagick (ditto), and pdfbeads (install ruby from distro repos, then gem install pdfbeads). might have forgotten some dependencies. i use a script called bind, which looks like this:

    export COUNT=1
    export TOTAL=$(ls -1 *.tif | wc -l)
    for f in *.tif; do
      echo "OCRing $f ($COUNT of $TOTAL)"
      tesseract -l eng "$f" "$(basename "$f" .tif)" hocr
      export COUNT=$((COUNT + 1))
    done
    rename -v 's/\.hocr$/.html/' *.hocr
    hocr-combine *.html | pandoc -f html-native_divs-native_spans -t markdown+smart -o ../book.md
    pdfbeads -C bookmarks -M metadata > ../book.pdf

    change "eng" on the tesseract line to the relevant language code if the book isn't in english.

  6. you'll want to add a real table of contents to the pdf. the more manual way: open book.pdf and use the page numbers in it to write a better bookmarks file, then rerun the pdfbeads command from the above script to make a final pdf. the easier way is to use emacs toc-mode. if you can get it installed, it's extremely convenient! it does depend on you knowing your way around emacs, though.
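
for reference, a full bookmarks file just extends the dummy one: a quoted title and a pdf page number per line (all titles and numbers below are made up; i believe nested entries use leading tab indentation, but check the pdfbeads docs before relying on that):

```
"Cover" 1
"Introduction" 5
"Chapter 1" 13
	"Section 1.1" 15
"Chapter 2" 40
```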

  7. the markdown file book.md is the corpus you can use to generate an epub. you'll want to edit it to fix typos and remove page numbers (regex helps for this), add chapter headings, and if necessary insert images. this step is kinda optional, depending on marginal return. i'm often satisfied with only a pdf for nonfiction books, or i put the minimum of effort into the epub that i need to make it readable by text-to-speech. when you're satisfied, do pandoc book.md -o book.epub --metadata-file=metadata.yaml.
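
as an example of the regex cleanup: ocr'd page numbers usually end up as lines containing nothing but digits, so a sed one-liner (assuming gnu sed, and that book.md is in the current directory) strips most of them:

```shell
# delete any line that consists only of a page number, editing book.md in place
sed -E -i '/^[0-9]+$/d' book.md
```

check a diff before committing to this, since a line of real text like "1984" would also get deleted.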

converting and improving existing ebooks

many ebooks are put together badly in one way or another: pdfs that are badly scanned, have a spread per page, are skewed, have a bad or nonexistent ocr layer, or lack a table of contents. there are a few ways to correct them short of finding and rescanning the book.

books that i have put together

many of the things i've scanned i don't read, or not immediately, so i can't always say whether something is worth reading. more or less everything i've scanned or dedrmed is stored on aaarg, as well as in my ebook library.

dedrming books

to remove drm from a book, you will need calibre and the dedrmtools plugin. this works for the adobe digital editions pdf/epub books you might get from your school library (i use a windows vm for this part), as well as for books from the kindle store. for kindle books, it helps if you have a physical kindle and can use the "download and transfer via usb" option; otherwise you need some particular version of the kindle desktop software. amazon is lenient with refunds, so you can probably buy a book, download and dedrm it, then immediately refund it! however, no guarantees that they won't catch onto you and deny the refund, especially if you do it frequently.

it also used to be possible to sign up for audible with a new account and dummy credit card information, download the free trial audiobook very quickly before the payment verification failed, then decrypt it and remove the drm. however, i haven't done this in a long time, and i remember having trouble the last time. dedrming audible books should still be easy enough using audible-activator and ffmpeg; you'll just probably have to pay for them.
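
a sketch of the ffmpeg half of that: audible .aax files are just encrypted m4b audio, so once audible-activator has recovered your account's activation bytes (1CEB00DA below is a made-up placeholder, not a real value), ffmpeg can decrypt without re-encoding:

```shell
# decrypt an audible aax file to a plain m4b, copying the audio stream as-is
ffmpeg -activation_bytes 1CEB00DA -i book.aax -c copy book.m4b
```
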

tools for reading ebooks

i've ended up trying many ebook readers, and only a few have been particularly worthwhile.