Chapter 5. Text File Format (Poor Man's Markup)

The goal is to archive the Collected Works (printed in the 1960s and 1970s by Progress Publishers and Foreign Languages Press) in two places: the Marxists Internet Archive and Project Gutenberg. The former stores its texts as HTML web pages. The latter stores plain-vanilla ASCII text, intended to last for several centuries (on the assumption that ASCII will outlive every other file format).

The MIA's web pages are heavily formatted ASCII text using HTML for the sake of presentation. Gutenberg's text files are plain vanilla ASCII text.

This document describes a 2-page, raw-data file format that uses plain ASCII with a little bit of HTML and a little bit of TeX. It's called "Poor Man's Markup". It's simpler than HTML markup, but slightly more complicated than Gutenberg's standards, because we want each ".tx" file to model or mirror the original book. Each text file contains at most two pages (a verso and a recto) as output by an OCR program, followed by cleaning and spell-checking. What is seen in a 2-page text file should match the original book. There are some exceptions: for example, an SGML comment that specifies the publisher and edition (using "src=").
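
The two properties described above (plain ASCII content, plus an SGML comment that records the source edition) could be checked with a short script. What follows is only a rough sketch in Python, not part of the project's tooling. The exact layout of the "src=" comment is not specified here, so the quoted form src="..." inside <!-- ... --> is an assumption, as are the function names and the sample edition string.

```python
import re

# ASSUMPTION: the source comment looks like <!-- src="..." -->.
# The real ".tx" convention may differ; adjust the patterns to match it.
COMMENT = re.compile(r'<!--(.*?)-->', re.DOTALL)
SRC_ATTR = re.compile(r'\bsrc\s*=\s*"([^"]*)"')


def non_ascii_positions(text):
    """Return (line, column, character) for every non-ASCII character."""
    found = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ord(ch) > 127:
                found.append((line_no, col, ch))
    return found


def source_edition(text):
    """Return the value of the first src="..." found inside an SGML comment."""
    for comment in COMMENT.finditer(text):
        match = SRC_ATTR.search(comment.group(1))
        if match:
            return match.group(1)
    return None


# Hypothetical file content; the edition string is made up for illustration.
sample = (
    '<!-- src="Progress Publishers, example edition" -->\n'
    'First line of the verso page.\n'
)

print(non_ascii_positions(sample))
print(source_edition(sample))
```

An empty list from non_ascii_positions means the text is pure ASCII. Anything else points at characters (curly quotes from OCR, for instance) that would need to be replaced before the file is archived.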

This 2-page, raw-data text file format should "model" (or match) the original Collected Works. It's important to have an electronic copy of one's source to serve as a baseline before making editorial decisions. Since there are several editions of the Works, there will potentially be two or more sets of text files for a single document.

Here is a quick guide to ".tx" files: Guide to Poor Man's Markup.

Here is a demonstration of ".tx" files: Sample of Poor Man's Markup.

These are also known as "Poor Man's Markup" files (see "Manual Edits" in the above document). It's an evolving "standard", so not much work has gone into formalizing guidelines (expect more formal guidelines near the end of 2005). The basic idea is to model, or mirror, or reproduce, or even photocopy, the original book using plain ASCII text in a short-hand fashion. Later, ".tx" files are used as a digital source for human-readable formats such as HTML and PDF.

Basically, ".tx" files are the bridge between the book and computerized versions of the book. Usually this type of bridge is built using XML. But XML doesn't match the original very closely because of all those tags and entities. TeX does a better job of matching the original, which is why ".tx" is a hybrid of text and TeX. These ".tx" files can be turned into XML, and after that, transformations can be done to get HTML and PDF. The above script simply goes directly from ".tx" to HTML.

Here are some samples:

TO BE CONTINUED: Review Gutenberg's FAQ and their plain ASCII format.