PyGE Tech Notes Page

Python Gutenberg E-text Project

General Notes

1. Rationale for using E-texts

Perhaps we should first look at the advantages offered by physical books:

easy to carry around, pick up, or put down,
page oriented with generally pleasant formatting,
can be read in comfortable surroundings,
can have places easily marked (bookmarks, folded corners),
can have personal notes written inside.

And then list a few disadvantages:

costs money to buy,
single copy can only be used in one place at a time,
can be lost or misplaced,
can deteriorate over time.

E-texts seem to offer effective solutions to the disadvantages listed above. Electronic literary selections from free archives like Project Gutenberg would solve problems of cost, while allowing unlimited copies of works to be kept by users on any number of devices at the same time. The ability to keep multiple copies reduces the chances of irretrievably losing a valued work; and electronic versions do not deteriorate with time.

In order to make reading e-texts a desirable alternative or complement to reading physical books, the user experience with e-texts should also provide many of the advantages of regular books. In practice, not all of the advantages may be achievable at the same time. For example, electronic devices that are small and portable may not be able to provide high-resolution displays with visual quality comparable to a printed page. And larger computer displays that provide high-quality text display may not be conveniently used in all environments in which a user might wish to read a book, such as sitting on the beach. The question is whether the compromises can be made acceptable for a wide range of users, and whether additional benefits of e-texts can make up for perceived shortcomings.

Electronic storage of e-texts also offers some capabilities that are not available with ordinary books written on paper. Software allows e-texts to be easily searched for particular words or phrases. With some e-text formats, user-supplied material such as annotations and notes may be appended to a file. These personal entries can then also be electronically indexed, searched, and recalled as needed.

We would suggest that the use of e-texts be considered in a scenario which combines the prevalence and utility of PDA devices with the power and display capabilities of desktop and laptop computers. Users should be free to choose the methods of viewing e-texts which best suit their needs, and the e-text medium should fully support their decisions.

The PyGE Project will initially concentrate on providing desktop-oriented solutions for viewing and managing e-texts on personal and laptop computers. Other projects provide applications, such as the free Weasel Reader, that are targeted for PDA-oriented solutions.

2. Plain text files

Project Gutenberg releases e-text files primarily in plain text format. While this format guarantees high compatibility with a number of software programs that may be used to view them, it also presents a number of difficulties for programs which may attempt to provide enhanced display options. Some of these difficulties include:

all lines are limited to 80 characters or less in length,
inconsistent presentation makes reliable automatic parsing of contents difficult,
identifying information such as author names and titles can appear in different forms and places within the text,
section markers such as chapter headings are not clearly differentiated from ordinary text,
text formatting information such as font style or size is not included in plain text files,
text is a continuous stream with no pagination,
formatting styles which are important for works such as poetry can be lost or misrepresented.

One of the primary goals of the PyGE project is to overcome the above difficulties and produce visual displays of e-text contents which are both pleasant on the eyes and easy for the average user to control and customize to his or her tastes.
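
As a rough illustration of the first difficulty in the list above, hard-wrapped lines can be rejoined into paragraphs and then re-wrapped to whatever width a display needs. The sketch below assumes that paragraphs are separated by blank lines, which is common but not guaranteed in plain text e-texts. The function name reflow_paragraphs is hypothetical and is not taken from the PyGE sources.

```python
import textwrap


def reflow_paragraphs(text, width=60):
    """Join hard-wrapped lines into paragraphs, then re-wrap them to `width`.

    Assumption: a blank line marks a paragraph boundary.
    """
    paragraphs = []
    current = []
    for line in text.splitlines():
        if line.strip():
            # Non-blank line: part of the paragraph being collected.
            current.append(line.strip())
        elif current:
            # Blank line: close the paragraph collected so far.
            paragraphs.append(" ".join(current))
            current = []
    if current:
        paragraphs.append(" ".join(current))
    return "\n\n".join(textwrap.fill(p, width=width) for p in paragraphs)


sample = (
    "It was the best of times,\n"
    "it was the worst of times.\n"
    "\n"
    "A second paragraph."
)
print(reflow_paragraphs(sample, width=40))
```

Note that this simple approach would damage works such as poetry, where the original line breaks carry meaning, so a real reader would need some way to leave such passages alone.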

3. zTxt files

It was decided that original plain text files from Project Gutenberg had too many limitations to be used as the basis for an enhanced e-text reading experience. That meant a new file format would have to be chosen on which to base further PyGE developments.

When deciding which e-text file format to support in the PyGERS reader program, there were several desirable characteristics we were looking for. These included:

already established e-text format (no desire to reinvent the wheel),
open specification and/or an available Python reader/writer implementation,
support for improved text formatting compared to plain ASCII text files,
support for other useful features.

The zTxt file format, developed by John Gruenenfelder for the Weasel Reader application for Palm devices, fit the requirements nearly perfectly. It provided:

an open file specification,
a format that has been incorporated in existing applications,
implementations that exist in a number of programming languages, including Python,
flowing paragraphs which can adjust to size of display,
support for modifiable bookmarks,
support for modifiable annotations,
reduced file size through text compression,
a direct tie-in to Palm PDA-based e-text reading.
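
The file-size benefit of text compression mentioned in the list above is easy to demonstrate. The sketch below uses Python's standard zlib module on a repetitive sample string. It does not read or write actual zTxt files, and the function name compression_ratio is hypothetical.

```python
import zlib


def compression_ratio(text, level=9):
    """Return compressed size divided by original size for `text`."""
    raw = text.encode("utf-8")
    packed = zlib.compress(raw, level)
    # Round-trip check: decompression must restore the original bytes.
    if zlib.decompress(packed) != raw:
        raise ValueError("round trip failed")
    return len(packed) / len(raw)


# Repetitive sample text; real literary text compresses less dramatically.
sample = "It was the best of times, it was the worst of times. " * 200
print(f"compressed to {compression_ratio(sample):.1%} of original size")
```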

4. Speech output

Speech output seemed to be a natural candidate for inclusion as a feature in the PyGERS reader program. It improves the accessibility of PyGERS, which in turn improves the accessibility of literary works in the form of e-texts such as those from Project Gutenberg. Perhaps just as important, it is one feature which helps to set PyGERS apart from other e-text reader programs.

Providing speech output in PyGERS, which was intended to be portable to many different types of machines, meant that several different speech synthesis engines would have to be supported. The reason was that no single high-quality speech synthesis approach could be found which supported all target platforms.

4.1 Speech synthesis with Festival

While investigating opportunities for supporting text-to-speech synthesis on Linux/Unix systems, the one project which stood out from all other free applications in terms of maturity, portability, and out-of-the-box capability was Festival. Festival includes a complete voice synthesis engine, along with several working voices.

On the downside, the Festival voice synthesizer was originally designed to be a tool for voice synthesis researchers, and not as a product in its own right. As a result, it has limited capabilities for interaction with a controlling program. Its command language is a dialect of Scheme, which many programmers today may be unfamiliar with. And installation of the package requires either compiling the source files or finding already built binary installation files for a target platform, which the authors of Festival do not provide.

If the festival command is detected to be valid on a non-Windows system, speech output will be automatically enabled in PyGERS. This means the festival command should be on the user's executable search path.
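
A check of this kind can be sketched with the standard library, which can report whether a command is on the executable search path. This is only an illustration of the idea and not the detection code used in PyGERS. The function names festival_path and speech_enabled_by_default are hypothetical.

```python
import shutil
import sys


def festival_path(command="festival", search_path=None):
    """Return the full path of `command` if it is on the search path, else None."""
    return shutil.which(command, path=search_path)


def speech_enabled_by_default(platform=sys.platform, command="festival",
                              search_path=None):
    """Enable Festival speech output only on non-Windows systems where the
    command can be found (assumed policy, following the text above)."""
    if platform.startswith("win"):
        return False  # Windows speech output is handled through SAPI instead
    return festival_path(command, search_path) is not None


print("Festival speech enabled:", speech_enabled_by_default())
```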

4.2 Speech synthesis with Microsoft SAPI

Microsoft provides solutions to the problem of controlling speech-related software components with its Speech API (SAPI) for Windows. The problem with Microsoft SAPI is that it has been a moving target over the years. As a result, systems today may be found with no SAPI software installed, with SAPI4 software installed, and/or with SAPI5.1 software installed. In the near future, Microsoft is planning to release yet another version in the form of Microsoft .NET Speech software.

Despite having completely different interfaces, both SAPI4 and SAPI5.1 offer very similar basic speech synthesis capabilities. They even offer similar voices named Mike, Mary and Tom. Since PyGERS supports both interfaces, it will by default enable speech output if either SAPI5.1 or SAPI4 is detected as installed on a Windows system (choosing SAPI5.1 if both are installed).
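
The selection rule described above can be written as a small preference list. The sketch below covers only the choice between interfaces. Actually detecting which SAPI versions are installed is platform specific and is not shown. The function name choose_sapi and the version labels are hypothetical.

```python
# Preferred interface first, following the rule stated in the text.
SAPI_PREFERENCE = ("SAPI5.1", "SAPI4")


def choose_sapi(installed):
    """Pick the speech interface to use from the detected versions.

    `installed` is any iterable of version labels, for example
    ["SAPI4", "SAPI5.1"]. Returns None when no supported version is present,
    in which case speech output would stay disabled.
    """
    available = set(installed)
    for version in SAPI_PREFERENCE:
        if version in available:
            return version
    return None


print(choose_sapi(["SAPI4", "SAPI5.1"]))
print(choose_sapi([]))
```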

If neither of the supported SAPI interfaces is installed, the user can go to the Microsoft web site and download software for either one, albeit with a large difference in the size of the downloads and the disk space required for installation. SAPI4, while considered obsolete and now completely unsupported by Microsoft, can be installed from two files totaling around 2 Mbytes in size and will work fine with PyGERS. SAPI5.1 installation will require downloading an entire SDK (software development kit) that weighs in at a whopping 68 Mbytes.

5. Acquiring data on Project Gutenberg e-texts

During its many years of producing e-text files of copyright-free literary works, Project Gutenberg has focused on generating plain text files that were intended to be read by real people, and not necessarily to be automatically parsed by machines. Over time, differences in the way thousands of volunteers performed their transcriptions, along with a natural evolution in the boilerplate material included with every e-text, resulted in e-texts in which even basic information such as complete author name and book title could not always be reliably detected.

One approach to correctly mapping e-text properties to e-texts is to maintain a separate database of e-texts and their attributes. E-texts can be searched for uniquely identifying marks, such as the Project Gutenberg e-text numbering system, and those marks used to index into the database to retrieve desired information. This approach has the added advantage that the database could form the basis for developing search mechanisms allowing users to search for e-texts meeting user-specified criteria.
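
A minimal sketch of this lookup follows. It assumes the e-text number appears in the file in a bracketed form such as [EBook #1342] or [Etext #1342]. Other forms exist, so a real implementation would need more patterns. The in-memory dictionary stands in for the external database, and the names find_etext_number and lookup_etext are hypothetical.

```python
import re

# Assumed marker forms: "[EBook #1342]", "[Etext #1342]", "[E-text #1342]".
ETEXT_NUMBER = re.compile(r"\[\s*E-?(?:Book|text)\s*#\s*(\d+)\s*\]", re.IGNORECASE)


def find_etext_number(text):
    """Return the first e-text number found in `text`, or None."""
    match = ETEXT_NUMBER.search(text)
    return int(match.group(1)) if match else None


def lookup_etext(text, database):
    """Use the number found in `text` as the key into `database`."""
    number = find_etext_number(text)
    if number is None:
        return None
    return database.get(number)


# Illustrative one-entry database keyed by e-text number.
database = {1342: {"title": "Pride and Prejudice", "author": "Jane Austen"}}
header = "The Project Gutenberg EBook of Pride and Prejudice [EBook #1342]"
print(lookup_etext(header, database))
```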

One difficulty in implementing an external database approach is in obtaining correct and consistent information needed to populate the database. Project Gutenberg maintains several official text files which serve as snapshot listings of available e-texts, but each of these suffers from formatting that is not well suited for computerized reading and from entries that lack complete and consistent contents. An example of such a listing can be found at http://www.ibiblio.org/gutenberg/GUTINDEX.ALL.

The PyGE project attempts to solve the problem by reading web pages from the Project Gutenberg site that contain information about each e-text, parsing the web pages to extract information (from HTML table entries), and writing the information out to a file in XML format. The process of acquiring information from thousands of web pages is fairly slow, requiring an hour or more to complete at times. In addition, this method of data acquisition runs the risk of failing if the location or formatting of the source HTML files ever changes. Fortunately, the majority of available e-text entries are expected to change slowly over time. This means that going through a time-consuming update of database information should not have to be done very often.
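
The parse-and-write step might look roughly like the sketch below, which uses only the Python standard library. The sample HTML, the two-cell row layout, and the XML element names (etexts, etext, field) are all assumptions made for illustration. They are not the actual Project Gutenberg page layout or the schema of the PyGE database file, and downloading the pages is not shown.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


class TableRowParser(HTMLParser):
    """Collect the text of each table cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None  # discard any cell left open by bad markup


def etext_to_xml(number, page_html):
    """Turn two-cell (label, value) table rows into one XML element."""
    parser = TableRowParser()
    parser.feed(page_html)
    etext = ET.Element("etext", number=str(number))
    for row in parser.rows:
        if len(row) == 2 and row[0]:
            field = ET.SubElement(etext, "field", name=row[0].rstrip(":"))
            field.text = row[1]
    return etext


# Hypothetical page fragment standing in for a downloaded web page.
page = """
<table>
  <tr><th>Author</th><td>Austen, Jane</td></tr>
  <tr><th>Title</th><td>Pride and Prejudice</td></tr>
</table>
"""
root = ET.Element("etexts")
root.append(etext_to_xml(1342, page))
print(ET.tostring(root, encoding="unicode"))
```

Because the parser depends on the table layout of the source pages, it illustrates the fragility noted above: a change in page formatting would silently produce missing fields.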

On machines with an internet connection, a new database file can be created within PyGETS by invoking the Acquire... function located under the File menu. An existing database can be updated after loading into PyGETS by invoking the Update... function located under the File menu. Updating scans only those pages that have not been seen before, and appends the new information it finds to the current information.
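
The incremental behavior of an update can be sketched as follows. This is a simplified model, not the PyGETS implementation: the database is a plain dictionary keyed by e-text number, and fetch_record is a hypothetical callable that stands in for downloading and parsing one web page.

```python
def update_database(database, candidate_numbers, fetch_record):
    """Add records for e-text numbers that are not yet in `database`.

    fetch_record(number) should return a record, or None if nothing was
    found. Entries already present are skipped without fetching, which is
    what keeps an update much faster than a full acquisition.
    Returns the list of numbers that were newly added.
    """
    added = []
    for number in candidate_numbers:
        if number in database:
            continue  # seen before: no need to scan this page again
        record = fetch_record(number)
        if record is not None:
            database[number] = record
            added.append(number)
    return added


# Illustrative data: entry 1 is already known, entry 2 is new, 3 is missing.
database = {1: {"title": "Known work"}}
remote = {2: {"title": "New work"}}
print(update_database(database, [1, 2, 3], remote.get))
```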

To help users get started quickly using PyGE applications, distributions of PyGE include a snapshot XML database file named gutenberg.xml in the SampleData directory. Users can open the sample file in the PyGETS application and use it as the basis for most search operations. Using the sample database file as a starting point, current e-text information can be quickly obtained with the Update command described above to cover e-texts that have been released since the sample file was created.

6. Building distribution files

PyGE software is intended to be cross-platform and portable to many different computing platforms. To accomplish this, it is distributed in both source form and in packaged binary formats.

Source distributions are smaller in size and offer the best portability, but require that various software components such as a Python interpreter and the wxPython package already be installed on machines before using the software. Source distributions are more appropriate for experienced users and software developers interested in studying how software is written.

Binary distributions are targeted at specific platforms, such as Microsoft Windows or Linux running on Intel-compatible processors. They support standard installation procedures and have fewer software prerequisites, because most software components needed to run the software are included with the distribution. Binary distributions are more appropriate for the typical computer user or system administrator who wants a simple way to install and use software.

Building both source and binary distribution packages for PyGE with the intention of making the software as widely available as possible requires that a number of steps be followed. The latest instructions for building PyGE distribution files can be found in this BUILD file.

Last modified:
Sun Aug 24 23:29:06 PST 2003