Advanced topic: Improving the PDF

How the PDF is built
General repair scheme
Common problems and their solutions
XSL-FO references

Caution

Some of the topics covered in this section apply to the Ant build and old versions of Apache FOP. They may no longer apply when using the Gradle build, or the files may have a different path there. When in doubt, ask for help on the firebird-docs mailing list.

Due to limitations in our build tools, the PDF output may suffer from some irritating defects, such as:

Widowed headers and titles (appearing at the bottom of the page, with the corresponding text block starting on the next page).
Page breaks at awkward positions in tables or lists.
Overly wide horizontal justification spaces.
Squeezed, truncated or otherwise messed-up page-sized content. This is a new feature, introduced with FOP 0.93.

This part shows you how to deal with these problems, should the need arise.

How the PDF is built

First you have to understand how the PDF is built. Contrary to the HTML generation, this is a two-step process:

The DocBook XML source is converted to a Formatting Objects (FO) file. FO – formally called XSL-FO – is also an XML format, but unlike DocBook it's presentation-oriented. This step is performed by an XSL transformer called Saxon. The output goes into firebird-documentation/inter/filename.fo.
Another tool, Apache FOP (Formatting Objects Processor), then picks up filename.fo and converts it to filename.pdf, which is stored in firebird-documentation/dist/pdf.

If you give a build pdf command, two consecutive build targets are called internally: fo and fo2pdf, corresponding to the two steps described above. You can also call each of them separately from the command line. For instance,

build fo -Did=qsg15

...transforms the 1.5 Quick Start Guide source to firebird-documentation/inter/qsg15.fo. And

build fo2pdf -Did=qsg15

...produces the PDF from the FO file (which must of course be present for this step to succeed).

In fact, build pdf is just a shortcut for build fo followed by build fo2pdf.

This setup allows us to edit the FO file manually before generating the final PDF. And that's exactly what we're going to do to fix some of those nasty problems that can spoil our PDFs.

General repair scheme

The general procedure for improving the PDF output by editing the FO file is:

Build the PDF once as usual with build pdf [arguments].
Start reading the PDF and find the first trouble spot.
Open the FO file in an XML or text editor.
Find the location in the FO file that corresponds to the trouble spot in the PDF (we'll show you how later).
Edit the FO file to fix the problem (we'll show you how later), and save it.
Rebuild the PDF, but this time use build fo2pdf [arguments]. If you use build pdf instead, you'll overwrite the changes you've just made to the FO file, get the same PDF as before, and have to start all over again.
Check if the problem is really solved and if so, find the next trouble spot in the PDF.
Repeat steps 4–7 until you've worked your way through the entire PDF.

Notes

Although this FO-editing approach suggests that the problem lies in the FO file, this is not the case. The FO file is all right, but Apache FOP doesn't support all the nice features in the XSL-FO specification (yet). With our manual editing, we force the PDF in a certain direction.
It is important to fix the problems in document order. Editing the FO in one spot may lead to vertical adjustments at the corresponding spot in the PDF: more lines, fewer lines, lines moving to the following page, etc. These adjustments may affect everything that comes after that spot.

For the same reason, you should always look for the next problem after you have fixed the previous one. For instance, don't make a list of all widowed headers in the PDF and then start fixing them all in the FO file. Fixing a widowed header moves all the text below it downward, possibly creating new widowed headers and un-widowing others.
In general, you can keep the FO file open throughout the process. Just don't forget to save your changes before you rebuild the PDF. You must close the PDF before every rebuild, though: once it's open in Adobe Acrobat (or even in Adobe Reader), other processes can't write to it.
The entire process can be pretty time-consuming, so don't try to fix every tiny little imperfection, especially if you're a beginning FO hacker. In general, only the widowed headers are really ugly and make the document look very unprofessional. Fortunately, they have become very rare since we've moved to FOP 0.93.

The next section deals with the various problems and how to solve them.

Common problems and their solutions

Widowed headers

Problem: Headers or titles at the bottom of the page.

Cause: Apache FOP doesn't support the keep-with-next attribute everywhere.

Note

Since we've upgraded to Apache FOP 0.93, this problem – which used to be our biggest annoyance – has become extremely rare. Yet it may still occur under some circumstances. More generally, there may be a page break you find awkward, e.g. after a line that announces what's to come and ends with a colon. This section helps you solve such cases.

Note that the example used here – a widowed section header – shouldn't occur anymore, but it's still usable to demonstrate the steps you have to take, especially for elements with an id attribute.

Solution: Force a page break at the start of the element (often a list, list item or table) that the title or header belongs to.

How: If the element has an id attribute (you can see this in the DocBook source), do a search on the id in the FO file. For example, suppose that you've just built the Firebird 2 Quick Start Guide and you find that the title Creating a database using isql is positioned at the bottom of a page. In the DocBook XML source you can see that this is the title of a section whose id is qsg2-databases-creating. If you search on qsg2-databases-creating from the top of the file, your first hit will probably look like this:

<fo:bookmark starting-state="hide"
             internal-destination="qsg2-databases-creating">

The fo:bookmark elements correspond to the links in the navigation frame on the left side of the PDF. So this is not yet the section itself; you'll have to look further. Next find:

<fo:block text-align-last="justify" end-indent="24pt"
          last-line-end-indent="-24pt"><fo:inline
   keep-with-next.within-line="always"><fo:basic-link
   internal-destination="qsg2-databases-creating">Creating a database...

Here, the id is an attribute value in a fo:basic-link. We're in the Table of Contents now. Still not there.

The third and fourth finds are often a couple of lines below the second; they serve to create a link from the page number citation in the ToC. But the fifth is usually the one we're looking for (unless there are any more forward links to the section in question):

<fo:block id="qsg2-databases-creating">

That's it! Most mid- and low-level hierarchical elements in DocBook (preface, section, appendix, para etc.) wind up as a fo:block in the FO file. Now we have to tell Apache FOP that it must start this section on a new page. Edit the line like this:

<fo:block id="qsg2-databases-creating" break-before="page">

Save the change and rebuild the PDF (remember: use build fo2pdf, not build pdf). The section title will now appear at the top of the following page, and you can move on to the next problem.
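If you have to make this edit more than once, it can be scripted. The following is a minimal Python sketch and not part of the Firebird build: the function name add_page_break and the sample FO text are made up for illustration, and it assumes that no attribute value in the start tag contains a > character. Because it only matches a real id attribute, it skips the fo:bookmark, fo:basic-link and page number citation hits described above.

```python
import re


def add_page_break(fo_text: str, element_id: str) -> str:
    """Add break-before="page" to the start tag that carries id=element_id."""
    # \sid= requires whitespace before "id", so internal-destination="..."
    # and ref-id="..." (the bookmark and ToC hits) are not matched.
    tag_re = re.compile(
        r'(<fo:[\w-]+\b[^<>]*?\sid="' + re.escape(element_id) + r'")([^<>]*>)'
    )
    match = tag_re.search(fo_text)
    if match is None:
        raise ValueError(f"no FO element with id {element_id!r}")
    if "break-before=" in match.group(0):
        return fo_text  # the tag already has a break; leave the text alone
    # Insert the attribute directly after the id attribute.
    return (
        fo_text[: match.end(1)]
        + ' break-before="page"'
        + fo_text[match.end(1):]
    )


# Made-up sample in the shape of the search hits described above.
SAMPLE = (
    '<fo:bookmark starting-state="hide"\n'
    '             internal-destination="qsg2-databases-creating">\n'
    '<fo:basic-link internal-destination="qsg2-databases-creating">\n'
    '<fo:page-number-citation ref-id="qsg2-databases-creating"/>\n'
    '<fo:block id="qsg2-databases-creating">\n'
)

print(add_page_break(SAMPLE, "qsg2-databases-creating"))
```

To use it on a real document, read the FO file into a string, pass it through the function, write the result back, and then rebuild with build fo2pdf as usual.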

When there is no DocBook ID

What if the element has no DocBook id? You'll have to search on (part of) the title/header then. This is a bit trickier, because the title may contain a line break in the FO file, in which case it won't be found. Or the title element has one or more children of its own (e.g. quote or emphasis). This too will keep you from finding it if you search on the full title. On the other hand: the more you shrink the search term, the higher the probability that you will get a number of unrelated hits. You'll have to use your own judgement here; if there is some characteristic text shortly before or after the title you can also search on that, and try to locate the title in the lines above and below it.
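The line-break part of this problem can be sidestepped with a search that treats any run of whitespace, including a newline plus indentation, as a single space. Below is a small Python sketch of that idea; find_fragment and the sample text are illustrative names and data, not something the build provides. It does not help when the title contains child elements, since the markup then sits between the words.

```python
import re


def find_fragment(fo_text: str, fragment: str) -> list[int]:
    """Return the 1-based line numbers where fragment starts.

    Whitespace between the words of fragment matches any run of
    whitespace in the FO text, so a title that was wrapped onto two
    lines is still found.
    """
    words = fragment.split()
    pattern = re.compile(r"\s+".join(re.escape(word) for word in words))
    return [
        fo_text.count("\n", 0, match.start()) + 1
        for match in pattern.finditer(fo_text)
    ]


# Made-up sample: the title is split over two lines.
SAMPLE = (
    '<fo:block id="d0e2340">\n'
    '  <fo:block font-size="11pt">The DISTINCT keyword\n'
    '    comes to the rescue!</fo:block>\n'
    '</fo:block>\n'
)

print(find_fragment(SAMPLE, "DISTINCT keyword comes to the rescue"))
```

A plain text search for the same phrase would find nothing in this sample, because of the newline after "keyword".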

No matter how, once you've found the title, go upward in the FO file until you find the beginning of the section – often identifiable by the auto-generated FO id:

</fo:block>
<fo:block id="d0e2340">
  <fo:block>
    <fo:block>
      <fo:block keep-together="always" margin-left="0pc"
                font-family="sans-serif,Symbol,ZapfDingbats">
        <fo:block keep-with-next.within-column="always">
          <fo:block font-family="sans-serif" font-weight="bold"
                    keep-with-next.within-column="always"
                    space-before.minimum="0.8em" space-before.optimum...
                    space-before.maximum="1.2em" color="#404090" hyph...
                    text-align="start">
            <fo:block font-size="11pt" font-style="italic"
                      space-before.minimum="0.88em" space-before.opti...
                      space-before.maximum="1.32em">The DISTINCT keyword
              comes to the rescue!</fo:block>

As you see, there may be quite a number of lines between the section start and the title text. Notice, by the way, how the title is split over two lines here.

Once you've found the fo:block that corresponds to the section start, give it a break-before="page" attribute just like we did before.
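Walking upward to the section start can also be sketched in Python. The helper below is hypothetical (its name and the sample are invented here): it reports the nearest fo:block start tag with an id that precedes a given text fragment. It is only a heuristic, because the nearest id above the title is not guaranteed to belong to the enclosing section, so check the reported line in your editor before changing anything.

```python
import re

# A fo:block start tag that carries an id attribute.
ID_TAG = re.compile(r'<fo:block\b[^<>]*?\sid="([^"]+)"[^<>]*>')


def nearest_id_block_above(fo_text: str, fragment: str):
    """Return (id, line number) of the last id-bearing fo:block start tag
    before the first occurrence of fragment, or None if there is none."""
    pos = fo_text.find(fragment)
    if pos < 0:
        return None
    last = None
    # Only scan the text that precedes the fragment.
    for match in ID_TAG.finditer(fo_text, 0, pos):
        last = match
    if last is None:
        return None
    line = fo_text.count("\n", 0, last.start()) + 1
    return last.group(1), line


# Made-up sample, shortened from the excerpt above.
SAMPLE = (
    '</fo:block>\n'
    '<fo:block id="d0e2340">\n'
    '  <fo:block>\n'
    '    <fo:block font-size="11pt" font-style="italic">The DISTINCT keyword\n'
    '      comes to the rescue!</fo:block>\n'
)

print(nearest_id_block_above(SAMPLE, "The DISTINCT keyword"))
```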

Why look for the section start and not apply the break-before attribute to the fo:block immediately enclosing the title? Well, doing the latter will print the title on the next page all right, but links from the Outline and the ToC will point to the previous page, because the “invisible” section start – the block tag bearing the ID – lies before the page break.

As said, the widowed header problem shouldn't occur anymore with sections, but it might still happen to some other objects like tables, figures etc. for which the stylesheets generate ids if you haven't assigned them yourself. In all those cases you can use the approach described above.

There are also numerous DocBook elements – in fact, the majority – for which the stylesheets don't generate ids. Examples are para, informaltable, the various list types, etc. In those cases, once you have located the text fragment in the FO file, simply apply the break-before attribute to the nearest enclosing fo:block.

Split table rows or list items

Problem: Table rows or list items split across page boundaries. (DocBook lists are implemented as fo:tables.)

Cause: Nothing in particular – there's no rule that forbids page breaks from occurring within table rows.

Solution: If you want to keep the row together, insert a hard page break at the start of the row.

How: Find the row by searching on text at the beginning of the row or at the end of the previous row. The element you're looking for is a fo:table-row, but don't use that for a search term, because many DocBook elements (not only <table>s) are implemented using fo:tables and thus contain fo:table-rows.

Once the start of the split row is found, add a break-before attribute like you did with widowed headers:

<fo:table-row break-before="page">

Alternatively, you can give the previous row a break-after attribute.
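The row fix can be sketched in Python along the same lines. The function break_before_row and the sample table are invented for illustration. It assumes that the text you search on lies directly inside the row you want to move; if the row contains a nested table (a list, for instance), the nearest fo:table-row before the text may be the inner one.

```python
def break_before_row(fo_text: str, row_text: str) -> str:
    """Add break-before="page" to the fo:table-row that starts most
    closely before the first occurrence of row_text."""
    text_pos = fo_text.find(row_text)
    if text_pos < 0:
        raise ValueError(f"text {row_text!r} not found")
    # Search backward from the text for the start of a row.
    row_pos = fo_text.rfind("<fo:table-row", 0, text_pos)
    if row_pos < 0:
        raise ValueError("no fo:table-row before the text")
    insert_at = row_pos + len("<fo:table-row")
    return fo_text[:insert_at] + ' break-before="page"' + fo_text[insert_at:]


# Made-up two-row table body.
SAMPLE = (
    "<fo:table-body>\n"
    "<fo:table-row><fo:table-cell><fo:block>First row</fo:block>"
    "</fo:table-cell></fo:table-row>\n"
    "<fo:table-row><fo:table-cell><fo:block>Second row</fo:block>"
    "</fo:table-cell></fo:table-row>\n"
    "</fo:table-body>\n"
)

print(break_before_row(SAMPLE, "Second row"))
```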

Overly wide horizontal spaces

Problem: Very large horizontal justification spaces on lines above a long spaceless string. These long strings are often printed in a monospaced (fixed-width) font.

Cause: Apache FOP often doesn't hyphenate these strings. Therefore, if the string doesn't fit on the line, it must be moved to the next line as a whole. This leaves the previous line with “too little” text, making large justification spaces necessary. Note that the large spaces on a line are caused by the long string on the line below it, not by anything on the line itself.

Solution: You may have good reasons to leave the string unbroken. In that case, accept the wide spaces as a consequence. Otherwise, insert a space (or hyphen-plus-space) at the point where the string should be broken.

How: First find the string in the FO file by searching on (part of) its contents. If it's monospaced in the PDF, you'll almost always find it within a fo:inline element. Then look at the PDF and estimate how much of the as yet unbroken string would fit in the large whitespace on the line above. Back in the FO file, insert a space – possibly preceded by a hyphen – in the string at a location where it's acceptable to break it. Rebuild the PDF (with build fo2pdf!) and check the result. If you've broken the string too far to the right, it will still be entirely on the next line. Too far to the left, and the whitespace may still be too wide for your liking. Adjust and rebuild until you're satisfied.

One surprise you may get during this job is that, once you've broken the string in one place, Apache FOP suddenly decides that it's OK to hyphenate the rest of the string. This will leave you with a part of the string on the first line that contains your own (now erroneous) space but also extends beyond it. You'll now have to delete your space and break the string again at the spot chosen by Apache.

Inserting zero-width spaces

An alternative approach to the wide-spaces problem is to insert zero-width space characters at each and every point where the culprit string may be broken, leaving it to Apache FOP to work out which one is best suited. This is guaranteed to work at the first try, but:

it's only feasible if you have an editor that lets you insert ZWSPs easily;
you can only do this in places where it's OK to break the string without a hyphen.
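If your editor has no convenient way to type a ZWSP, you can write it as the XML character reference &#x200B;, which an XML parser reads as the same character. The Python sketch below generates such a string; add_break_points and its default set of break characters are assumptions made for this example, so choose the characters that suit the string at hand.

```python
ZWSP_REF = "&#x200B;"  # XML character reference for U+200B ZERO WIDTH SPACE


def add_break_points(text: str, after: str = "/._") -> str:
    """Insert a zero-width space reference after each character of text
    that occurs in `after`, except at the very end of the string."""
    out = []
    last_index = len(text) - 1
    for index, char in enumerate(text):
        out.append(char)
        if char in after and index != last_index:
            out.append(ZWSP_REF)
    return "".join(out)


print(add_break_points("firebird-documentation/inter/filename.fo", after="/."))
# prints firebird-documentation/&#x200B;inter/&#x200B;filename.&#x200B;fo
```

Paste the result over the original string in the FO file. The hyphen is deliberately left out of the break characters in this call, which is a choice made for the example; add it if you want to allow breaks there too.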

Squeezed, truncated or otherwise messed-up page-sized content

Problem: Tables, figures or other formal objects are truncated or some parts are printed on top of others.

Cause: Formal objects are given a keep-together.within-page="always" attribute by the stylesheets. As of FOP 0.93, this attribute is always enforced, even if the object is too large to fit on a page. The result: wrecked content that is crammed together on one page.

Solution: There are three alternatives. 1: Use the corresponding informal DocBook element instead. 2: Insert a processing instruction in the DocBook source. 3: Remove the attribute from the FO.

How: Two solutions are applied to the DocBook source; the third involves editing the FO file:

If you don't mind leaving the element titleless, use informalequation / informalexample / informalfigure / informaltable instead of their formal counterparts equation, example, figure and table. These elements don't get the keep-together attribute during transformation, so they will be page-broken as necessary.
If it concerns a table and you want to keep the title, insert a processing instruction like this:
```
<table frame="all" id="ufb-about-tbl-features">
  <?dbfo keep-together='auto'?>
  <title>Summary of features</title>
  ...
  (table content...)
  ...
</table>
```
Adding the instruction if you're working in the source text is easy enough. With XMLMind, it's a bit laborious:
1. Place the cursor somewhere in the title or select the entire title element.
2. Choose Edit -> Processing Instruction -> Insert Processing Instruction Before from the menu. A green line will appear above the title.
3. Type keep-together='auto' on that line.
4. With the cursor still on the green line, choose Edit -> Processing Instruction -> Change Processing Instruction Target from the menu. A dialogue box pops up.
5. In the dialogue box, change target to dbfo and click OK.
By the way: you can do the opposite with an informaltable if you absolutely don't want it broken at page borders. The procedure is the same, except that you must specify always instead of auto. Be sure that the informaltable does fit on one page, though!

We don't have a similar provision for the other formal objects because we probably don't need it. (Things like this require work on our custom stylesheets, so we only implement them if we really feel the need.)
Ye olde fo-hacking way... open the FO file, locate the element (tip: give it an id in the DocBook source so it's easy to find) and remove the keep-together.within-page="always" attribute. A disadvantage is that this procedure has to be repeated every time the source changes and a new PDF is built. The other two solutions are persistent.
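Since the FO edit has to be redone after every full build, it may be worth scripting. The Python sketch below is illustrative only: drop_keep_together is a made-up name, and it assumes that the keep-together.within-page attribute sits on the same start tag as the id you gave the element. Verify that in your FO file first; if the attribute is on a different tag, the function leaves the text unchanged.

```python
import re


def drop_keep_together(fo_text: str, element_id: str) -> str:
    """Remove keep-together.within-page="always" from the start tag
    that carries id=element_id."""
    tag_re = re.compile(
        r'<fo:[\w-]+\b[^<>]*?\sid="' + re.escape(element_id) + r'"[^<>]*>'
    )
    match = tag_re.search(fo_text)
    if match is None:
        raise ValueError(f"no FO element with id {element_id!r}")
    # Strip the attribute together with the whitespace in front of it.
    new_tag = re.sub(
        r'\s+keep-together\.within-page="always"', "", match.group(0)
    )
    return fo_text[: match.start()] + new_tag + fo_text[match.end():]


# Made-up start tag; the other attribute is a placeholder.
SAMPLE = (
    '<fo:block id="ufb-about-tbl-features"\n'
    '          keep-together.within-page="always" space-before="1em">\n'
)

print(drop_keep_together(SAMPLE, "ufb-about-tbl-features"))
```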

XSL-FO references

The official XSL-FO (Formatting Objects) page is here: http://www.w3.org/TR/xsl/

The Apache FOP homepage is here: http://xmlgraphics.apache.org/fop/

The Apache FOP compliance page is here: http://xmlgraphics.apache.org/fop/compliance.html. It contains a large object support table where you can look up which XSL-FO objects and attributes (properties) are supported. When consulting the table, please bear in mind that we currently use Apache FOP 0.93 (but with some home-made patches).