2005-11-30T11:18:36Z
Dave Pawson.
Subsetting Docbook
Norm presented on docbook at XML 2005. Although I've been an admirer and user of docbook for some time, I've never clearly understood the rationale behind the groupings of elements, from which the filenames are derived. Norm has added some clear explanations in this area. Since I recently posted a first attempt at subsetting docbook (see the docbook-apps archive), which I found relatively easy, I'm trying here to extract from Norm's experience information to help with selecting from, and adding to, the version 5 docbook schemas.
I found these revealing; they may help you.
Must allow the generation of a (non-normative) XML DTD and W3C XML Schema.
The impact of this comes from knowing that DTDs and XSD are both less powerful than RELAX NG. The requirement may have constrained the schema, and knowing about it may help you realise why the schema looks the way it does.
A few simple datatypes are used.
Note the 'few'. I soon realised how few and wondered why.
Uses XLink; allows most inlines to be link elements.
I note this. I haven't figured out the implications, yet.
Accessibility has been improved with the introduction of alt and annotation tags.
I like this. The two elements are widely available; in a subset, perhaps you could keep that high availability to retain the accessibility of docbook.
This was a 'duh' moment for me.
There are two main classes of elements in DocBook.
Hierarchy elements provide gross structure
Information Pool elements provide prose markup
The information pool could be reused in a new hierarchy
Conversely, the hierarchy could be preserved with a new technical vocabulary
These resonate with the dbhierx.mod and
dbpoolx.mod files from revision 4 of docbook. In
particular, the pool of elements (on the following slide) covers inlines,
example, figure, table, graphics, verbatim, admonitions and lists. I
can now view them as a pool from which I can draw selected
elements as needed.
The next level down starts to make sense too.
Inlines are grouped into technical, error related, programming, product related, OS related, markup related, bibliographic related, publishing related, graphic related, keyboard related, indexing related, gui related and link related. 100 in total! Seeing them grouped like this begins to help me. For instance, in the rng schema I see,
<define name="db.general.inlines">
  <choice>
    <ref name="db.publishing.inlines"/>
    <ref name="db.product.inlines"/>
    <ref name="db.bibliography.inlines"/>
    <ref name="db.graphic.inlines"/>
    <ref name="db.indexing.inlines"/>
    <ref name="db.link.inlines"/>
  </choice>
</define>
Note the similarity? No, I don't think it's a coincidence
either! Given the ideas, the schema will provide the content
models, and the names can be used to select or add to new
definitions. The strings aren't hard to find. For example, the
OS related group turns out to be db.os.inlines.
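For anyone reading the compact syntax instead, the same definition would look roughly like this. It is a direct transliteration of the XML fragment above rather than a copy from the distributed .rnc file, so treat the layout as a sketch.

```rnc
# Compact-syntax transliteration of the XML fragment above
db.general.inlines =
  db.publishing.inlines
  | db.product.inlines
  | db.bibliography.inlines
  | db.graphic.inlines
  | db.indexing.inlines
  | db.link.inlines
```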
I found the statement on paragraph variants edifying: there is the basic para element and two variants. formalpara has a title, and simpara holds only inlines, as opposed to also containing other block elements (a simplification I had implemented myself, clearly never having used simpara).
I've heard Norm talking about 'formal' blocks before, which basically are expected to have a title child. These are example, figure, table, and equation. The ones without a title being... you guessed it - informalexample, informalfigure, informaltable, and informalequation.
HTML dumped the simple img element on us. Docbook is far more elegant, though again I must admit I've learned a formula and used it, rather than understanding it. Norm makes it crystal clear with
The objects inside a mediaobject (or inlinemediaobject) are alternatives. The processor must choose exactly one. Either a textobject containing a single phrase or an alt element (in DocBook V5.0) can be used for alternate text for accessibility. A textobject may be used as the long description for accessibility.
I'm afraid I really mixed up the presentation and markup in my head. I'm sure you were clear on this.
What docbook refers to as verbatim environments is
clarified as programlisting, screen, literallayout and
address. All preserve the whitespace of the
source.
What docbook calls 'admonitions', and I call warnings, are
identified as containing note, tip, important, caution,
and warning. Nice to be clear about the full list,
even if it enables me to remove one or all of them.
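Following the naming convention, trimming the admonitions with the override mechanism described further down might look like the sketch below. The pattern name db.admonition.blocks and the filename docbook.rnc are assumptions; check both against your copy of the v5 schema.

```rnc
# Sketch only: db.admonition.blocks and docbook.rnc are assumed names
include "docbook.rnc" {
  # dropped: db.tip db.important db.caution
  db.admonition.blocks =
    db.note
    | db.warning
}
```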
The descriptions go on. Six list types (I knew that) etc.
The so-called special purpose markup I find selectively useful, perhaps a target ripe for subclassing? FAQs, function synopses, OO programming, message sets, EBNF diagrams, MathML and SVG. I doubt many people use them regularly, and even those who do will probably want only one of them. The groupings enable a clear focus for subsetting.
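One way to cull a special purpose element outright, rather than editing every choice that mentions it, is RELAX NG's notAllowed pattern. This is a different technique from the choice redefinition shown later on this page, and the pattern name db.msgset and the filename docbook.rnc are assumptions to check against the v5 schema.

```rnc
# Sketch only: db.msgset and docbook.rnc are assumed names
include "docbook.rnc" {
  # notAllowed never matches, so any choice that references
  # db.msgset simply loses that branch.
  db.msgset = notAllowed
}
```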
Remember dbhier.mod? I've used book, article and ... that's
it. Nice to see the full list of Set, Book, Part, Reference,
Preface, Chapter, Appendix, Bibliography, Glossary, Index, Article,
Section, Sect1, SimpleSect, RefEntry and RefSect1. Quite a
list, some of which I'm highly unlikely to use. Again, nice to have
them gathered for the cull.
Norm, again, started me thinking about subsetting, after I had read a thread on the docbook-apps list about people starting to develop a docbook-tiny. I've read and re-read chapter five of TDG, and basically admitted that although I could probably interpret a subsetted or extended docbook version 4 sensibly, I felt incompetent to attempt it myself without some serious motivation. v5, however, gave me new hope.
I use emacs as my XML editor of choice, and I've often thought it annoying to be presented with 153 elements to choose from for the contents of a paragraph. Due to the work involved, I'd never actually done anything about it (perhaps it wasn't such an annoyance then). Seeing how simple the new version made it look, I had a go.
The other incentive was something that has puzzled me for some time. Many people who start with RELAX NG choose the compact syntax as their syntax of choice. I've always been happier with the XML syntax (since emacs helps me with that and jing error messages are .... terse, shall we say). This gave me an opportunity to become a little more familiar with the compact syntax. Once I became familiar with the idiom, I soon progressed at quite a rate using an almost random slash and burn approach, with the prompts from nxml-mode in emacs as the judge of my success. I'd spot an element I thought I'd never used, seek it out in the schema, and add it to the customization layer.
Not unlike XSLT, RELAX NG lets the including schema override definitions in an included schema (see Eric's book). As was probably planned, this allows for, and in fact is an essential part of, the subsetting or extension mechanism.
A simple example of this is shown below.
div {
  ##
  ## db.list.blocks reductions. Removed: db.procedure db.variablelist db.segmentedlist
  ## db.glosslist db.bibliolist db.calloutlist db.qandaset
  db.list.blocks =
    db.itemizedlist
    | db.orderedlist
    | db.simplelist
}
The comments record what the original definition of db.list.blocks
contained beyond the three survivors, and the re-definition sits
below them. I've knocked out procedure,
variablelist, segmentedlist, glosslist, bibliolist, calloutlist and
qandaset. All the removals are done in this way. Simple and
very elegant.
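A re-definition only counts as an override when it sits inside the include that pulls in the base schema; placed outside, it would clash with the included definition. A minimal sketch of a complete customization file follows, assuming the v5 schema is saved alongside it as docbook.rnc (the filename is an assumption, as is the name of the customization file in the comment).

```rnc
# Hypothetical customization file, e.g. mydocbook.rnc
include "docbook.rnc" {
  div {
    ## db.list.blocks cut down to three list types
    db.list.blocks =
      db.itemizedlist
      | db.orderedlist
      | db.simplelist
  }
}
```

An editor or validator is then pointed at the customization file rather than at docbook.rnc itself.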
The addition mechanism is .... identical! Boring? Yes. Complex? No. Utilitarian? Yes. Liable to errors? Probably; like many simple things, it may be error prone. For serious use I'd suggest that time spent with the schema, The Definitive Guide and a few notes would reap serious benefits.
For me, the combination of Norm's slides and a new look at the naming conventions (probably too obvious to warrant documenting, in the OASIS committee's view) has increased my insight into docbook, increased my liking for it, and left me further in debt to the docbook group. Thanks.
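As an illustration of an addition, the sketch below re-defines the same pattern with one extra branch. Everything prefixed my. or my: is hypothetical (the tasklist element, its namespace URI and the pattern name), the filename docbook.rnc is assumed, and db.listitem is taken to be the schema's pattern for listitem, which should be checked.

```rnc
# Hypothetical namespace for the new element
namespace my = "http://example.org/ns/my"

include "docbook.rnc" {
  db.list.blocks =
    db.itemizedlist
    | db.orderedlist
    | db.simplelist
    | my.tasklist
}

# Hypothetical new element, not part of docbook;
# it reuses the existing listitem content model.
my.tasklist = element my:tasklist { db.listitem+ }
```

The new pattern is defined outside the include because it does not exist in the base schema; only re-definitions of existing patterns belong inside the braces.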
Keywords: docbook