Chapter 3. Xproc connections

Revision History
Revision 0.1 2008-12-07T16:01:02Z Dave Pawson
Initial Issue
Revision 0.2 2008-12-14T08:44:23Z Dave Pawson
Additions on general connections

Table of Contents

Connections in Xproc
Explicit external connections
pipeline or declare-step?
Internal connections
Bridging a gap
Default connections
Ports and their defaults

Connections in Xproc

Having established that Xproc is all about steps and connecting them, it is worth spending time understanding how those connections are made, what defaults and implied connections exist, and how the implementor fits in with external connections.

James puts it like this

For Xproc XML data "flows" into a pipeline and between its steps through a series of connected ports. The two most common and important ports are the primary input port (usually called "source") and the primary output port (usually called "result"). That sets up the model to use when thinking about XML data (not binary, not anything else, this is an XML pipeline) and its processing by the steps of your pipeline

It is likely, though not required, that any implementor will enable the primary input and output to be specified via the interface to the program, whether this is a command line parameter or via a GUI. That will provide an input and output connection at the two outer edges. Be aware, though, that these ports are named source and result.

Calabash does this using the -i and -o switches. For example, to pass the document input.xml to the pipeline as its primary input, the switch would be


  $ calabash.sh -i source=input.xml somepipeline.xpl

This would make an association between that input file and the input port whose name is source.

Similarly, if the final output of a pipeline is required to be written to some file named output.xml, then for Calabash the command line might be


  $ calabash.sh -i source=input.xml -o result=output.xml somepipeline.xpl

For those used to Unix terms, stdin becomes the default source for the main input, and the final output is delivered to stdout. These are the defaults unless you mess with them. That seems to make sense and is implied, i.e. you don't need to specify it in your pipeline.

Explicit external connections

If the requirement is to specify absolutely, within the pipeline document, the external input and output documents, another technique must be used. For the primary input (the equivalent of stdin) use

  <p:declare-step xmlns:p="http://www.w3.org/ns/xproc">
    <p:input port="source">
      <p:document href="doc4.xml"/>
    </p:input>

For this step, this declares that input will come from an external document (doc4.xml). However, don't be tempted to do the same for output! To specify op.xml as the output of the pipeline, use

   <p:store href="op.xml"  />

No, it isn't obvious, is it? It will become clearer as you gain more experience of using Xproc, honest!

Important

p:input and p:output are not symmetrical! Just keep that in mind!

Example 3.1 shows this in use for an identity step.

Example 3.1. Explicit input and output

  <?xml version="1.0"?>
  <p:declare-step
       xmlns:p="http://www.w3.org/ns/xproc">
    <p:input port="source">
      <p:document href="doc4.xml"/>              <!-- 1 -->
    </p:input>
    <p:identity/>                                <!-- 2 -->
    <p:store href="op.xml"/>                     <!-- 3 -->
  </p:declare-step>

1

This explicitly defines the external source

2

The identity step must sit between the input and output specifications

3

This explicitly defines the target for the pipeline output


That is how connections are specified between the pipeline and the environment. Remember that the href attribute can take a URI, so you can use this to draw on data from the internet.
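As a minimal sketch of that last point, the pipeline below is Example 3.1 with the local file swapped for a remote document. The URL is purely hypothetical (example.org is a placeholder), and whether remote retrieval works will depend on your processor and network setup.

```xml
<?xml version="1.0"?>
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc">
  <p:input port="source">
    <!-- Hypothetical URL: replace with a document you can actually reach -->
    <p:document href="http://www.example.org/data/doc4.xml"/>
  </p:input>
  <!-- Pass the fetched document through unchanged -->
  <p:identity/>
  <!-- Write the result to a local file, as in Example 3.1 -->
  <p:store href="op.xml"/>
</p:declare-step>
```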

pipeline or declare-step?

The CR says

All p:pipeline pipelines have an implicit primary input port named “source” and an implicit primary output port named “result”. Any input or output ports that the p:pipeline declares explicitly are in addition to those ports and may not be declared primary.

Whereas, for a p:declare-step element,

A p:declare-step provides the type and signature of an atomic step or pipeline. It declares the inputs, outputs, and options for all steps of that type.

I found the implications of this rather subtle. Since, as a very general rule, we can use either, when should we choose one over the other?

Note that the former has implicit input and output ports, whereas declare-step may declare them. So, if you use p:pipeline and want to specify an input URI or file, you must use your implementer's way of getting XML to the source port of the pipeline. Likewise, if you want to specify a fixed output URI or file, you must use your implementer's method. Example 3.2 shows the use of a pipeline with Calabash to collect the system properties.

Example 3.2. Using a pipeline with parameters from the application

<?xml version="1.0"?>

<p:pipeline xmlns:p="http://www.w3.org/ns/xproc">

<!--<p:input port='mysource' primary="false">   1
  <p:document href="sysprop.xml"/>
</p:input>
-->
  <p:string-replace match="/doc/episode/@value">
    <p:with-option name="replace"
        select="concat('&quot;', p:system-property('p:episode'),
                       '&quot;')"/>
  </p:string-replace>

  <p:string-replace match="/doc/language/@value">
    <p:with-option name="replace"
        select="concat('&quot;', p:system-property('p:language'),
                       '&quot;')"/>
  </p:string-replace>

  <p:string-replace match="/doc/product-name/@value">
    <p:with-option name="replace"
        select="concat('&quot;', p:system-property('p:product-name'),
                       '&quot;')"/>
  </p:string-replace>

  <p:string-replace match="/doc/product-version/@value">
    <p:with-option name="replace"
        select="concat('&quot;', p:system-property('p:product-version'),
                       '&quot;')"/>
  </p:string-replace>

  <p:string-replace match="/doc/vendor/@value">
    <p:with-option name="replace"
        select="concat('&quot;', p:system-property('p:vendor'),
                       '&quot;')"/>
  </p:string-replace>

  <p:string-replace match="/doc/vendor-uri/@value">
    <p:with-option name="replace"
        select="concat('&quot;', p:system-property('p:vendor-uri'),
                       '&quot;')"/>
  </p:string-replace>

  <p:string-replace match="/doc/version/@value">
    <p:with-option name="replace"
        select="concat('&quot;', p:system-property('p:version'),
                       '&quot;')"/>
  </p:string-replace>

  <p:string-replace match="/doc/xpath-version/@value">
    <p:with-option name="replace"
        select="concat('&quot;', p:system-property('p:xpath-version'),
                       '&quot;')"/>
  </p:string-replace>

  <p:string-replace match="/doc/psvi-supported/@value">
    <p:with-option name="replace"
        select="concat('&quot;', p:system-property('p:psvi-supported'),
                       '&quot;')"/>
  </p:string-replace>

</p:pipeline>


1

Trying to specify an input port produces an error


Norm suggests, if you happen to want a pipeline that:

1. Has a single non-sequence input
2. Has a single non-sequence output
3. And allows parameters

then p:pipeline is a convenient syntactic shorthand for the p:declare-step that would provide the same features.

We expect this to be a common case, so I'd probably suggest that most people start with p:pipeline most of the time.

If you want to be more explicit (have a little more control), then use p:declare-step.
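To make the shorthand concrete, here is a sketch of the same do-nothing pipeline written both ways. It assumes the usual reading of the specification, namely that p:pipeline implies a primary source input, a primary parameters input and a primary result output; the p:identity body is only a placeholder step chosen for illustration.

```xml
<!-- Short form: source, parameters and result ports are implied -->
<p:pipeline xmlns:p="http://www.w3.org/ns/xproc">
  <p:identity/>
</p:pipeline>

<!-- Long form: the same three ports declared by hand -->
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc">
  <p:input port="source"/>
  <p:input port="parameters" kind="parameter"/>
  <p:output port="result"/>
  <p:identity/>
</p:declare-step>
```

These are two separate pipeline documents shown together for comparison. The long form is where you would start if you needed, say, a sequence on the input or no parameter port at all.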

Internal connections

That provides the external links to the pipeline. So how do we do the same from within a pipeline? The implication is that the pipeline will do a fixed job, hence doesn't need any command line parameters. I think this needs explaining because of the defaults in place. They were put there for good reason, I'm sure, as time-saving devices, but I found them confusing at first.

As a good example of the connections that the defaults would otherwise make for you, take a look at example 1 in the CR/REC (duplicated here for convenience), which spells each of them out

Example 3.3. A linear pipeline example

<p:declare-step
          xmlns:p="http://www.w3.org/ns/xproc"
          name="xinclude-and-validate">
  <p:input port="source" primary="true"/>          <!-- 1 -->
  <p:input port="schemas" sequence="true"/>        <!-- 2 -->
  <p:output port="result">
    <p:pipe step="validated" port="result"/>       <!-- 3 -->
  </p:output>

  <p:xinclude name="included">
    <p:input port="source">                        <!-- 4 -->
      <p:pipe step="xinclude-and-validate"
              port="source"/>
    </p:input>
  </p:xinclude>

  <p:validate-with-xml-schema
       name="validated">
    <p:input port="source">
      <p:pipe step="included"
              port="result"/>                      <!-- 5 -->
    </p:input>
    <p:input port="schema">
      <p:pipe step="xinclude-and-validate"
              port="schemas"/>                     <!-- 6 -->
    </p:input>
  </p:validate-with-xml-schema>
</p:declare-step>

  

1

Main source - defaults to stdin

2

An ancillary input, for the XSD schema

3

An output, from the result port of a step called 'validated'

4

The XInclude step takes its input from the source port of the pipeline (the step attribute names the pipeline itself, xinclude-and-validate), hence from stdin

5

The validation step takes its input from the result port of the 'included' step (note this is a pipe connection between two steps)

6

This is an ancillary input for the schema, taking its input from the schemas port of the pipeline


It's worth spending time getting your head round that one. Draw out the steps and the connections between them if it helps, or talk them through with yourself. You'll get the feel of it after a while. It's just strange at first.

Figure 3.1 shows this graphically

Figure 3.1. Graphical equivalent of Example 3.3



Notice from this the explicit connections in the pipeline, shown in the diagram.

1 shows as stdin
2 shows as schemas
3 is the first use of a pipe between two steps. This connects the result port of the validated step to the final output of the pipeline, shown as stdout.
4 shows as the connection from stdin to the source port of the included step
5 shows as a pipe between the result port of the included step and the source port of the validated step
6 shows as a connection between the schemas port on the overall pipeline and the schema input port on the step validated

This pipeline could have been written less verbosely, but it is nice to see how explicit connections can be named and used in more complex pipelines. Xproc CR shows this in the abbreviated form, using all defaults.
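As a sketch of what "less verbosely" could look like, the version below drops every connection that the defaults would make anyway. It is a reconstruction under stated assumptions rather than a copy of the CR text, which may differ in detail: the only pipe left is the one to the non-primary schema port, since defaults only apply to primary ports.

```xml
<p:declare-step
          xmlns:p="http://www.w3.org/ns/xproc"
          name="xinclude-and-validate">
  <p:input port="source" primary="true"/>
  <p:input port="schemas" sequence="true"/>
  <!-- No pipe needed: result defaults to the last step's primary output -->
  <p:output port="result"/>

  <!-- Reads the pipeline's source port by default -->
  <p:xinclude/>

  <!-- source defaults to the result of p:xinclude;
       schema is not primary, so it must still be connected by hand -->
  <p:validate-with-xml-schema>
    <p:input port="schema">
      <p:pipe step="xinclude-and-validate"
              port="schemas"/>
    </p:input>
  </p:validate-with-xml-schema>
</p:declare-step>
```

Note that the step names 'included' and 'validated' have gone too. Names are only needed on steps that something else points back to.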

James Sulak has a post on his site which gives another view of internal connections

Bridging a gap

Given a pipeline where everything uses defaults, flowing one into another and finally to the default result port, it may become necessary to interrupt that flow for some reason. In order to 'bridge' the gap produced, it is necessary, on one step, to create a link back to a previous step manually.

This technique may also be used to link back to earlier steps. The general principle is shown below


  pipeline
    step1
    step2
     (insert wanted here)
    step3

becomes
  pipeline
    step1
    step2
     xxxxx - inserted step
    step3
      input port='source'
        pipe
          step='step2' (or some earlier step)
          port='result'
    

This creates a pipe back from within step 3, over the inserted step back to step 2. This is illustrated in Figure 3.2

Figure 3.2. Graphical representation of bridging

Graphical equivalent of previous example


Syntactically this is shown in Example 3.4

Example 3.4. Bridging across steps

<p:declare-step xmlns:p="http://www.w3.org/ns/xproc" name="props">

  <p:input port="source" kind="document">
    <p:document href="sysprop.xml"/>
  </p:input>

  <p:input port="parameters" kind="parameter" primary="true"/>

  <p:variable name="product-name" select="'Fribble Widgets'"/>

  <p:string-replace match="/doc/episode/@value">
    <p:with-option name="replace"
        select="concat('&quot;', p:system-property('p:episode'),
                       '&quot;')"/>
  </p:string-replace>

  <!-- 1 -->
  <p:string-replace name="sr3"
                    match="/doc/language/@value">
    <p:with-option name="replace"
        select="concat('&quot;',
                       p:system-property('p:language'),
                       '&quot;')"/>
  </p:string-replace>

  <!-- 2: This 'breaks' the default flow -->
  <p:identity/>

  <p:string-replace match="/doc/product-name/@value">
    <p:with-option name="replace"
        select="concat('&quot;',
                       p:system-property('p:product-name'),
                       '&quot;')"/>

    <!-- This re-joins the link,
      between step sr3 and this step -->
    <p:input port="source">              <!-- 3 -->
      <!-- 4 is the step attribute, 5 is the port attribute -->
      <p:pipe
          step="sr3"
          port="result"/>
    </p:input>
  </p:string-replace>

  <p:store href="op.xml"/>

</p:declare-step>
  

1

The earlier step needs an identifying name attribute

2

The p:identity element breaks the default flow. Or would if it were other than an identity step.

3

Within the succeeding step, add an input specification, using p:input with its port attribute set to source

4

Select the appropriate step as the input to this one

5

And select the result port as the one to which this step should be connected.


This shows how a step is connected to one other than the immediate preceding-sibling step.

Default connections

The basic statement for Xproc is that inputs precede steps and steps flow content sequentially from one to another in the sequence in which they are written. To model a pipeline of 3 steps


pipeline
  step1
  step2
  step3
/pipeline  


Without explicit connections, the input from outside the pipeline is connected to the pipeline top level input (source port) which in turn is connected to the first step (step1) source port. The result port of step1 is connected to the source port of step 2 and so on, until the output (result port) of step3 is connected to the pipeline top level output which then shows as stdout on the diagram.

The two terms used here are the primary readable input (the source port), which is the pipeline's main input, and the top level output port, which shows as stdout on the diagram above.

If this default action matches your needs, then you don't need to add explicit connections. Simply place the steps in an appropriate sequence within the pipeline.
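A minimal sketch of such a pipeline follows. The three steps are arbitrary standard steps picked for illustration, not taken from any particular example; the point is only that nothing in the document says how they are joined.

```xml
<p:pipeline xmlns:p="http://www.w3.org/ns/xproc">
  <!-- step1: reads the pipeline's implicit source port -->
  <p:xinclude/>
  <!-- step2: reads the result port of step1 -->
  <p:delete match="comment()"/>
  <!-- step3: reads the result port of step2; its own result
       feeds the pipeline's implicit result port -->
  <p:identity/>
</p:pipeline>
```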

One final word on connections. It is an error (see Xproc 2.4) to connect inputs and outputs such that a loop is created.

James blogged about default and explicit links with a good example.

Ports and their defaults

This is a collation of information I obtained when I tried to understand how input ports and parameter ports are used for p:xslt.

Of note is the fact that a parameter port is a special kind of port. It has different binding rules from input ports.

Firstly, don't mix up a "parameter port", which is like a pipe, and a "parameter", which is data travelling inside the pipe.

For example, taking the xslt 'step'

Example 3.5. 

<p:declare-step type="p:xslt" xml:id="xslt">
     <p:input port="source" sequence="true" primary="true"/>
     <p:input port="stylesheet"/>
     <p:input port="parameters" kind="parameter"/>
     <p:output port="result" primary="true"/>
     <p:output port="secondary" sequence="true"/>
     <p:option name="initial-mode"/>
     <p:option name="template-name"/>
     <p:option name="output-base-uri"/>
     <p:option name="version"/>
</p:declare-step>

All the inputs are 'required', although some can be defaulted; the conditions under which that happens aren't straightforward.

The primary data port is first, the stylesheet data port is next, and the parameter port is third.

Note

The parameter input port of the step, whilst not marked as primary, is the only parameter input port on the step, hence it implicitly becomes the primary parameter input port!

The Xproc processor always attempts to find a default binding based on the so-called "default readable" port. For normal inputs, it is usually the output of the preceding step, but for parameter input ports it is the primary parameter input port of the containing pipeline. If that container is the top-level pipeline, then the association is implementor-defined!

If the processor fails to find or manufacture a default binding, you will get an error.

Important

Even if you don't want to use parameters, you must satisfy the requirement for sourcing the parameter port!

Note, from Xproc

If no binding is provided for a primary input port, the input will be bound to the default readable port. It is a static error (err:XS0032) if no binding is provided and the default readable port is undefined.

With regard to p:xslt, you don't technically have to have any child elements as long as the default readable port is defined and you don't mind if all your inputs are bound to that.

If you leave a parameter input port unbound, there are default rules for that too. And if there's nothing for the default to bind to, that's an error.

So, one of the following must be true:

  1. You declared a parameter input port on your top-level pipeline.

  2. You used p:pipeline to declare your top-level pipeline (this satisfies point 1 by default)

  3. You provided an explicit binding for the 'parameters' input port on your p:xslt step.

Here's how it works for parameter ports.

  1. If you don't specify a binding for the 'parameters' port, then it binds by default to the parameter port of the pipeline that contains it. This way, parameters you pass to the pipeline automatically get passed to the steps that can use them.

  2. If there is no binding for a parameter input port on the top level pipeline (the one that you start executing first), then it effectively is bound to an empty sequence.

    Parameter input ports always accept a sequence, so if you don't pass any documents to it, that's just an empty sequence. But that's not exactly the same as binding it to p:empty.

  3. If you declare your pipeline with <p:pipeline>, you get a parameter input port by default and things "just work".

  4. If you declare your pipeline with <p:declare-step>, then you have to either remember to provide a parameter input port explicitly:

 <p:declare-step ...>
   <p:input port="parameters" kind="parameter"/>
   <p:input port="source"/>
   ...

Or you have to remember to explicitly provide a binding when you use the XSLT step:

 <p:xslt>
   <p:input port="parameters">
     <p:empty/>
   </p:input>
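Put together as a complete pipeline, the second option might look like the sketch below. The stylesheet name style.xsl is a hypothetical placeholder, and the pipeline deliberately declares no parameter input port of its own, so the explicit p:empty binding is what keeps the p:xslt step satisfied.

```xml
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc">
  <p:input port="source"/>
  <p:output port="result"/>

  <p:xslt>
    <!-- Hypothetical stylesheet file; stylesheet is not a primary
         port, so it always needs an explicit binding -->
    <p:input port="stylesheet">
      <p:document href="style.xsl"/>
    </p:input>
    <!-- No parameters wanted: bind the port to nothing, explicitly -->
    <p:input port="parameters">
      <p:empty/>
    </p:input>
  </p:xslt>
</p:declare-step>
```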

A step could define more than one parameter input port (some standard steps do). The defaulting rules for the primary parameter input port (if there is one) and the non-primary ones are a little different. The primary one gets bound back to the pipeline parameters; the non-primary ones just get an empty sequence if left unbound.

So what about the primary parameter input port of the containing pipeline? If there is no binding there, is it bound to an empty sequence, or not?

It's bound to whatever the implementation decides to bind it to. How inputs are connected to XML documents outside the pipeline is implementation-defined.

In Calabash, if you pass a binding for that port on the command line, that's what it gets bound to. If you pass parameters on the command line, Calabash manufactures a c:parameter-set with those parameters and that's what it gets bound to. If you do neither of those, it gets bound to an empty sequence.