Chapter 3. Text inclusions

The general form of a text inclusion is:

  |<xi:include xmlns:xi='http://www.w3.org/2001/XInclude'
  |            href='/path/to/document.txt'
  |            parse='text'
  |            fragid='text(…)'/>

The parse attribute must be present and must have the value text; that’s what makes it a text inclusion. The fragment identifier is optional; if it’s not present, the entire document is included. The attribute xpointer can be used instead of fragid, but that’s discouraged because technically an XPointer can only refer to an XML document.

Including the example document from Chapter 2, XML inclusions, with parse="text" inserts the whole file as text:

Example 3.1. Text inclusion without a fragment identifier

  |<xi:include href="abstraction.xml" parse="text"/>

Subexample 3.1.1. The XInclude

  |<blockquote xmlns="http://docbook.org/ns/docbook" version="5.2">
  |<title>Abstraction</title>
  |<attribution><personname>Paul Hudak</personname></attribution>
  |<para xml:id="abs"><quote>Abstraction, abstraction and abstraction.</quote>
  |This is the answer to the question, <quote>What are the three most
  |important words in programming?</quote></para>
  |</blockquote>

Subexample 3.1.2. What’s included

3.1. Text fragment identifier schemes

char=

A char= fragment identifier is interpreted according to RFC 5147 with integrity checking.

Example 3.2. Text inclusion with a char identifier

  |<xi:include href="abstraction.xml" parse="text" fragid="char=68,87"/>

Subexample 3.2.1. The XInclude

  |tle>Abstraction</ti

Subexample 3.2.2. What’s included
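
The arithmetic behind char=68,87 can be checked with a few lines of Python. This is only a sketch of the RFC 5147 position rule, not SInclude’s implementation; the helper name char_range is made up for illustration, and the sample assumes each line ending counts as a single character.

```python
# Hypothetical helper, not part of SInclude: select an RFC 5147 character range.
def char_range(text: str, start: int, end: int) -> str:
    # RFC 5147 positions count the gaps between characters, with position 0
    # before the first character, so a range maps directly onto a slice.
    return text[start:end]


# The first two lines of abstraction.xml, with single-character line endings.
sample = (
    '<blockquote xmlns="http://docbook.org/ns/docbook" version="5.2">\n'
    "<title>Abstraction</title>\n"
)

print(char_range(sample, 68, 87))  # prints: tle>Abstraction</ti
```

The first line is 64 characters long plus its line ending, so position 68 falls just before the second “t” of title.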

line=

A line= fragment identifier is interpreted according to RFC 5147 with integrity checking.

Example 3.3. Text inclusion with a line identifier

  |<xi:include href="abstraction.xml" parse="text" fragid="line=3,5"/>

Subexample 3.3.1. The XInclude

  |<para xml:id="abs"><quote>Abstraction, abstraction and abstraction.</quote>
  |This is the answer to the question, <quote>What are the three most

Subexample 3.3.2. What’s included
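
Line positions follow the same rule as character positions: they count the gaps between lines, starting at 0 before the first line. That’s why line=3,5 selects the fourth and fifth lines of the file. A minimal sketch, assuming a made-up helper named line_range:

```python
# Hypothetical helper, not part of SInclude: select an RFC 5147 line range.
def line_range(text: str, start: int, end: int) -> str:
    # Position 0 is before the first line, position 1 is between the first
    # and second lines, and so on, so the range is a slice of the line list.
    lines = text.splitlines(keepends=True)
    return "".join(lines[start:end])


# A seven-line stand-in for a real file.
sample = "".join(f"line {n}\n" for n in range(1, 8))

print(line_range(sample, 3, 5), end="")
# prints:
# line 4
# line 5
```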

L#-L#

This scheme is the loosely documented format supported by GitHub. It identifies a line or a range of lines by number, counting from 1. For example, L3 identifies line 3 and L3-L7 identifies lines 3 through 7, inclusive.

Example 3.4. Text inclusion with a L#-L# identifier

  |<xi:include href="abstraction.xml" parse="text" fragid="L3-L5"/>

Subexample 3.4.1. The XInclude

  |<attribution><personname>Paul Hudak</personname></attribution>
  |<para xml:id="abs"><quote>Abstraction, abstraction and abstraction.</quote>
  |This is the answer to the question, <quote>What are the three most

Subexample 3.4.2. What’s included
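
Unlike line=, this scheme names the lines themselves, counting from 1, and the range is inclusive at both ends. The following sketch shows the difference; the function github_lines and its regular expression are my own illustration and cover only the two forms described above.

```python
import re


# Hypothetical helper, not part of SInclude: select lines with L# or L#-L#.
def github_lines(text: str, fragid: str) -> str:
    match = re.fullmatch(r"L(\d+)(?:-L(\d+))?", fragid)
    if match is None:
        raise ValueError(f"not an L#-L# identifier: {fragid!r}")
    first = int(match.group(1))
    # A single L# selects just that one line.
    last = int(match.group(2)) if match.group(2) else first
    lines = text.splitlines(keepends=True)
    # Lines are numbered from 1 and the last line is included.
    return "".join(lines[first - 1:last])


# A seven-line stand-in for a real file.
sample = "".join(f"line {n}\n" for n in range(1, 8))

print(github_lines(sample, "L3-L5"), end="")
# prints:
# line 3
# line 4
# line 5
```

Compare this with line=3,5, which selects only lines 4 and 5 of the same file.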

search=

The search= fragment identifier locates lines by searching within the text.

Example 3.5. Text inclusion with a search identifier

  |<xi:include href="abstraction.xml" parse="text"
  |            fragid="search=/&lt;para/,#/para#"/>

Subexample 3.5.1. The XInclude

  |<para xml:id="abs"><quote>Abstraction, abstraction and abstraction.</quote>
  |This is the answer to the question, <quote>What are the three most
  |important words in programming?</quote></para>

Subexample 3.5.2. What’s included

3.1.1. RFC 5147 integrity checking

Both the char= and line= flavors of RFC 5147 identifiers (and the search= extension scheme) support either file size or MD5 integrity checking. The fragment identifier line=23,67;length=3134 will fail unless the identified file is 3,134 bytes long. Alternatively, line=23,67;md5=135b35933056ba8d06e8d3f5f4ecd318 will fail unless the file has an MD5 message digest equal to 135b35933056ba8d06e8d3f5f4ecd318.

Example 3.6. Text inclusion with integrity checking

  |<xi:include href="abstraction.xml" parse="text"
  |            fragid="line=3,5;md5=d6090e3280649716833e3c33269d1892"/>

Subexample 3.6.1. The XInclude

  |<para xml:id="abs"><quote>Abstraction, abstraction and abstraction.</quote>
  |This is the answer to the question, <quote>What are the three most

Subexample 3.6.2. What’s included

Many systems come with a program named md5 that will compute the MD5 hash of a file:

  |$ md5 abstraction.xml
  |MD5 (abstraction.xml) = d6090e3280649716833e3c33269d1892

Alternatively, you can specify an incorrect hash in the fragment identifier and SInclude will tell you what it was expecting when the integrity check fails.
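
If you don’t have an md5 program handy, the same checks are easy to reproduce in Python. This is a sketch of what the length= and md5= options test, assuming both are computed over the raw bytes of the file; check_integrity is a made-up name and the error messages are not SInclude’s.

```python
import hashlib


# Hypothetical helper, not part of SInclude: mimic the length= and md5= checks.
def check_integrity(data: bytes, length=None, md5=None) -> None:
    if length is not None and len(data) != length:
        raise ValueError(f"length check failed: file is {len(data)} bytes")
    if md5 is not None:
        actual = hashlib.md5(data).hexdigest()
        # Hex digests are compared without regard to case.
        if actual != md5.lower():
            raise ValueError(f"md5 check failed: file digest is {actual}")


data = b"Abstraction, abstraction and abstraction.\n"
digest = hashlib.md5(data).hexdigest()

check_integrity(data, length=len(data))  # passes silently
check_integrity(data, md5=digest)        # passes silently
```

To check a real file, read it in binary mode, for example with open("abstraction.xml", "rb"), and pass the bytes to the function.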

3.2. Text searching

The search scheme has no official standard. I invented it a few years ago. The idea is that, instead of using explicit character or line references as RFC 5147 does, the user identifies the lines by what they contain.

Expressed in a lazy pseudo-BNF, it looks like this:

  |search      = "search=" startSearch? ("," endSearch?)? (";" searchOpt?)?
  |startSearch = searchExpr (";" startOpt?)?
  |endSearch   = searchExpr (";" endOpt?)?
  |searchExpr  = ([0-9]+)? (.) (.*?) \2
  |startOpt    = "from" | "after" | "trim"
  |endOpt      = "to" | "before" | "trim"
  |searchOpt   = "strip" | RFC 5147 integrity checks

The core of the syntax is the searchExpr. A search expression is an optional number, followed by any quote character, followed by a string delimited by a second occurrence of the quote character. The number allows you to find a specific occurrence of the string.

The expression 3/abcde/ finds the third line that contains the string “abcde”. So do 3#abcde# and 3xabcdex. If you leave the occurrence number out, it defaults to 1: /marker text/ finds the first line that contains the string “marker text”.
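
The searchExpr rule translates almost directly into a regular expression. The following Python sketch parses a standalone search expression and finds the matching line; the names parse_search_expr and find_line are made up for illustration, and a full parser would also have to handle the options and the comma that can follow an expression.

```python
import re

# Optional occurrence number, any quote character, the search string,
# and a second occurrence of the same quote character.
SEARCH_EXPR = re.compile(r"([0-9]+)?(.)(.*?)\2", re.DOTALL)


# Hypothetical helper: split a search expression into (occurrence, string).
def parse_search_expr(expr: str):
    match = SEARCH_EXPR.fullmatch(expr)
    if match is None:
        raise ValueError(f"not a search expression: {expr!r}")
    # The occurrence number defaults to 1 when it is left out.
    count = int(match.group(1)) if match.group(1) else 1
    return count, match.group(3)


# Hypothetical helper: index of the count-th line containing needle, or None.
def find_line(lines, count, needle):
    seen = 0
    for index, line in enumerate(lines):
        if needle in line:
            seen += 1
            if seen == count:
                return index
    return None


print(parse_search_expr("3/abcde/"))       # prints: (3, 'abcde')
print(parse_search_expr("3xabcdex"))       # prints: (3, 'abcde')
print(parse_search_expr("/marker text/"))  # prints: (1, 'marker text')
```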

If you don’t specify a start expression, inclusion starts at the beginning of the file. If you don’t specify an end expression, all of the file after the starting match is included. It’s an error if the starting expression is specified and it never matches.

After that, it’s just a matter of a few useful options. On search expressions, the default options are from (for the start) and to (for the end). They specify that the matched line is included. The values after and before specify that the matched line is not included. The value trim specifies not only that the matched line is not included, but also that any leading (in the case of start) or trailing (in the case of end) lines that consist entirely of whitespace are trimmed away.
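
To make the options concrete, here is a small Python sketch of how the start and end options change what’s included once the two matching lines have been found. The function apply_options is hypothetical; it takes the indexes of the matched lines as given and assumes the end match does not come before the start match.

```python
# Hypothetical helper, not part of SInclude: apply start and end options to
# a list of lines, given the indexes of the matched start and end lines.
def apply_options(lines, start, end, start_opt="from", end_opt="to"):
    # "from" and "to" keep the matched line; every other option drops it.
    first = start if start_opt == "from" else start + 1
    last = end if end_opt == "to" else end - 1
    selected = lines[first:last + 1]
    # "trim" also drops whitespace-only lines next to the dropped match.
    if start_opt == "trim":
        while selected and selected[0].strip() == "":
            selected.pop(0)
    if end_opt == "trim":
        while selected and selected[-1].strip() == "":
            selected.pop()
    return selected


lines = ["# begin", "", "body 1", "body 2", "", "# end"]

print(apply_options(lines, 0, 5))
# prints: ['# begin', '', 'body 1', 'body 2', '', '# end']
print(apply_options(lines, 0, 5, "after", "before"))
# prints: ['', 'body 1', 'body 2', '']
print(apply_options(lines, 0, 5, "trim", "trim"))
# prints: ['body 1', 'body 2']
```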

The top-level search option strip specifies that whitespace stripping should be performed on the start of each included line. The smallest indent among the included lines is determined and that number of whitespace characters is removed from the beginning of each line. The other top-level search options are the RFC 5147 integrity check options.
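
The effect of strip can be sketched in a few lines of Python. The function strip_indent is made up for illustration, and it assumes that lines consisting entirely of whitespace are ignored when the smallest indent is computed; I haven’t specified that detail above, so treat it as a guess about sensible behavior rather than a description of SInclude.

```python
# Hypothetical helper, not part of SInclude: remove the common indent.
def strip_indent(lines):
    # Assumption: whitespace-only lines don't count toward the smallest indent.
    indents = [len(line) - len(line.lstrip()) for line in lines if line.strip()]
    smallest = min(indents, default=0)
    # Remove that many leading characters from every line.
    return [line[smallest:] for line in lines]


included = [
    "    def f():",
    "        return 1",
    "",
    "    x = f()",
]

for line in strip_indent(included):
    print(line)
# prints:
# def f():
#     return 1
#
# x = f()
```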