4DOM Extensions

These are utility classes and functions that provide capabilities not yet specified in the DOM spec. Some of these facilities, such as factories and readers, are expected to be specified in later levels of the DOM, so we keep our proprietary interfaces simple for now to make migration less painful when relevant standards emerge.

Reading

The Reader package allows you to parse source strings in XML and HTML into DOM trees. You select a reader module according to the nature of your input. The readers that come with 4DOM are as follows:

The following two examples illustrate using PyExpat and HtmlLib readers. Replace with the appropriate module and use in your own code.

Module PyExpat

Classes

Class Reader

fromStream

fromStream(stream, ownerDoc)

return a 4DOM node from the given stream
Parameters

stream of type Python file object

The stream to be read for XML text

ownerDoc of type xml.dom.Document.Document

A document to be used as owner of all the created nodes. If None, a new document instance is created for the nodes. Default is None.

Return Value

a new document instance, or if the ownerDoc argument was not None, a document fragment. In either case, the returned node roots the created XML tree.

fromString

fromString(xmlString, ownerDoc)

return a 4DOM node from the given string
Parameters

xmlString of type string or unicode object

The string to be parsed for XML text

ownerDoc of type xml.dom.Document.Document

A document to be used as owner of all the created nodes. If None, a new document instance is created for the nodes. Default is None.

Return Value

a new document instance, or if the ownerDoc argument was not None, a document fragment. In either case, the returned node roots the created XML tree.
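
As a rough illustration of the fromString and fromStream pattern, the sketch below parses the same XML from a string and from a stream. It deliberately uses the standard library's xml.dom.minidom as a stand-in so that it runs without 4DOM installed; with 4DOM the corresponding methods would be called on a PyExpat Reader instance instead. The sample document is made up for the example, and the ownerDoc behaviour described above is not reproduced.

```python
import io
from xml.dom import minidom

XML_TEXT = "<order id='42'><item>widget</item></order>"

# Analogue of fromString: parse XML held in a string.
doc_from_string = minidom.parseString(XML_TEXT)

# Analogue of fromStream: parse XML read from a file-like object.
stream = io.BytesIO(XML_TEXT.encode("utf-8"))
doc_from_stream = minidom.parse(stream)

# Both calls return a Document node that roots the parsed tree.
root = doc_from_string.documentElement
print(root.tagName, root.getAttribute("id"))
print(doc_from_stream.documentElement.firstChild.tagName)
```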

fromUri

fromUri(uri, ownerDoc)

return a 4DOM node from the given uri
Parameters

uri of type string

The uri from which XML text is to be retrieved

ownerDoc of type xml.dom.Document.Document

A document to be used as owner of all the created nodes. If None, a new document instance is created for the nodes. Default is None.

Return Value

a new document instance, or if the ownerDoc argument was not None, a document fragment. In either case, the returned node roots the created XML tree.

Module HtmlLib

Classes

Class Reader

fromStream

fromStream(stream, ownerDoc, charset)

return a 4DOM node from the given stream
Parameters

stream of type Python file object

The stream to be read for HTML text

ownerDoc of type xml.dom.Document.Document

A document to be used as owner of all the created nodes. If None, a new document instance is created for the nodes. Default is None.

charset of type string

The character set of the HTML text. If None or empty string, the default is ISO-8859-1. Default is empty string.

Return Value

a new document instance, or if the ownerDoc argument was not None, a document fragment. In either case, the returned node roots the created HTML tree.
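
The charset rule above (None or an empty string falls back to ISO-8859-1) can be sketched in isolation. The helper name decode_html_bytes is hypothetical and not part of 4DOM, and how HtmlLib applies the charset internally is not described here; the sketch only shows the defaulting logic applied to raw bytes.

```python
def decode_html_bytes(raw, charset=""):
    """Decode raw HTML bytes, defaulting to ISO-8859-1 when no charset is given."""
    # None and '' are both falsy, so either one selects the default.
    return raw.decode(charset or "iso-8859-1")


latin1_bytes = b"<p>caf\xe9</p>"
utf8_bytes = "<p>caf\u00e9</p>".encode("utf-8")

default_text = decode_html_bytes(latin1_bytes)          # empty charset
none_text = decode_html_bytes(latin1_bytes, None)       # explicit None
utf8_text = decode_html_bytes(utf8_bytes, "utf-8")      # explicit charset
```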

fromString

fromString(htmlString, ownerDoc, charset)

return a 4DOM node from the given string
Parameters

htmlString of type string or unicode object

The string to be parsed for HTML text

ownerDoc of type xml.dom.Document.Document

A document to be used as owner of all the created nodes. If None, a new document instance is created for the nodes. Default is None.

charset of type string

The character set of the HTML text. If None or empty string, the default is ISO-8859-1. Default is empty string.

Return Value

a new document instance, or if the ownerDoc argument was not None, a document fragment. In either case, the returned node roots the created HTML tree.

fromUri

fromUri(uri, ownerDoc, charset)

return a 4DOM node from the given uri
Parameters

uri of type string

The uri from which HTML text is to be retrieved

ownerDoc of type xml.dom.Document.Document

A document to be used as owner of all the created nodes. If None, a new document instance is created for the nodes. Default is None.

charset of type string

The character set of the HTML text. If None or empty string, the default is ISO-8859-1. Default is empty string.

Return Value

a new document instance, or if the ownerDoc argument was not None, a document fragment. In either case, the returned node roots the created HTML tree.

Printing/Writing

The Printer module allows you to write a text representation of DOM nodes to an output stream, including stdout. Note that limitations in the SAX interface used to parse XML files, and in the DOM spec itself, make it impossible at this point to guarantee an unchanged "round trip". That is, if you use a reader to build a DOM node from text and then use the Printer to turn it back into text, there may be differences, and some may be significant.

The easiest way to use the Printer module is through the front-end functions in the xml.dom.ext package.

xml.dom.ext.Print

xml.dom.ext.Print(root, stream, encoding)

Render the DOM tree to text with no special formatting.
Parameters

root of type xml.dom.Node

The node to be printed, with all its children recursively.

stream of type output stream

The output stream. Note: can be a StringIO object if you want to generate a string instead. Default is sys.stdout.

encoding of type string

The character encoding to use for output. Default is 'UTF-8'.

Return Value
None
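
The note about passing a StringIO object as the stream can be illustrated with a small sketch. It uses the standard library's xml.dom.minidom and its writexml method as a stand-in for xml.dom.ext.Print, so it runs without 4DOM; the sample document is invented for the example.

```python
import io
from xml.dom import minidom

doc = minidom.parseString("<root><item>x</item></root>")

# Render into an in-memory buffer instead of sys.stdout.
buffer = io.StringIO()
doc.writexml(buffer)

# The buffer now holds the serialized tree as a string.
text = buffer.getvalue()
print(text)
```

Writing to a buffer first is convenient when the serialized text needs further processing before it is sent anywhere.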

xml.dom.ext.PrettyPrint

xml.dom.ext.PrettyPrint(root, stream, encoding, indent, width, preserveElements)

Render the DOM tree to text, with added indentation and new-lines for enhanced readability.
Parameters

root of type xml.dom.Node

The node to be pretty-printed, with all its children recursively.

stream of type output stream

The output stream. Note: can be a StringIO object if you want to generate a string instead. Default is sys.stdout.

encoding of type string

The character encoding to use for output. Default is 'UTF-8'.

indent of type string

The amount by which nested constructs are indented when printed on a fresh line. Default is '\t'.

width of type positive integer

The width of the output console. Used to make line-break decisions. Default is 80.

preserveElements of type list of strings, each of which is an SGML generic identifier.

Specifies elements in which white-space shouldn't be added. Note that white-space is never added to in-line elements in an HTMLDocument. Default is None.

Return Value
None
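
To show what an indent string such as '\t' does to the output, here is a minimal sketch using the standard library's toprettyxml as an analogue of PrettyPrint. It is not the 4DOM implementation, and minidom has no counterpart to the width and preserveElements parameters.

```python
from xml.dom import minidom

doc = minidom.parseString("<root><item>x</item><item>y</item></root>")

# indent plays the role of PrettyPrint's indent argument; newl is the line break.
pretty = doc.toprettyxml(indent="\t", newl="\n")
print(pretty)
```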

xml.dom.ext.XHtmlPrint

xml.dom.ext.XHtmlPrint(root, stream, encoding)

Render an HTML DOM tree as XHTML with no special indentation or formatting.
Parameters

root of type xml.dom.Node

The HTML node to be printed, with all its children recursively.

stream of type output stream

The output stream. Note: can be a StringIO object if you want to generate a string instead. Default is sys.stdout.

encoding of type string

The character encoding to use for output. Default is 'UTF-8'.

Return Value
None

xml.dom.ext.XHtmlPrettyPrint

xml.dom.ext.XHtmlPrettyPrint(root, stream, encoding, indent, width, preserveElements)

Render an HTML DOM tree to text, with added indentation and new-lines for enhanced readability.
Parameters

root of type xml.dom.Node

The node to be pretty-printed, with all its children recursively.

stream of type output stream

The output stream. Note: can be a StringIO object if you want to generate a string instead. Default is sys.stdout.

encoding of type string

The character encoding to use for output. Default is 'UTF-8'.

indent of type string

The amount by which nested constructs are indented when printed on a fresh line. Default is '\t'.

width of type positive integer

The width of the output console. Used to make line-break decisions. Default is 80.

preserveElements of type list of strings, each of which is an SGML generic identifier.

Specifies elements in which white-space shouldn't be added. Note that white-space is never added to in-line elements in an HTMLDocument. Default is None.

Return Value
None

Miscellaneous

xml.dom.ext.NodeTypeToInterface

xml.dom.ext.NodeTypeToInterface(nodeType)

Looks up a node type (as returned from getNodeType()) and returns the corresponding interface name.
Parameters

nodeType of type One of the integers defined as node types in xml.dom.Node

The node type to look up.

Return Value

string

Name of corresponding DOM interface from spec.
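
The lookup can be pictured as a table from node type constants to interface names. The sketch below is an assumption about the shape of such a table, built from the constants in the standard library's xml.dom.Node; the function name node_type_to_interface is hypothetical, and how 4DOM handles an unknown node type is not covered.

```python
from xml.dom import Node

# Interface names as used in the DOM specification, keyed by node type.
_INTERFACE_NAMES = {
    Node.ELEMENT_NODE: "Element",
    Node.ATTRIBUTE_NODE: "Attr",
    Node.TEXT_NODE: "Text",
    Node.CDATA_SECTION_NODE: "CDATASection",
    Node.ENTITY_REFERENCE_NODE: "EntityReference",
    Node.ENTITY_NODE: "Entity",
    Node.PROCESSING_INSTRUCTION_NODE: "ProcessingInstruction",
    Node.COMMENT_NODE: "Comment",
    Node.DOCUMENT_NODE: "Document",
    Node.DOCUMENT_TYPE_NODE: "DocumentType",
    Node.DOCUMENT_FRAGMENT_NODE: "DocumentFragment",
    Node.NOTATION_NODE: "Notation",
}


def node_type_to_interface(node_type):
    """Return the DOM interface name for a node type constant."""
    return _INTERFACE_NAMES[node_type]


print(node_type_to_interface(Node.ELEMENT_NODE))
```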

xml.dom.ext.StripHtml

xml.dom.ext.StripHtml(startNode, preserveElements)

Strips extraneous white-space from an HTML DOM tree.
Parameters

startNode of type xml.dom.Node

The node to be stripped, with all its children recursively.

preserveElements of type list of strings, each of which is an SGML generic identifier, or None to indicate an empty list.

Specifies elements from which white-space shouldn't be stripped. Note that white-space is never stripped from in-line elements in an HTMLDocument. Default is None.

Return Value

xml.dom.Node

The startNode with descendant ignorable white-space stripped.

xml.dom.ext.StripXml

xml.dom.ext.StripXml(startNode, preserveElements)

Strips extraneous white-space from an XML DOM tree. Takes xml:space attributes into account.
Parameters

startNode of type xml.dom.Node

The node to be stripped, with all its children recursively.

preserveElements of type list of strings, each of which is an SGML generic identifier, or None to indicate an empty list.

Specifies elements from which white-space shouldn't be stripped. Default is None.

Return Value

xml.dom.Node

The startNode with descendant ignorable white-space stripped.
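
One plausible reading of "extraneous white-space" is text nodes that contain nothing but white-space. The sketch below removes such nodes from a minidom tree while leaving the contents of preserved elements alone. The function name strip_whitespace is hypothetical, the xml:space handling mentioned above is left out, and the actual 4DOM rules may differ.

```python
from xml.dom import minidom, Node


def strip_whitespace(start_node, preserve_elements=None):
    """Remove whitespace-only text nodes below start_node and return it."""
    preserve = set(preserve_elements or [])
    # Iterate over a copy because children are removed during the loop.
    for child in list(start_node.childNodes):
        if child.nodeType == Node.TEXT_NODE and not child.data.strip():
            start_node.removeChild(child)
        elif child.nodeType == Node.ELEMENT_NODE and child.tagName not in preserve:
            strip_whitespace(child, preserve)
    return start_node


doc = minidom.parseString("<a>\n  <b>x</b>\n  <pre>  </pre>\n</a>")
stripped = strip_whitespace(doc.documentElement, ["pre"])
result = stripped.toxml()
print(result)
```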

xml.dom.ext.GetElementById

xml.dom.ext.GetElementById(startNode, targetId)

Returns the element node whose "ID" attribute is as given.
Parameters

startNode of type xml.dom.Node

The node whose descendants are to be searched.

targetId of type string conforming to XML ID type

The XML ID to find.

Return Value

xml.dom.Element

The element with the given ID, or None to indicate no match.
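
The search can be sketched as a depth-first walk over the descendants of the start node. This version runs on the standard library's minidom and assumes the attribute is literally named "ID", following the description above; the function name get_element_by_id and the attr_name parameter are hypothetical.

```python
from xml.dom import minidom, Node


def get_element_by_id(start_node, target_id, attr_name="ID"):
    """Return the first descendant element whose attr_name equals target_id."""
    for child in start_node.childNodes:
        if child.nodeType != Node.ELEMENT_NODE:
            continue
        if child.hasAttribute(attr_name) and child.getAttribute(attr_name) == target_id:
            return child
        # Not this element, so search its own descendants.
        found = get_element_by_id(child, target_id, attr_name)
        if found is not None:
            return found
    return None


doc = minidom.parseString('<a><b ID="x1"/><c><d ID="x2">hit</d></c></a>')
match = get_element_by_id(doc, "x2")
missing = get_element_by_id(doc, "nope")
```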

xml.dom.ext.GetAllNs

xml.dom.ext.GetAllNs(node)

Returns all the namespaces in effect on the given node, including the default namespace and the xml namespace.
Parameters

node of type xml.dom.Node

The node for which all in-scope namespaces are returned.

Return Value

dictionary

Dictionary mapping all in-scope namespace prefixes to URIs, with '' as prefix for the default namespace.
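
In-scope namespaces can be collected by walking from the node up through its ancestors and reading their namespace declarations, with the nearest declaration winning. The sketch below does this on a minidom tree. The function name get_all_ns is hypothetical, and the choice to include a '' entry only when a default namespace is actually declared is an assumption.

```python
from xml.dom import minidom, Node

XML_NAMESPACE = "http://www.w3.org/XML/1998/namespace"


def get_all_ns(node):
    """Return a dict of prefix -> URI for namespaces in scope at node."""
    # The xml prefix is always bound to the same fixed URI.
    namespaces = {"xml": XML_NAMESPACE}
    current = node
    while current is not None:
        if current.nodeType == Node.ELEMENT_NODE:
            for name, value in current.attributes.items():
                if name == "xmlns":
                    # setdefault keeps the innermost declaration seen so far.
                    namespaces.setdefault("", value)
                elif name.startswith("xmlns:"):
                    namespaces.setdefault(name[len("xmlns:"):], value)
        current = current.parentNode
    return namespaces


doc = minidom.parseString(
    '<a xmlns="urn:d" xmlns:p="urn:outer"><p:b xmlns:p="urn:inner"/></a>'
)
inner_ns = get_all_ns(doc.documentElement.firstChild)
outer_ns = get_all_ns(doc.documentElement)
```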

xml.dom.ext.XmlSpaceState

xml.dom.ext.XmlSpaceState(node)

Determines whether the xml:space state at a given node is "preserve" or "default" (See the XML 1.0 spec).
Parameters

node of type xml.dom.Node

The node whose space state is to be found.

Return Value

string

"preserve" or "default".

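The xml:space state is inherited: the nearest ancestor (or the node itself) that carries an xml:space attribute decides it, and "default" applies when none does. A minimal sketch of that lookup on a minidom tree follows; the function name xml_space_state is hypothetical and this is not the 4DOM code.

```python
from xml.dom import minidom, Node


def xml_space_state(node):
    """Return 'preserve' or 'default' for the xml:space state at node."""
    current = node
    while current is not None:
        if current.nodeType == Node.ELEMENT_NODE:
            value = current.getAttribute("xml:space")
            if value in ("preserve", "default"):
                return value
        current = current.parentNode
    # No declaration found on the node or any ancestor.
    return "default"


doc = minidom.parseString(
    '<a xml:space="preserve"><b><c xml:space="default"/></b></a>'
)
a = doc.documentElement
b = a.firstChild
c = b.firstChild
states = (xml_space_state(a), xml_space_state(b), xml_space_state(c))
plain_state = xml_space_state(minidom.parseString("<x><y/></x>").documentElement)
```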
xml.dom.ext.SplitQName

xml.dom.ext.SplitQName(qname)

Splits a valid QName from the XML Namespaces 1.0 spec into prefix and suffix (the local name in the case of element and attribute names, and the declared prefix in the case of namespace declarations).
Parameters

qname of type string matching QName production in XML Namespaces 1.0 spec

The name to be split.

Return Value

tuple with 2 items.

a tuple of the form (prefix, suffix). If there is exactly one colon in the qname, prefix is the part before and suffix the part after the colon. Otherwise prefix is '' and suffix is the entire input string.
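
The splitting rule described above is small enough to write out directly. The sketch below follows that description literally; the function name split_qname is hypothetical and the code is not taken from 4DOM.

```python
def split_qname(qname):
    """Split a QName into (prefix, suffix) following the rule described above."""
    if qname.count(":") == 1:
        prefix, suffix = qname.split(":")
        return (prefix, suffix)
    # No colon (or more than one): no prefix, whole string is the suffix.
    return ("", qname)


element_name = split_qname("xsl:template")
unprefixed = split_qname("para")
ns_declaration = split_qname("xmlns:xsl")
```

For a namespace declaration such as xmlns:xsl, the suffix is the declared prefix (xsl), which matches the parenthetical remark in the description.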

Method Summary
`fromStream`	return a 4DOM node from the given stream
`fromString`	return a 4DOM node from the given string
`fromUri`	return a 4DOM node from the given uri

Class Summary
`Reader`	Reusable utility to read XML documents.

Class Summary
`Reader`	Reusable utility to read HTML documents.