<?xml version="1.0"?>
<!DOCTYPE slideshow SYSTEM "../../slides/slides.dtd">
<slideshow> 
	 <title>EasySAX: SAX Made Pythonic</title> 
	 <slides> 
 
		  <slide>
				<title>About Me</title> 
				<points> 
					 <point>Paul Prescod</point>
					 <point>ISOGEN Consulting Engineer</point>
					 <point>Professional Services Arm of
						  <a href="http://www.datachannel.com">DataChannel</a></point>
					 <point>Co-author, <a href="http://www.xmlbooks.com">XML
						  Handbook</a></point>
				</points> 
		  </slide>

		  <div> 
		  <title>Laying the foundations</title> 
 
		  <slide>
				<title>Overview</title> 
				<points> 
					 <point>EasySAX merges ideas from 
						  <subpoints> 
								<point><a href="http://www.megginson.com">SAX</a></point>
								<point><a href="http://www.w3c.org/TR">DOM</a></point>
								<point>XSLT</point>
								<point>DSSSL</point>
						  </subpoints></point>
					 <point>Probably most similar to Saxon in the Java world.</point>
				</points> 
		  </slide>

		  <slide>
				<title>What is Python?</title> 
				<points> 
					 <point>A high-level, object-oriented, dynamically typed
						  programming <a href="http://www.python.org">language</a>.</point>
					 <point>Can be used for scripting, conversions,
						  abstraction-building.</point>
					 <point>A merge of the best ideas of Smalltalk, Perl and
						  Java?</point>
					 <point>What does it mean to be Pythonic?</point>
				</points> 
		  </slide>

		  <slide>
				<title>Meditate on this</title> 
				<points> 
					 <point>An object is Pythonic if it has the Python nature. 
						  <subpoints> 
								<point>There is no rule of thumb.</point>
								<point>There is no motto.</point>
								<point>There is no overriding design goal.</point>
								<point>There is only oneness with the problem you
									 are trying to solve and the way you think about it.</point>
						  </subpoints></point>
				</points> 
		  </slide>

		  <slide>
				<title>But master....</title> 
				<points> 
					 <point>How do we design things that are Pythonic? 
						  <subpoints> 
								<point>Aristotle's golden mean</point>
								<point>Make them simple, but not too simple to get the
									 job done.</point>
								<point>Elegant, but not in a cutesy way.</point>
								<point>Flexible, but not at the expense of
									 clarity.</point>
								<point>Dynamic, but not at the expense of
									 maintainability.</point>
								<point>Dictionaries are KUEL. Use them a lot.</point>
						  </subpoints></point>
				</points> 
		  </slide>

		  <slide>
				<title>Reflect on this</title> 
				<points> 
					 <point>At runtime it is possible to ask questions about
						  objects.</point>
					 <point>Reflection is a powerful tool.</point>
					 <point>It can be used in diabolical ways.</point>
					 <point>When used virtuously it eases readability and
						  maintenance.</point>
					 <point>This is Pythonic.</point>
				</points> 
		  </slide>
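
		  <slide>
				<title>Sketch: Asking an Object Questions</title> 
				<points> 
					 <point>A minimal sketch of runtime reflection using only
						  Python built-ins (type, hasattr, getattr). The Parrot class
						  and its speak method are invented for illustration.</point>
					 <pre><![CDATA[
```python
class Parrot:
    """Hypothetical class, used only to have something to inspect."""

    def speak(self):
        "Says what parrots say."
        return "Squawk"


bird = Parrot()

# Ask the object what it is and what it can do.
print(type(bird).__name__)
print(hasattr(bird, "speak"))
print(hasattr(bird, "fly"))

# Look a method up by name at runtime, read its docstring, then call it.
method = getattr(bird, "speak")
print(method.__doc__)
print(method())
```
]]></pre>
					 <point>Looking methods up by name and reading their
						  docstrings is the kind of reflection that can replace
						  hand-written dispatch code.</point>
				</points> 
		  </slide>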

		  </div> 
		  <div> 
		  <title>Reflecting on SAX</title> 
 
		  <slide>
				<title>What is SAX?</title> 
				<points> 
					 <point>SAX is a low-level API to XML</point>
					 <point>Performance is a key consideration</point>
					 <point>It is simple, but not easy.</point>
					 <point>Increasingly, it is no longer simple.</point>
					 <point>It is relatively, but not completely, complete.</point>
				</points> 
		  </slide>

		  <slide>
				<title>Does SAX have the Python Nature?</title> 
				<points> 
					 <point>Complexity is Pythonic if it is hidden.</point>
					 <point>Good performance is Pythonic.</point>
					 <point>Standards conformance is Pythonic.</point>
					 <point>Re-inventing wheels is <emph>not</emph>
						  Pythonic.</point>
					 <point>Therefore we must use SAX, but hide it.</point>
				</points> 
		  </slide>

		  <slide>
				<title>What must be hidden?</title> 
				<points> 
					 <point>SAX character handling is inelegant. 
						  <subpoints> 
								<point>SAX gives you a pointer into a buffer.</point>
								<pre>
def characters(self, ch, start, length):
    print(ch[start:start+length])</pre> 
								<point>Pythonistas would just expect a string
									 object.</point>
						  <pre>
def characters( self, chars ):
    print(chars)</pre> 
				</subpoints></point>
	 </points>
</slide>
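
<slide>
	 <title>Sketch: Hiding the Buffer</title> 
	 <points> 
		  <point>One way the buffer-style callback could be hidden is an adapter
				that slices once and forwards a plain string. This is a sketch,
				not the EasySAX implementation: the class names and the text()
				hook are hypothetical.</point>
		  <pre><![CDATA[
```python
class StringCharacters:
    """Adapts buffer-style character events to plain strings.

    Assumes the (ch, start, length) callback shown on the previous slide.
    The text() hook is a hypothetical name.
    """

    def characters(self, ch, start, length):
        # Slice once here so subclasses never see offsets.
        self.text(ch[start:start + length])

    def text(self, chars):
        raise NotImplementedError


class Collector(StringCharacters):
    """Example subclass that simply remembers every string it is given."""

    def __init__(self):
        self.chunks = []

    def text(self, chars):
        self.chunks.append(chars)


collector = Collector()
# Simulate the parser pointing at four characters inside its buffer.
collector.characters("<title>Spam</title>", 7, 4)
print(collector.chunks)
```
]]></pre>
	 </points>
</slide>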

<slide>
	 <title>Event Dispatching</title> 
	 <points> 
		  <point>SAX requires you to dispatch your own element events:</point>
		  <pre>class MyHandler( SaxHandler ):
  def startElement( self, typename, attrs ):
    if typename=="html":
      handleHTML(attrs)
    elif typename=="title":
      handleTitle(attrs)
    ...</pre> 
		  <point>Large switch statements do not have the Python nature.</point>
</points>
</slide>
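
<slide>
	 <title>Sketch: Dispatch by Reflection</title> 
	 <points> 
		  <point>A sketch of replacing the if/elif chain with a getattr lookup,
				assuming the standard library's xml.sax module. The start_NAME
				naming convention and both class names are assumptions made for
				this example.</point>
		  <pre><![CDATA[
```python
import xml.sax


class DispatchingHandler(xml.sax.ContentHandler):
    """Routes each start tag to a method named start_NAME, if one exists."""

    def startElement(self, name, attrs):
        method = getattr(self, "start_" + name, None)
        if method is not None:
            method(attrs)


class MyHandler(DispatchingHandler):
    def __init__(self):
        super().__init__()
        self.seen = []

    def start_html(self, attrs):
        self.seen.append("html")

    def start_title(self, attrs):
        self.seen.append("title")


handler = MyHandler()
xml.sax.parseString(b"<html><head><title>Spam</title></head></html>", handler)
print(handler.seen)
```
]]></pre>
		  <point>Element names that are not valid Python identifiers (for
				example names containing hyphens or colons) would need a mapping
				step that this sketch leaves out.</point>
	 </points>
</slide>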

<slide>
<title>Context</title>
<points> 
	 <point>SAX requires the application programmer to take care of context.</point>
	 <pre>class MyHandler( SaxHandler ):
    def startElement( self, typename, attrs ):
        if typename=="html":
            handleHTML(attrs)
        elif typename=="title":
            self.titleMode=1
        ...
    def characters(self,chars,start,length):
        if self.titleMode:
            print(chars[start:start+length])</pre>
</points>
</slide>
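
<slide>
<title>Sketch: Tracking Context by Hand</title>
<points>
	 <point>A sketch of the bookkeeping a raw SAX application ends up writing:
		  a stack of open element names. It assumes the standard library's
		  xml.sax module; the class and attribute names are made up for the
		  example.</point>
	 <pre><![CDATA[
```python
import xml.sax


class ContextHandler(xml.sax.ContentHandler):
    """Keeps a stack of open element names so handlers can ask where they are."""

    def __init__(self):
        super().__init__()
        self.path = []      # names of the currently open elements
        self.titles = []    # (parent name, text) pairs taken from title elements

    def startElement(self, name, attrs):
        self.path.append(name)

    def endElement(self, name):
        self.path.pop()

    def characters(self, content):
        # Only keep text that sits directly inside a title element.
        # A parser may deliver long text in several chunks; this sketch
        # records each chunk separately.
        if self.path[-1:] == ["title"]:
            parent = self.path[-2] if len(self.path) > 1 else None
            self.titles.append((parent, content))


handler = ContextHandler()
xml.sax.parseString(
    b"<doc><section><title>Intro</title></section>"
    b"<figure><title>Parrot</title></figure></doc>",
    handler,
)
print(handler.titles)
```
]]></pre>
</points>
</slide>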

<slide>
<title>SAX and Namespaces</title>
<points>
<point>Namespaces are the ultimate koan.</point>
<point>Do we keep prefixes?</point>
<point>How do we keep them?</point>
<point>How do we do comparisons?</point>
<point>How do we keep this all efficient?</point>
</points>
</slide>
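
<slide>
<title>Sketch: Comparing Namespaced Names</title>
<points>
<point>One possible answer to the comparison question, sketched with the
	 standard library's xml.sax in namespace mode: compare (URI, local name)
	 pairs and ignore the prefix. The handler class and the count_tables helper
	 are hypothetical names.</point>
<pre><![CDATA[
```python
import io
import xml.sax
from xml.sax.handler import feature_namespaces

XHTML = "http://www.w3.org/1999/xhtml"


class NamespaceHandler(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.tables = 0

    def startElementNS(self, name, qname, attrs):
        uri, localname = name
        # Compare on (URI, local name); the author's choice of prefix is irrelevant.
        if (uri, localname) == (XHTML, "table"):
            self.tables += 1


def count_tables(document):
    """Count XHTML table elements in a document given as bytes."""
    parser = xml.sax.make_parser()
    parser.setFeature(feature_namespaces, True)
    handler = NamespaceHandler()
    parser.setContentHandler(handler)
    parser.parse(io.BytesIO(document))
    return handler.tables


prefixed = b'<h:html xmlns:h="http://www.w3.org/1999/xhtml"><h:table/></h:html>'
default = b'<html xmlns="http://www.w3.org/1999/xhtml"><table/><table/></html>'
plain = b"<html><table/></html>"
print(count_tables(prefixed), count_tables(default), count_tables(plain))
```
]]></pre>
</points>
</slide>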

<slide>
<title>Do SAX events suck?</title>
<points>
<point>No, they merely have not achieved enlightenment.</point>
<point>Through a series of reincarnations we can move them towards
	 enlightenment.</point>
<point>At the end, is what is left still SAX?</point>
<point>Ponder.</point>
</points>
</slide>

<slide>
<title>Does the DOM suck?</title>
<points>
<point>The DOM is useful...but yicky</point>
<point>Nevertheless, its major weakness is not a design flaw: 
	 <subpoints> 
		  <point>tree models are inherently weak at handling very large
				documents</point>
		  <point>this can be mitigated with an object database like ZODB</point>
		  <point>but you still need a lot of disk space</point>
	 </subpoints></point>
<point>A Pythonic SAX must have minimal memory requirements</point>
</points>
</slide>

</div>
<div>
<title>Towards a Pythonic SAX</title> 
 
<slide>
<title>First Principle</title>
<points>
<point>Do not reinvent the wheel.</point>
<point>Parsers can still "speak" SAX</point>
<point>Applications can use something more Pythonic</point>
<point>Raw SAX is still available for speed-critical M2M B2B XML EDI on WinCE
	 HPCs</point>
</points>
</slide>

<slide>
<title>Second Principle</title>
<points>
<point>Let's steal ideas wherever we can. 
	 <subpoints> 
		  <point>XSLT</point>
		  <point>DSSSL</point>
		  <point>DOM</point>
		  <point>Omnimark</point>
		  <point>Balise</point>
		  <point>...</point>
	 </subpoints></point>
</points>
</slide>

<slide>
<title>Stealing from the DOM</title>
<points>
<point>It takes humility to steal ideas from the DOM.</point>
<point>Therefore it is a productive exercise.</point>
<point>DOM 2 has a way of handling namespaces.</point>
<point>The DOM is really good at handling context: 
	 <subpoints> 
		  <point>node.parentNode</point>
		  <point>node.childNodes</point>
		  <point>node.childNodes[0]</point>
		  <point>node.attributes</point>
		  <point>node.getAttribute( "abc" )</point>
		  <point>node.parentNode.getAttribute( "abc" )</point>
	 </subpoints></point>
</points>
</slide>
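
<slide>
<title>Sketch: DOM-style Navigation</title>
<points>
<point>The navigation calls listed on the previous slide, tried out with the
	 standard library's xml.dom.minidom. The sample document and its attribute
	 values are invented for illustration.</point>
<pre><![CDATA[
```python
from xml.dom import minidom

doc = minidom.parseString('<section abc="1"><title type="main">Spam</title></section>')
section = doc.documentElement
title = section.childNodes[0]

# Walk up, down, and sideways from a single node.
print(title.parentNode.tagName)
print(title.getAttribute("type"))
print(title.parentNode.getAttribute("abc"))
print(title.attributes["type"].value)
print(title.childNodes[0].data)
```
]]></pre>
</points>
</slide>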

<slide>
<title>SAX, meet DOM</title>
<points>
<point>Instead of dispatching strings and integers, we can dispatch
	 nodes:</point>
<pre>def startElement( self, elementNode ):
    ...
def endElement( self, elementNode ):
    ...
def text( self, textNode ):
    ...
def processingInstruction( self, piNode ):
    ...
def comment( self, commentNode ):
    ...</pre>
<point>This gives us a way to navigate around.</point>
</points>
</slide>

<slide>
<title>Leveraging context</title>
<points>
<point>Given that we have context...let's flaunt it!</point>
<pre>def handle_spam( self, textNode ):
    "figure/title/text()"
    print("Figure title:"+repr(textNode))

def handle_dead_parrot( self, textNode ):
    "section/title/text()"
    print("Section title:"+repr(textNode))
    print(textNode.parentNode.attributes["type"])</pre>
</points>
</slide>

<slide>
<title>Let's not get crazy</title>
<points>
<point>There are some rules...</point>
<point>Not all of the DOM is available (see next slide)</point>
<point>Handlers must be named handle_something</point>
<point>"something" is a symbolic label, not an element type name</point>
<point>Particular nodes are matched against the XPath</point>
</points>
</slide>
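
<slide>
<title>Sketch: Docstrings as Patterns</title>
<points>
<point>A much-simplified sketch of how handle_ methods could be discovered by
	 reflection and matched by their docstrings. It is not the EasySAX
	 implementation and it does not use a real XPath engine: here a pattern is
	 assumed to be a slash-separated list of element names ending in text(),
	 matched against the tail of the current element path, and handlers receive
	 a plain string rather than a node. All class names are hypothetical.</point>
<pre><![CDATA[
```python
import xml.sax


class PatternHandler(xml.sax.ContentHandler):
    """Dispatches text to handle_* methods whose docstring matches the context."""

    def __init__(self):
        super().__init__()
        self.path = []
        self.rules = []
        # dir() is sorted, so rules are tried in alphabetical order of method name.
        for name in dir(self):
            if name.startswith("handle_"):
                method = getattr(self, name)
                steps = (method.__doc__ or "").strip().split("/")
                if len(steps) > 1 and steps[-1] == "text()":
                    self.rules.append((steps[:-1], method))

    def startElement(self, name, attrs):
        self.path.append(name)

    def endElement(self, name):
        self.path.pop()

    def characters(self, content):
        # First rule whose element names match the end of the path wins.
        for steps, method in self.rules:
            if self.path[-len(steps):] == steps:
                method(content)
                break


class TitleHandler(PatternHandler):
    def __init__(self):
        super().__init__()
        self.out = []

    def handle_spam(self, text):
        "figure/title/text()"
        self.out.append("Figure title: " + text)

    def handle_dead_parrot(self, text):
        "section/title/text()"
        self.out.append("Section title: " + text)


handler = TitleHandler()
xml.sax.parseString(
    b"<doc><section><title>Intro</title>"
    b"<figure><title>Parrot</title></figure></section></doc>",
    handler,
)
print(handler.out)
```
]]></pre>
</points>
</slide>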

<slide>
<title>How much DOM can we afford?</title>
<points>
<point>The "right" amount of DOM varies from application to
application.</point>
<point>In processing techdocs it is really useful to be able to have a complete
DOM for (e.g.) tables and figures.</point>
<point>Parent context is almost always useful and relatively cheap.</point>
<point>Therefore: always remember parents.</point>
<point>Otherwise, only build subtrees for regions of the document.</point>
</points>
</slide>

<slide>
<title>Selective Domination</title>
<points>
<pre>def handle_applets(self,elementNode):
    "applet as tree"
    for node in elementNode.childNodes:
        print(node)

def handle_tables( self, elementNode ):
    "table as tree"
    # do something
    self.processChildren( elementNode )
    # do something else</pre>
</points>
</slide>

<slide>
<title>ProcessChildren</title>
<points>
<point>Recursively invoke handler on children</point>
<point>Like the DSSSL function of the same name</point>
<point>Like XSLT apply-templates</point>
<point>Like Omnimark %c</point>
<point>Coming soon...processMatchingChildren</point>
</points>
</slide>
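
<slide>
<title>Sketch: Recursing into Children</title>
<points>
<point>A sketch of the recursive idea behind processChildren, run over a tree
	 built with the standard library's xml.dom.minidom. The TreeWalker class,
	 its method names, and the rule that unhandled elements simply recurse are
	 assumptions made for this example.</point>
<pre><![CDATA[
```python
from xml.dom import minidom


class TreeWalker:
    """Hypothetical walker: process_children re-dispatches each child element."""

    def __init__(self):
        self.out = []

    def process(self, node):
        # Elements without a handler fall through to their children.
        handler = getattr(self, "handle_" + node.tagName, self.process_children)
        handler(node)

    def process_children(self, node):
        for child in node.childNodes:
            if child.nodeType == child.ELEMENT_NODE:
                self.process(child)

    def handle_table(self, node):
        self.out.append("table start")      # do something
        self.process_children(node)
        self.out.append("table end")        # do something else

    def handle_cell(self, node):
        self.out.append("cell: " + node.firstChild.data)


doc = minidom.parseString("<table><row><cell>spam</cell><cell>eggs</cell></row></table>")
walker = TreeWalker()
walker.process(doc.documentElement)
print(walker.out)
```
]]></pre>
</points>
</slide>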

<slide>
<title>Other DOM costs</title>
<points>
<point>The DOM is large and getting larger.</point>
<point>It is complicated and redundant.</point>
<point>Most parsers don't generate most node types.</point>
<point>Most apps are read-only</point>
<point>It probably would not pass the Guido test.</point>
<point>Let's just make a subset: "minidom".</point>
</points>
</slide>

<slide>
<title>Namespaces</title>
<points>
<point>Namespaces can be registered before you start parsing.</point>
<point>You can fiddle with the namespace list while parsing (but would
you?)</point>
<point>You use prefixes in XPaths, just as in XSLT:</point>
<pre>class MyHandler( EasySAXHandler ):
    def __init__(self):
        self.registerNamespace( "xhtml", 
             "http://www.microsoft.com" )

    def handle_tables( self, elementNode ):
        "xhtml:table as tree"
        # do something
        self.processChildren( elementNode )
        # do something else</pre>
</points>
</slide>

<slide>
<title>Garbage Collection</title>
<points>
<point>Children know about their parents.</point>
<point>By default, parents do NOT know about their children.</point>
<point>When you build a tree, the parents do know about children.</point>
<point>The references from parents to children are destroyed when handler
completes.</point>
<point>"Weak references" would help here.</point>
</points>
</slide>
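
<slide>
<title>Sketch: Weak Parent References</title>
<points>
<point>One way weak references might help, sketched with the standard
	 library's weakref module: keep strong references from parent to child and
	 only a weak reference from child to parent, so no reference cycle forms.
	 The Node class is hypothetical, and which direction to weaken is a design
	 choice this sketch simply assumes.</point>
<pre><![CDATA[
```python
import gc
import weakref


class Node:
    """Hypothetical tree node: strong references down, weak reference up."""

    def __init__(self, name, parent=None):
        self.name = name
        self.children = []
        self._parent = weakref.ref(parent) if parent is not None else None
        if parent is not None:
            parent.children.append(self)

    @property
    def parentNode(self):
        # Returns None once the parent has been garbage collected.
        return self._parent() if self._parent is not None else None


table = Node("table")
row = Node("row", table)
print(row.parentNode.name)

# Drop the last strong reference to the root. With no parent/child cycle,
# nothing keeps it alive; gc.collect() covers non-refcounting interpreters.
del table
gc.collect()
print(row.parentNode)
```
]]></pre>
<point>The trade-off is that holding only a child no longer keeps its
	 ancestors alive.</point>
</points>
</slide>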

<slide>
<title>Credit where Due</title>
<points>
<point>I wrote "minidom"</point>
<point>James Clark wrote Expat</point>
<point><![CDATA[Dr. Dieter Maurer <dieter@handshake.de> ]]> wrote the biggest
component: the <a
href="http://www.dieter.handshake.de/pyprojects/pyxpath.html">XPath</a>
parser</point>
<point>Thanks to his good design, I could adapt it without any help from
him.</point>
<point><a href="http://www.xmetal.com">XMetaL</a> wrote these slides.</point>
</points>
</slide>

<slide>
<title>Todo...</title>
<points>
<point>Documentation, documentation, documentation.</point>
<point>Tree pruning?</point>
<point>User defined functions.</point>
<point>XSLT "modes"?</point>
<point>Other DOM facilities.</point>
<point>Python "libxml"?</point>
</points>
</slide>

</div> 
</slides>
</slideshow>
