<?xml version="1.0" encoding="UTF-8"?><!-- generator="wordpress.com" -->
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	>

<channel>
	<title>beautifulsoup &#171; WordPress.com Tag Feed</title>
	<link>http://en.wordpress.com/tag/beautifulsoup/</link>
	<description>Feed of posts on WordPress.com tagged "beautifulsoup"</description>
	<pubDate>Wed, 02 Dec 2009 04:39:11 +0000</pubDate>

	<generator>http://en.wordpress.com/tags/</generator>
	<language>en</language>

<item>
<title><![CDATA[Parsing DTCC Part 1: PITA ]]></title>
<link>http://financialpython.wordpress.com/2009/09/15/parsing-dtcc-part-1-pita/</link>
<pubDate>Tue, 15 Sep 2009 07:15:58 +0000</pubDate>
<dc:creator>DK</dc:creator>
<guid>http://financialpython.wordpress.com/2009/09/15/parsing-dtcc-part-1-pita/</guid>
<description><![CDATA[In a previous post, I complained about the DTCC&#8217;s CDS data website and the one week lifespan o]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p>In a <a href="http://financialpython.wordpress.com/2009/06/10/dtcc-cds-data/" target="_blank">previous post</a>, I complained about the DTCC&#8217;s CDS data website and the one week lifespan of the data published there. For those of you who don&#8217;t know, the <a href="http://www.dtcc.com/about/business/index.php">DTCC</a> clears and settles a massive number of transactions every day for multiple asset classes. It&#8217;s one of those financial institutions that doesn&#8217;t get much press but underpins the entire capital market.</p>
<p>Anyway, the recent crisis motivated the DTCC to publish weekly CDS (single name, index, and tranche) exposure data. A good idea, until one realizes the data goes up in smoke when the next week&#8217;s data arrives. Although DTCC recently added links to data for &#8220;a week ago&#8221;, &#8220;a month ago&#8221;, and &#8220;a year ago,&#8221; it&#8217;s still pretty inconvenient. So, if you want the data, you have to parse it yourself. I originally wanted to write a smart parser that would dynamically react to whatever format it encountered&#8230;I came to my senses and adopted a simpler approach.</p>
<p>The approach thus far:</p>
<ul>
<li><strong>Download the raw html pages/files via &#8220;<a href="http://developer.apple.com/mac/library/documentation/Darwin/Reference/ManPages/man1/curl.1.html">curl</a>.&#8221;</strong> <a href="http://docs.python.org/library/urllib2.html">Urllib2</a> is the preferred method to pull web pages, but I didn&#8217;t have the patience to figure out how to handle redirects. Curl is a utility included with OS X that, for whatever reason, ignores redirects automatically. As such, I created a short python script to download the html for all the tables of interest weekly.</li>
<li><strong>Use <a href="http://www.crummy.com/software/BeautifulSoup/">BeautifulSoup</a> to parse the html</strong>. Other libraries, such as <a href="http://code.google.com/p/html5lib/">html5lib</a> and <a href="http://codespeak.net/lxml/">lxml</a>, seem to be gaining ground on BeautifulSoup, particularly as its author wants to get out of the parsing game altogether. Nevertheless, I couldn&#8217;t be bothered to figure out the Unicode issues I experienced with html5lib or lxml&#8217;s logic. BeautifulSoup is straightforward and &#8220;gives you unicode, dammit!&#8221; (quoting the author).</li>
<li><strong>Use <a href="http://numpy.scipy.org/">numpy</a> for easier data manipulation</strong>. Since my html, css, DOM, etc. knowledge is basic, I thought it might be better to use numpy to manipulate the table data rather than rely solely on the parser. This meant vectorizing the html data into a 1D array, cleaning it up, and generally preparing it for future reshaping. Numpy, how did I ever live without you?</li>
</ul>
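<p>The cleanup step in the last bullet can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the script from the post: it uses plain Python lists instead of numpy so that it runs without extra installs, and <code>raw_cells</code>, <code>clean_cells</code>, and the sample values are all hypothetical.</p>

```python
# Hypothetical cell strings, as they might come out of a parsed table.
raw_cells = [" Corporates ", "1,234,567", "\u00a0", "Sovereigns", "89,000", ""]


def clean_cells(cells):
    """Strip whitespace, drop empty cells, and turn digit strings into ints."""
    cleaned = []
    for cell in cells:
        # Treat non-breaking spaces like ordinary padding.
        text = cell.replace("\u00a0", " ").strip()
        if not text:
            continue  # skip cells that only held padding
        digits = text.replace(",", "")
        cleaned.append(int(digits) if digits.isdigit() else text)
    return cleaned


flat = clean_cells(raw_cells)
print(flat)
```

<p>Keeping everything in one flat sequence first means that odd padding cells are dealt with in a single place, before any reshaping into rows.</p>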
<p>This would&#8217;ve been much easier if all the tables were exactly the same format. Unfortunately, that&#8217;s never the case. An extra cell here or there, or weird characters, can throw things off. This isn&#8217;t an issue if you are parsing individual pieces of data or a single table. But what if you need to parse ten, 20, 100, etc. tables? It can get ugly fast. The DTCC data is broken into 23 pages, some of which have multiple tables. Luckily, most of my pain was self-inflicted (hey, I&#8217;m a parsing virgin). I only had to account for a few different table formats in the end.</p>
<p>One downside to my approach is that I do not dynamically produce headers for the data I&#8217;m pulling. I plan to set the headers manually for each table (the ultimate destination for the data right now is a set of CSV files). If there&#8217;s a better way, please let me know.</p>
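<p>One way the manual-header idea could look is sketched here, assuming the cells have already been flattened and cleaned. The header names, the sample values, and the helpers <code>rows_from_flat</code> and <code>to_csv</code> are hypothetical; only the standard library <code>csv</code> and <code>io</code> modules are used.</p>

```python
import csv
import io

# Hypothetical headers, set by hand for one table format.
HEADERS = ["category", "gross_notional", "contracts"]


def rows_from_flat(flat, width):
    """Split a flat list of cells into rows of `width` cells each."""
    if len(flat) % width != 0:
        # An extra or missing cell would silently shift every later row.
        raise ValueError("cell count %d is not a multiple of %d" % (len(flat), width))
    return [flat[i:i + width] for i in range(0, len(flat), width)]


def to_csv(flat, headers):
    """Return CSV text: the manual header row, then the data rows."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(headers)
    writer.writerows(rows_from_flat(flat, len(headers)))
    return buf.getvalue()


flat = ["Corporates", 1234567, 210, "Sovereigns", 89000, 15]
print(to_csv(flat, HEADERS))
```

<p>Failing loudly when the cell count does not divide evenly is a deliberate choice in this sketch: a stray cell is reported instead of producing misaligned rows.</p>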
<p>You can <a href="http://financialpython.pastebin.com/f4efd8930">find the code here</a> via pastebin (feedback is welcome).<br />
You can find the DTCC tables <a href="http://www.dtcc.com/products/derivserv/data/index.php">here</a> (if you want to view the html source).</p>
<p>Part 2 will cover the process of reformatting the data with numpy and perhaps feature some charts. I&#8217;m very curious to see what the numbers show!</p>
<p>Here are a few screenshots of a terminal session using the code so far:</p>
<p><a href="http://posterous.com/getfile/files.posterous.com/notestoself/nIwohfIpdFs3b5Vo5vCzrBV9C9vYKk7EszVQqXpctDofQg2FbeGAoEHzZxZf/Picture_1.png"><img src="http://posterous.com/getfile/files.posterous.com/notestoself/S1CqZFF6m50EA4QgcMXMeEBDqQa55RDzePALPkt3a14w5HjZ7x6fZZeoayP7/Picture_1.png.scaled.500.jpg" alt="" width="500" /></a><a href="http://posterous.com/getfile/files.posterous.com/notestoself/eAYswPXNlBLz3nP9JbF07MNqBy9umxnsToVgjDKIYoU3QpegsNd1g5v3PVFT/Picture_2.png"><img src="http://posterous.com/getfile/files.posterous.com/notestoself/IWnNTofDmi7jJaM1GjVKuiqpmXTHDs11JOqFtLWPPYuxcwuT7zNikqvEb9zP/Picture_2.png.scaled.500.jpg" alt="" width="500" /></a><a href="http://posterous.com/getfile/files.posterous.com/notestoself/TchhVDu7YsNaXryAdwbNk3PBrt2cUhUQrgUYHisOgJQH5n81z5vjef2KBVKc/Picture_3.png"><img src="http://posterous.com/getfile/files.posterous.com/notestoself/LML6HletFCSneixJGWURkvXZUSvuJbODAV4yOHn4sH002ALYmVmR8TIZAFXy/Picture_3.png.scaled.500.jpg" alt="" width="500" /></a><a href="http://posterous.com/getfile/files.posterous.com/notestoself/6u5FoORnapm3CifpiblBiIulx8syKJR2leLKkSKOnxJLGokicOIfHypqhHdr/Picture_4.png"><img src="http://posterous.com/getfile/files.posterous.com/notestoself/7Ge023nhJTth4RNBkUsqJEaOFMruo436gdMabbKsDIC095EqJwqiaNVD2tYz/Picture_4.png.scaled.500.jpg" alt="" width="500" /></a><br />
<a href="http://notestoself.posterous.com/parsing-dtcc-part-1-pita">See and download the full gallery on posterous</a></p>
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[More on crawlers and spiders]]></title>
<link>http://fiorix.wordpress.com/2009/09/09/mais-sobre-crawlers-e-spiders/</link>
<pubDate>Wed, 09 Sep 2009 07:48:19 +0000</pubDate>
<dc:creator>alef</dc:creator>
<guid>http://fiorix.wordpress.com/2009/09/09/mais-sobre-crawlers-e-spiders/</guid>
<description><![CDATA[Last month I wrote an article with a program that captures every item on the first page of ]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p><img class="alignleft size-full wp-image-345" style="border:5px solid white;" title="logo_mercadolivre" src="http://fiorix.wordpress.com/files/2009/08/logo_mercadolivre.jpg" alt="logo_mercadolivre" width="62" height="62" />Last month I wrote an <a href="http://fiorix.wordpress.com/2009/08/19/twisted-crawler-alvo-mercadolivre/">article with a program that captures every item on the first page of each MercadoLivre category</a>.</p>
<p>There, I dealt with a few problems, such as:</p>
<ul>
<li>limiting concurrency when downloading pages</li>
<li>processing html in a thread, synchronously</li>
<li>keeping most of the process asynchronous, to save time and CPU</li>
</ul>
<p>After that, I needed to make some changes to the code and ended up modifying the program a little, using other techniques such as:</p>
<ul>
<li><a href="http://twistedmatrix.com/documents/8.2.0/api/twisted.internet.task.Cooperator.html">cooperative tasks to limit concurrency</a></li>
<li><a href="http://codespeak.net/lxml/">inline html processing, which is faster</a> (since it uses a C backend)</li>
<li><a href="http://twistedmatrix.com/documents/8.1.0/api/twisted.internet.defer.html#inlineCallbacks">chaining some functions, running them inline</a></li>
</ul>
<p>Each item in this list corresponds to the item in the same position in the list further up.</p>
<p>Twisted&#8217;s cooperation scheme is much better than the <em>Controller</em> I had created earlier. It is, however, much more complicated for young apprentices. I recommend <a href="http://jcalderone.livejournal.com/24285.html">this link</a> for more details.</p>
<p>As for the html processing, it is worth <a href="http://codespeak.net/lxml/lxmlhtml.html">taking a look at lxml</a>. Before, I had used <a href="http://www.crummy.com/software/BeautifulSoup/">BeautifulSoup</a>, which is very good but loses badly on performance and on support for <a href="http://www.microsoft.com/FRONTPAGE/">broken-html</a>.</p>
<p>Finally, the trick of using <a href="http://blog.mekk.waw.pl/archives/14-Twisted-inlineCallbacks-and-deferredGenerator.html"><em>generators</em> to run some callbacks inline</a> is incredible, and absurdly practical in cases like the program below.</p>
<p>The end result is the same, but the performance improvement is absurd. I ran some tests on my machine and got the following:</p>
<ul>
<li>this version uses, on average, 20% less CPU</li>
<li>since there is no need to write the files to disk, it does not use the disk</li>
<li>the whole process got 657% faster, <a href="http://www.youtube.com/watch?v=eKcHQPEYGcI">simply</a></li>
<li>also, the code is much smaller +_+</li>
</ul>
<p>Take a look:</p>
<pre class="brush: python;">
#!/usr/bin/env python
# coding: utf-8

from lxml import html
from twisted.web import client
from twisted.python import log
from twisted.internet import task, defer, reactor

class MercadoLivre:
    def __str__(self):
        return 'http://www.mercadolivre.com.br/jm/ml.allcategs.AllCategsServlet'

    def parse_categories(self, content):
        category = subcategory = ''
        doc = html.fromstring(content)
        for link in doc.iterlinks():
            el, attr, href, offset = link
            try: category = el.find_class('categ')[0].text_content()
            except: pass
            else: continue
            if category:
                try: subcategory = el.find_class('seglnk')[0].text_content()
                except: continue
                else: yield (href, category, subcategory)

    def parse_subcategory(self, content):
        doc = html.fromstring(content)
        for element in doc.find_class('col_titulo'):
            yield element[0].text_content()

class Engine:
    def finish(self, result):
        reactor.stop()

    @defer.inlineCallbacks
    def fetch_categories(self, link, parser):
        try:
            doc = yield client.getPage(link)
            defer.returnValue(parser(doc))
        except Exception, e:
            print e

    def fetch_subcategory(self, links, parser, limit):
        coop = task.Cooperator()
        work = (client.getPage(link[0]).addCallback(parser).addCallback(self.page_items, *link) for link in links)
        result = defer.DeferredList([coop.coiterate(work) for x in xrange(limit)])
        result.addCallback(self.finish)
        result.addErrback(log.err)

    def page_items(self, items, href, category, subcategory):
        print 'Categoria: %s / %s' % (category.encode('utf-8'), subcategory.encode('utf-8'))
        for item in items: print ' -&gt; %s' % item.encode('utf-8')
        print ''

def main(limit, *parsers):
    e = Engine()
    for parser in parsers:
        links = e.fetch_categories(str(parser), parser.parse_categories)
        links.addCallback(e.fetch_subcategory, parser.parse_subcategory, limit)

if __name__ == '__main__':
    reactor.callWhenRunning(main, 150, MercadoLivre())
    reactor.run()
</pre>
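<p>The concurrency limit in <code>fetch_subcategory</code> comes from running <code>limit</code> iterations over one shared generator, so at most <code>limit</code> downloads are in flight at a time. The same pattern can be sketched with the standard library&#8217;s asyncio in place of Twisted. This is an analogue under stated assumptions, not the program above: <code>fetch</code> is a hypothetical stand-in that sleeps instead of downloading, and the example.com URLs are placeholders.</p>

```python
import asyncio


async def fetch(url, state):
    """Stand-in for a page download; records how many run at once."""
    state["active"] += 1
    state["peak"] = max(state["peak"], state["active"])
    await asyncio.sleep(0.01)  # pretend network delay
    state["active"] -= 1
    return "page for " + url


async def worker(jobs, results):
    # Every worker pulls from the same iterator, so each job runs exactly once.
    for job in jobs:
        results.append(await job)


async def crawl(urls, limit):
    state = {"active": 0, "peak": 0}
    # Lazy generator: a download is only created when a worker asks for it.
    jobs = (fetch(url, state) for url in urls)
    results = []
    await asyncio.gather(*(worker(jobs, results) for _ in range(limit)))
    return results, state["peak"]


urls = ["http://example.com/%d" % n for n in range(10)]
results, peak = asyncio.run(crawl(urls, limit=3))
print(len(results), peak)
```

<p>Because the generator is lazy, no extra bookkeeping is needed to enforce the limit: the number of workers is the limit.</p>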
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[List of useful Python libraries]]></title>
<link>http://manishtech.wordpress.com/2009/09/04/list-of-useful-python-libraries/</link>
<pubDate>Thu, 03 Sep 2009 18:46:37 +0000</pubDate>
<dc:creator>Manish</dc:creator>
<guid>http://manishtech.wordpress.com/2009/09/04/list-of-useful-python-libraries/</guid>
<description><![CDATA[If you are a .NET programmer, then you find Python a bit tough. Reason? Python does not include libr]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p>If you are a .NET programmer, you may find Python a bit tough. The reason? Python does not include a library for each and every operation possible in this world. You may have to look around for the necessary packages and download them before continuing with your development.</p>
<p>Python&#8217;s standard module list has a finite number of entries, as opposed to .NET (<em>I use .NET at my workplace</em>). This is an attempt to collect libraries outside the standard modules that you might badly need for your development. Many of them are extensions or wrapper packages for existing libraries.</p>
<h3>1) <a href="http://www.secdev.org/projects/scapy/">scapy</a></h3>
<p>This is a library for the TCP/IP stack that gives you full control over the lowest-level details of the packets that leave your computer. It supports many protocols, such as Ethernet, IP, ARP, ICMP, TCP, and UDP. You can create a custom TCP/IP packet and send it to any host. Typical uses are ARP ping and ICMP ping.</p>
<p><strong>Experience</strong>: <span style="color:#008000;">Tried. Works perfectly</span>. Haven&#8217;t stumbled across any bugs as of now.</p>
<h3>2) <a href="http://trac.optio.webfactional.com/wiki/soaplib">soaplib</a></h3>
<p>Used for creating lightweight web services. As the page says, it comes with a client and server built in and on-demand WSDL generation.</p>
<p><strong>Experience</strong>: <span style="color:#ff0000;">Haven&#8217;t tried</span>. Heard about its existence.</p>
<h3>3) <a href="http://mysql-python.sourceforge.net/MySQLdb.html">mysql</a></h3>
<p>Uh? Do I really have to say what this is? I hope everyone knows.</p>
<p>Documentation for <a href="http://mysql-python.sourceforge.net/MySQLdb.html#mysqldb">python-mysql</a></p>
<p><strong>Experience</strong>: <span style="color:#008000;">Obviously! Obviously!</span> I think I should remove this line.</p>
<h3>4) <a href="http://aubio.org/">aubio</a></h3>
<p>Quoting directly from its site &#8211; <em>&#8220;aubio is a library for audio labelling. Its features include segmenting a sound file before each of its attacks, performing pitch detection, tapping the beat and producing midi streams from live audio. The name aubio comes from &#8216;audio&#8217; with a typo&#8221;</em></p>
<p><strong>Experience</strong>: <span style="color:#ff0000;">None</span>. Presently in To-Do List.</p>
<h3>5) <a href="http://www.crummy.com/software/BeautifulSoup/">Beautiful Soup</a></h3>
<p>BeautifulSoup is an SGML parser which is highly robust and doesn&#8217;t die straight off even if you give it poorly formed data. To make it scream and die, all you have to do is give it something that isn&#8217;t SGML at all. It has a parser class named BeautifulSOAP which is used to parse SOAP messages (as the name implies). It even has a class named ICantBelieveItsBeautifulSoup. Sounds stupid? Who cares, as long as it does its work.</p>
<p><strong>Experience</strong>: <span style="color:#ff6600;">Tried</span> when I saw Anomit using it. Need more experience as I have lost touch as of now. Never tried BeautifulSOAP.</p>
<h3>6) <a href="http://xael.org/norman/python/pyclamav/">python-clamav</a></h3>
<p>It is pending in my To-Do list. Will start working as I get time.  Check a <a href="http://xael.org/norman/python/pyclamav/#usage">small tutorial</a></p>
<p><strong>Experience</strong>: <span style="color:#ff0000;">No!</span> Read the line above.</p>
<h3>7) <a href="http://www.dlitz.net/software/pycrypto/">python-crypto</a></h3>
<p>Presently in #1 position of To-Do list. Sounds just too promising. Hope it is as I thought it to be.</p>
<p>Check the <a href="http://www.dlitz.net/software/pycrypto/apidoc/">API</a> and its <a href="http://www.dlitz.net/software/pycrypto/doc/">general overview</a>.</p>
<p><strong>Experience</strong>: <span style="color:#ff0000;">No</span></p>
<h3>8) <a href="http://www.djangoproject.com/">django</a></h3>
<p>Now if you don&#8217;t know django &#8211; Go shoot yourself or <a href="http://en.wikipedia.org/wiki/Django_%28web_framework%29">read about it here</a> if you somehow survive.</p>
<p><em>[ As pointed out by Anomit, it is a framework rather than a library, but I have used &#8220;library&#8221; as a general name in the title ]</em></p>
<p><strong>Experience</strong>: <span style="color:#008000;">Obviously!</span></p>
<h3>9) <a href="http://newcenturycomputers.net/projects/gdmodule.html">gd</a></h3>
<p>I have used GD a lot in PHP, but hardly at all in Python. People tell me GD is simpler than ImageMagick (<em>which I have never used</em>). I hope to use this library if I ever need it.</p>
<p>If you ever need the documentation, head to <a href="http://newcenturycomputers.net/projects/gd-ref.html">this page</a>.</p>
<p><strong>Experience</strong>: <span style="color:#ff6600;">Not used in Python, but in PHP</span></p>
<h3>10) <a href="http://gmplib.org/">gmp</a></h3>
<p>GMP stands for GNU Multiple Precision, and <a href="http://gmpy.sourceforge.net/">gmpy</a> is a Python wrapper over it. You might not need it in Python, but if you are coming from a C background, this might be a familiar name.</p>
<p><strong>Experience</strong>: <span style="color:#ff6600;">Normal, not an expert</span></p>
<h3>11) <a href="http://jabberpy.sourceforge.net/">python-jabber</a></h3>
<p>Python-Jabber is a python module which implements jabber instant messaging protocol. Check out the <a href="http://jabberpy.sourceforge.net/docs/jabber.html">documentation</a> and a <a href="http://sudharsh.wordpress.com/2009/07/25/nai-sekar-your-not-so-friendly-translation-bot/">funny example</a> .</p>
<p><strong>Experience</strong>: <span style="color:#ff6600;">Little experience</span>. Not much. After all, it doesn&#8217;t look so tough, so I will sit down for a hacking session.</p>
<h3>12) <a href="http://python-irclib.sourceforge.net/">python-irclib</a></h3>
<p>I encountered this library when I was searching more on python-jabber library. This also falls in the category of real-time messaging. The problem I can see is that there is no documentation. How to proceed? Use dir() and <a href="http://docs.python.org/library/inspect.html">inspect</a> module extensively?</p>
<p><strong>Experience</strong>: <span style="color:#ff0000;">Kidding?</span> Please show me the documentation. I don&#8217;t have more time for hacks as I did with scapy.</p>
<p>&#8211;</p>
<p>Until now, I had kept this list for my own reference. Many more essential libraries are missing. If you have any more in mind, please mention them. I would be glad to add them.</p>
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[pyQuery]]></title>
<link>http://caiomoritz.com/2009/08/06/pyquery/</link>
<pubDate>Thu, 06 Aug 2009 21:25:45 +0000</pubDate>
<dc:creator>Caio Moritz Ronchi</dc:creator>
<guid>http://caiomoritz.com/2009/08/06/pyquery/</guid>
<description><![CDATA[The pyQuery library shows promise. Addicted to screen scraping as I am, I thought my problems ]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p>The <a href="http://pyquery.org/" target="_blank">pyQuery</a> library shows promise. Addicted to <em>screen scraping</em> as I am, I thought my problems were over when I downloaded version <a href="http://bitbucket.org/olauzanne/pyquery/get/0.3.1.zip" target="_blank">0.3.1</a> today. The idea of the library is to be for Python what jQuery is for JavaScript: an excellent helping hand for manipulating HTML (and XML).</p>
<p>Its current problem is that the part I care about most, <em>traversing</em>, has not been fully implemented (I need, for example, to jump from one element to the next). The project&#8217;s <a href="http://bitbucket.org/olauzanne/pyquery/changesets/" target="_blank">timeline</a> shows that development is sporadic. That is why I am trying to use a more traditional library, <a href="http://www.crummy.com/software/BeautifulSoup/" target="_blank">BeautifulSoup</a>. It has not been easy.</p>
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[First library for my CMS started!]]></title>
<link>http://metapep.wordpress.com/2009/04/01/first-library-for-my-cms-started/</link>
<pubDate>Wed, 01 Apr 2009 18:53:07 +0000</pubDate>
<dc:creator>pepijndevos</dc:creator>
<guid>http://metapep.wordpress.com/2009/04/01/first-library-for-my-cms-started/</guid>
<description><![CDATA[I finally started coding for my CMS. I&#8217;m not sure if I&#8217;m already up to the job, but I]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p>I finally started coding for my CMS. I&#8217;m not sure if I&#8217;m already up to the job, but I&#8217;ll find out soon enough. The first library is part of the template engine. I don&#8217;t want people to write crap HTML, so I wrote a library that can generate strict XHTML with Python code that is as similar to HTML as possible; see below for an example.</p>
<p>The big question remains what to do with user content: even a valid XHTML theme breaks if someone pastes their Word HTML into the editor. I plan on using <a href="http://www.crummy.com/software/BeautifulSoup/">BeautifulSoup</a> to parse the user code, strip off the &#8216;invalid&#8217; tags like &#60;i&#62;, &#60;u&#62;, &#60;iframe&#62;, etc&#8230; and feed it to my XHTMLLib. The even cooler part is that you could use your old HTML theme, feed it to my lib and fill/replace the content. With the latest version this became even easier because I added some sort of XPath support (absolute paths only, but with attribute selection).</p>
<p>I present to you, my first lib!</p>
<p>Link: <a href="http://www.box.net/shared/6x01eol1x5">http://www.box.net/shared/6x01eol1x5</a> Also available in the sidebar widget. You need xhtmlattr.py as well!</p>
<p>Example:</p>
<pre>page = XHTMLLib().template()  # generate a basic XHTML template
page.xpath('/html/head/title')['child'] = "Hello world"  # set the title

page.function_factory(['div', 'img', 'fieldset', 'input', 'ul', 'li'], '', '')  # register functions

lists = [li(class_='xtra', child="Home"), li(class_="test", child="Links"), li(child="Contact"), li(class_="test", child="About"), li(child="Sitemap")]  # create a list
lists.sort()  # sort elements

page.xpath('/html/body')['child'] = [  # add to the body
    div(id_='header', child='test &#38; more'),  # entities get converted
    img(src='image.jpg'),
    fieldset(class_='test', child=[
        input(type_='text', name='name'),
        input(type_='submit')
    ]),
    ul(child=lists)  # compose pages from different object sets
]

page.xpath('/html/body/img').__setitem__('onclick', 'alert("!@#$^, stop clicking!")', True)  # voodoo magic to get evil JS in; always use external JS!
page.xpath('/html/body/ul/li[class=test][4]')['style'] = 'color:red'  # some more 'advanced xpath' to get a specific list item

print(page.render())

result:

&#60;!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"&#62;
&#60;html&#62;
	&#60;head&#62;
		&#60;meta content="application/xhtml+xml;charset=utf-8" http-equiv="Content-Type"&#62;&#60;/meta&#62;
		&#60;title&#62;Hello world&#60;/title&#62;
	&#60;/head&#62;
	&#60;body&#62;
		&#60;div id="header"&#62;test &#38;amp; more&#60;/div&#62;
		&#60;img src="image.jpg" onclick="&#60;![CDATA[alert("!@#$^, stop clicking!")]]&#62;"&#62;&#60;/img&#62;
		&#60;fieldset class="test"&#62;
			&#60;input type="text" name="name"&#62;&#60;/input&#62;
			&#60;input type="submit"&#62;&#60;/input&#62;
		&#60;/fieldset&#62;
		&#60;ul&#62;
			&#60;li&#62;Contact&#60;/li&#62;
			&#60;li&#62;Sitemap&#60;/li&#62;
			&#60;li class="test"&#62;About&#60;/li&#62;
			&#60;li class="xtra"&#62;Home&#60;/li&#62;
			&#60;li style="color:red" class="test"&#62;Links&#60;/li&#62;
		&#60;/ul&#62;
	&#60;/body&#62;
&#60;/html&#62;</pre>
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[Is MiNiML Enough?]]></title>
<link>http://megamicrobase.wordpress.com/2009/02/25/is-miniml-enough/</link>
<pubDate>Wed, 25 Feb 2009 05:31:15 +0000</pubDate>
<dc:creator>willdampier</dc:creator>
<guid>http://megamicrobase.wordpress.com/2009/02/25/is-miniml-enough/</guid>
<description><![CDATA[Hi all, So I started parsing out some of the data from the GEO records for the automated prediction.]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p>Hi all,</p>
<p>So I started parsing out some of the data from the GEO records for the automated prediction.  NCBI has both made my job slightly easier and immeasurably harder at the same time.</p>
<p>First, the good: in 2006 NCBI adopted the <a href="http://www.ncbi.nlm.nih.gov/projects/geo/info/MIAME.html">MIAME</a> (Minimum Information About a Microarray Experiment) standard.  Shortly afterwards it released MiNiML, an XML-based representation of the MIAME information.  These XML files are available for every GSM record.  With only a few lines of Python code I can effortlessly extract the relevant fields from the data, although I have noticed that I need to use <a href="http://www.crummy.com/software/BeautifulSoup/">BeautifulSoup</a> to clean up the occasional missing closing tag.</p>
<p>Now, the bad: most of the fields in the MIAME are &#8220;comment&#8221; fields.  These are free-text explanations of the &#8220;Data Source&#8221;, &#8220;Data Description&#8221;, &#8220;Source Description&#8221;, etc.  I&#8217;m finding two main difficulties:</p>
<ol>
<li>When the information in the field references information from the reference paper.</li>
<li>When the experimental procedures reference things OUTSIDE the scope of a &#8220;disease&#8221;.</li>
</ol>
<p>I really don&#8217;t have a good automated solution for either of these problems.  Problem #1 would require a method to extract the relationship from the paper itself &#8230; my understanding is that this level of extraction is not possible with NLP.  Problem #2 is a little more expansive.  Some datasets study gene deletions, expression vectors, transformed cell-lines, etc.  I&#8217;m not even quite sure how to classify these into an ontology myself, so I have no idea how to get the computer to do it.</p>
<p>Ultimately I just need the computer to recognize when it has come across these sorts of records and ignore them.  However, this is a more difficult machine learning problem than it seems.  I&#8217;ll probably need to make a specific set of tags for &#8220;unclassifiable&#8221; sentences.</p>
<p>Any thoughts?</p>
<p>-Will</p>
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[Dependencies you can Depend on]]></title>
<link>http://megamicrobase.wordpress.com/2009/02/17/dependencies-you-can-depend-on/</link>
<pubDate>Tue, 17 Feb 2009 06:27:41 +0000</pubDate>
<dc:creator>willdampier</dc:creator>
<guid>http://megamicrobase.wordpress.com/2009/02/17/dependencies-you-can-depend-on/</guid>
<description><![CDATA[Hi all, Sadly nobody gets to program in a bubble.  Everyone has deadlines, legacy code, legacy compu]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p>Hi all,</p>
<p>Sadly nobody gets to program in a bubble.  Everyone has deadlines, legacy code, legacy computers, legacy programmers, etc.  We also don&#8217;t have unlimited time to &#8220;re-invent the wheel&#8221; at each opportunity.  Assuming you&#8217;re doing an open source project (much like this one), you can incorporate other people&#8217;s tools into your own code.</p>
<p>If I need to fix some automatically generated XML or HTML, instead of writing my own code, I can just import <a href="http://www.crummy.com/software/BeautifulSoup/">BeautifulSoup</a>.  If I need to render information into HTML or an e-mail, instead of writing my own code, I can just use <a href="http://jinja.pocoo.org/2/">jinja templates</a>.  If I need to make a GUI, instead of complaining about how people have forgotten (or never learned) how to use the command line, I can just use <a href="http://www.wxpython.org/">wxPython</a>.</p>
<p>I&#8217;m new to the development world.  I&#8217;m usually the only user of my code, and it only needs to run on my computer.  When I submit code for publication I do a cursory check to make sure it runs on another computer, but ultimately I leave it up to the readers to figure out the dependencies.  However, this project is different.  I need to ensure that the code runs on any computer with only a minor amount of fiddling &#8230; especially since this may be used by technically unsavvy users.</p>
<p>After plenty of googling and some half-brained ideas to use command-line calls to easy_install I found that setuptools allows a programmer to specify the dependencies.  Whenever I used easy_install I noticed that it would install dependencies, I just didn&#8217;t make the obvious leap that I could also define them in my own setup file.</p>
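<p>For illustration, a setup file that declares its dependencies might look roughly like the sketch below. The project name, the version, and the exact package names listed are placeholders I am assuming for the example; whether each package can actually be fetched this way depends on how it is published.</p>

```python
# setup.py -- a sketch; the project name and version are placeholders
import sys

SETUP_KWARGS = dict(
    name="example-project",      # hypothetical name
    version="0.1",               # hypothetical version
    py_modules=[],
    # setuptools (and easy_install) fetch these when the package is installed
    install_requires=[
        "BeautifulSoup",
        "Jinja2",
        "wxPython",
    ],
)

def main(argv):
    # Imported lazily so the dependency list can be inspected without setuptools
    from setuptools import setup
    setup(script_args=argv, **SETUP_KWARGS)

# Only hand over to setuptools when a command such as "install" is given
if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv[1:])
```

<p>A plain setup file would normally just call <code>setup(...)</code> at the top level; the small <code>main</code> wrapper here only keeps the dependency list importable on its own.</p>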
<p>In the next post I&#8217;ll discuss some of my trials and tribulations on ensuring that all of my code is packaged correctly.</p>
<p>-Will</p>
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[paulistão 2009]]></title>
<link>http://fiorix.wordpress.com/2009/01/27/paulistao-2009/</link>
<pubDate>Wed, 28 Jan 2009 01:52:46 +0000</pubDate>
<dc:creator>alef</dc:creator>
<guid>http://fiorix.wordpress.com/2009/01/27/paulistao-2009/</guid>
<description><![CDATA[Since my team is in first place, I took the opportunity to post another crawler built with BeautifulSoup,]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p><img class="alignleft size-medium wp-image-268" title="palmeiras" src="http://fiorix.wordpress.com/files/2009/01/palmeiras.png?w=300" alt="palmeiras" width="103" height="103" />Since my team is in first place, I took the opportunity to post another crawler built with <a href="http://www.crummy.com/software/BeautifulSoup/">BeautifulSoup</a>, this time using <a href="http://twistedmatrix.com/">twisted</a>.</p>
<p>The data source is <a href="http://globoesporte.globo.com/Esportes/Futebol/Classificacao/0,,ESP0-9839,00.html">Globo Esporte</a>, and because the HTML and CSS there are a bit of a mess, the code came out rather ugly. On top of that, I managed to make it even uglier so that it prints nicely in the terminal.</p>
<pre class="brush: python;">#!/usr/bin/env python
# coding: utf-8
# 20090127 AF - paulistão 2009, crawler

import re
from sys import stdout
from twisted.internet import reactor
from twisted.web.client import getPage
from BeautifulSoup import BeautifulSoup

crlft = re.compile(r'[\r\n\t]*')
URL = 'http://globoesporte.globo.com/Esportes/Futebol/Classificacao/0,,ESP0-9839,00.html'

def failure(err):
    print(err)
    reactor.stop()

def parser(contents):
    current = 1
    columns = 10

    f = lambda m: crlft.sub('', m.contents[0]+' ')
    soup = BeautifulSoup(contents, convertEntities='html')
    classes = ['borda borda-forte', 'borda semborda', ' borda-forte',
                ' semborda', 'time borda', 'borda', 'time ', '']

    stdout.write('% 18s' % 'TIME')
    for label in ['P', 'J', 'V', 'E', 'D', 'GP', 'GC', 'SG', '(%)']:
        stdout.write('% 5s' % label)
    stdout.write('\n')

    for item in soup.findAll('td', {'class':classes}):
        if item.find('span'):
            temp = '%02d. % 15s' % ((current / columns)+1,
                item.span.find('a') and f(item.span.a) or f(item.span))
            stdout.write(temp.encode('utf-8'))
        else:
            stdout.write('% 5s' % f(item))
            if not current % columns: stdout.write('\n')
        current += 1
    reactor.stop()

if __name__ == '__main__':
    deferred = getPage(URL)
    deferred.addCallback(parser)
    deferred.addErrback(failure)
    reactor.run()
</pre>
<p>And the result, in the terminal:</p>
<pre>$ python paulista.py
              TIME    P    J    V    E    D   GP   GC   SG  (%)
01.      Palmeiras    9    3    3    0    0    7    0    7  100
02.         Santos    6    2    2    0    0    4    1    3  100
03.    São Caetano    6    2    2    0    0    3    0    3  100
04.        Guarani    6    2    2    0    0    2    0    2  100
05.       Mirassol    4    2    1    1    0    5    3    2   66
06.      São Paulo    4    2    1    1    0    3    1    2   66
07.    Corinthians    4    2    1    1    0    3    2    1   66
08.    Ponte Preta    4    3    1    1    1    2    1    1   44
09.     Bragantino    3    2    1    0    1    4    3    1   50
10.       Paulista    3    2    1    0    1    2    2    0   50
11.    Santo André    3    3    1    0    2    1    2   -1   33
12.        Barueri    2    2    0    2    0    4    4    0   33
13.          Oeste    2    2    0    2    0    2    2    0   33
14.         Ituano    1    2    0    1    1    1    2   -1   16
15.    Botafogo-SP    1    2    0    1    1    4    6   -2   16
16.  Guaratinguetá    1    2    0    1    1    2    4   -2   16
17.        Marília    1    3    0    1    2    3    8   -5   11
18.       Noroeste    0    2    0    0    2    1    4   -3    0
19.     Portuguesa    0    2    0    0    2    0    3   -3    0
20.     Mogi Mirim    0    2    0    0    2    0    5   -5    0</pre>
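<p>The row-building part of the script can be tried without the network. The sketch below is a simplified Python 3 rewrite of that one step, not the crawler itself: <code>format_table</code> and the sample <code>cells</code> list are names and data assumed for the example, and it expects the table cells to arrive as one flat list, ten per team.</p>

```python
def format_table(cells, columns=10):
    """Group a flat list of cell strings into rows and right-align them."""
    lines = []
    for start in range(0, len(cells), columns):
        row = cells[start:start + columns]
        position = start // columns + 1
        # First cell is the team name, the rest are the numeric columns
        line = "%02d. %15s" % (position, row[0])
        line += "".join("%5s" % value for value in row[1:])
        lines.append(line)
    return lines

# Two rows taken from the output shown above
cells = ["Palmeiras", "9", "3", "3", "0", "0", "7", "0", "7", "100",
         "Santos", "6", "2", "2", "0", "0", "4", "1", "3", "100"]
table = format_table(cells)
```

<p>Slicing the list in steps of <code>columns</code> replaces the running <code>current</code> counter used in the script, which makes the position arithmetic a little easier to follow.</p>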
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[urban dictionary crawler]]></title>
<link>http://fiorix.wordpress.com/2009/01/08/urban-dictionary-crawler/</link>
<pubDate>Thu, 08 Jan 2009 20:39:53 +0000</pubDate>
<dc:creator>alef</dc:creator>
<guid>http://fiorix.wordpress.com/2009/01/08/urban-dictionary-crawler/</guid>
<description><![CDATA[Some time ago, Urban Dictionary had an API so that other systems and sites could run s]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p><img class="alignleft size-medium wp-image-236" style="margin:5px;" title="udlogo2" src="http://fiorix.wordpress.com/files/2009/01/udlogo2.jpg?w=213" alt="udlogo2" width="213" height="300" />Some time ago, <a href="http://www.urbandictionary.com/">Urban Dictionary</a> had an API so that other systems and sites could search its terms, which are a lot of fun.</p>
<p>Today the link to the API no longer works, <a href="http://www.urbandictionary.com/define.php?term=api.php">and it has become an entry</a> in the dictionary. Searching on Google, I found <a href="http://www.programmableweb.com/api/urbandictionary">a link</a> that mentions an API accessible via SOAP with WSDL, but that is not much use either because, <a href="http://www.urbandictionary.com/blog.php?page=5">according to the site&#8217;s administrator</a> (September 15), the form that issues access keys is broken.</p>
<p>In other words: there is no longer an API for Urban Dictionary. Although the SOAP API still works, new keys cannot be issued, so no new users can get access to it. I am not the only one who wants this: <a href="http://urbandictionary.uservoice.com/pages/general/suggestions/15675">another guy</a> would also like to access Urban Dictionary through a simpler API.</p>
<p>Until they respond, I wrote a crawler using <a href="http://www.crummy.com/software/BeautifulSoup/">BeautifulSoup</a>, which solves the problem in a somewhat crude way, but at least it is simple and it works.</p>
<pre class="brush: python;">
#!/usr/bin/env python
# coding: utf-8
# 20080108 AF

import urllib
from BeautifulSoup import BeautifulSoup

def urbandict(search, limit=5, cutafter=256):
    query = urllib.urlencode(dict(term=search))
    url = 'http://www.urbandictionary.com/define.php?' + query
    response = urllib.urlopen(url)
    soup = BeautifulSoup(response.read(), convertEntities='html')

    fix = lambda item: item.replace('\r', '').replace('\n', '')
    cut = lambda item: len(item) &gt; cutafter and item[:cutafter] + '(...)' or item
    extract = lambda item: isinstance(item, unicode) and item or \
        (item.name == 'a' and item.contents[0] or '')

    for item, count in zip(soup.findAll('div', attrs={'class':'definition'}), range(limit)):
        yield cut(' '.join([fix(extract(k)) for k in item.contents])).encode('utf-8')

for term in urbandict('stupid'):
    print term + '\n'
</pre>
<p>The default limit on the number of results returned is 5, although all of them are fetched. Thanks to the way the site is organized, this is easy to do because the results are grouped in DIVs whose class is &#8220;definition&#8221;.</p>
<p>In addition, the following functions change how the results are presented:</p>
<ul>
<li>fix: removes CR and LF characters, so that each result is a single line</li>
<li>cut: truncates each result after <em>cutafter</em> characters and appends (...)</li>
</ul>
<p>This makes it easy to embed Urban Dictionary results anywhere: a website, a desktop program, etc.</p>
<blockquote><p>$ python ud.py<br />
Someone who has to look up &#8220;stupid&#8221; in the dictionary because they don&#8217;t know what it means.</p>
<p>1) George W. Bush.  2) Karl Rove  3) Dick Cheney  4) You get the idea!</p></blockquote>
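<p>The two helpers and the result limit can be tried on their own, without fetching anything. This is a sketch in Python 3 that restates the lambdas as named functions; the <code>first</code> helper is a name assumed for the example and stands in for the <code>zip(..., range(limit))</code> trick in the script.</p>

```python
from itertools import islice

def fix(text):
    """Drop carriage returns and line feeds so the result fits on one line."""
    return text.replace("\r", "").replace("\n", "")

def cut(text, cutafter=256):
    """Truncate text after cutafter characters and mark the cut with (...)."""
    if len(text) > cutafter:
        return text[:cutafter] + "(...)"
    return text

def first(items, limit=5):
    """Return at most limit items from any iterable."""
    return list(islice(items, limit))

one_line = fix("a definition\r\nspread over lines")
shortened = cut("a rather long definition", cutafter=8)
limited = first(["d1", "d2", "d3", "d4", "d5", "d6", "d7"])
```

<p>Named functions are a little longer than the lambdas, but they are easier to test and to reuse when the results are embedded somewhere else.</p>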
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[BeautifulSoup or SGMLParser Bug]]></title>
<link>http://themindstorms.wordpress.com/2009/01/03/beautifulsoup-or-sgmlparser-bug/</link>
<pubDate>Sat, 03 Jan 2009 11:53:00 +0000</pubDate>
<dc:creator>Alex Popescu (aka the_mindstorm)</dc:creator>
<guid>http://themindstorms.wordpress.com/2009/01/03/beautifulsoup-or-sgmlparser-bug/</guid>
<description><![CDATA[If you are reading this, you already know what BeautifulSoup is and how useful it is while working w]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p>
If you are reading this, you already know what <a rel="external" href="http://www.crummy.com/software/BeautifulSoup/" title="BeautifulSoup">BeautifulSoup</a> is and how useful it is while working with XML/HTML in Python (in case you are not familiar with it, I&#8217;d encourage you to read its documentation). So I&#8217;ll just skip to the main reason for this post: <strong>a bug in parsing the &#60;script&#62; tags in HTML documents</strong>.
</p>
<p>
<img src="http://themindstorms.files.wordpress.com/2009/01/101.jpg?w=150&#038;h=150" alt="10.1.jpg" border="0" width="150" height="150" align="right" /><br />
According to the documentation, <strong>BeautifulSoup</strong> knows how to handle the body of a &#60;script&#62; tag, meaning that it knows to treat its content as a pure string and not perform any additional parsing on it. Unfortunately, I&#8217;ve discovered a corner case where it behaves incorrectly.
</p>
<p>Here is the sample HTML that will reveal the bug:</p>
<pre>
&#60;html&#62;
&#60;head&#62;&#60;/head&#62;
&#60;body&#62;
  &#60;script type='text/javascript'&#62;
    document.write('&#60;/script&#62;');
    document.write('&#60;div&#62;&#60;/div&#62;');
  &#60;/script&#62;
&#60;/body&#62;
&#60;/html&#62;
</pre>
<p>
The problem is that the string &#8216;&#60;/script&#62;&#8217; tricks the parser into believing that the end of the &#60;script&#62; tag has been reached. So instead of producing a single <code>Tag</code> for the &#60;script&#62; element, parsing yields two elements: a <code>Tag</code> and a <code>NavigableString</code> that contains the rest of the &#60;script&#62; body (i.e. what comes after the &#8216;&#60;/script&#62;&#8217; string: <code>'); document.write('&#60;div&#62;&#60;/div&#62;');</code>).
</p>
<p>
This basically means that rewriting any HTML that contains a similar fragment will lead to broken &#60;script&#62;s. Unfortunately, I haven&#8217;t been able to figure out a solution. My impression is that this parsing happens at a very low level, which makes me think that the bug might not be in <strong>BeautifulSoup</strong> but rather in <strong>SGMLParser</strong>.
</p>
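<p>
The behaviour can be reproduced outside BeautifulSoup. The sketch below uses the <code>html.parser</code> module from the Python 3 standard library, which is an assumption on my side and not the parser used by version 3.0.7a; <code>ScriptCollector</code> and <code>script_bodies</code> are names made up for the example. As far as I can tell, HTML tokenizers in general end a &#60;script&#62; element at the first closing tag they see, so a commonly suggested workaround is to escape the slash inside the JavaScript string.
</p>

```python
from html.parser import HTMLParser

class ScriptCollector(HTMLParser):
    """Collect the text found inside each script element."""

    def __init__(self):
        super().__init__()
        self.in_script = False
        self.scripts = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True
            self.scripts.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        if self.in_script:
            self.scripts[-1] += data

def script_bodies(html):
    parser = ScriptCollector()
    parser.feed(html)
    parser.close()
    return parser.scripts

# Same shape as the sample above: the closing tag appears inside a JS string
BROKEN = "<script>document.write('</script>');document.write('<div></div>');</script>"
# Workaround: escape the slash so the text no longer looks like a closing tag
SAFE = r"<script>document.write('<\/script>');document.write('<div><\/div>');</script>"

broken_bodies = script_bodies(BROKEN)
safe_bodies = script_bodies(SAFE)
```

<p>
With the first input the collected script body stops right before the embedded closing tag, while the escaped variant is kept in one piece.
</p>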
<p><strong>The affected version is 3.0.7a</strong>. Meanwhile it looks like <a rel="external" href="http://www.crummy.com/software/BeautifulSoup/CHANGELOG.html" title="BeautifulSoup 3.1.0">a new release has seen the light</a>, but I haven&#8217;t tested it yet. The new <strong>BeautifulSoup 3.1.0</strong> has replaced the <strong>SGMLParser</strong> with <strong>HTMLParser</strong> (in an attempt to make <strong>BeautifulSoup</strong> compatible with Python 3.0), so this bug might already be fixed.
</p>
<p></p>
<p>
While we are on the subject of bugs, I&#8217;d also like to mention one in <strong>Python 2.5.2 on MacOS</strong>:
</p>
<pre>
MemoryError
Python(72261) malloc: *** mmap(size=2097152) failed (error code=12)
*** error: can't allocate region
*** set a breakpoint in malloc_error_break to debug
Exception exceptions.MemoryError: MemoryError() in  ignored
</pre>
<p>
Things are much simpler with this one, even if the displayed information doesn&#8217;t offer enough details. The above bug is basically the result of <strong>adding strings to a list in an infinite loop</strong> (so a programming problem, but with no indication of the error).
</p>
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[Processing XML with Python]]></title>
<link>http://codecocktail.wordpress.com/2008/06/29/python-xml-verarbeiten/</link>
<pubDate>Sun, 29 Jun 2008 08:31:18 +0000</pubDate>
<dc:creator>charlysan</dc:creator>
<guid>http://codecocktail.wordpress.com/2008/06/29/python-xml-verarbeiten/</guid>
<description><![CDATA[Python has several modules for processing XML documents: -sax -xml.dom.minidom -Beautifu]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p>Python has several modules for processing XML documents:</p>
<p>-sax</p>
<p>-xml.dom.minidom</p>
<p>-Beautifulsoup</p>
<p>SAX is very complex and just as complicated to use, and Beautifulsoup is mainly meant for processing ordinary HTML files. xml.dom.minidom, by contrast, is especially well suited to APIs because of its simplicity.</p>
<p><em>import urllib2<br />
import xml.dom.minidom<br />
</em></p>
<p><em>#  Load a page as an XML object<br />
</em></p>
<p><em>def getXml(url):<br />
&#160;&#160;&#160;&#160;url = url.replace(' ','%20')<br />
&#160;&#160;&#160;&#160;return xml.dom.minidom.parseString(urllib2.urlopen(url).read())<br />
</em></p>
<p><em># Load the page</em></p>
<p><em>doc = getXml('http://example.com/test.xml')</em></p>
<p><em>titles = doc.getElementsByTagName('title')</em></p>
<p><em># Access the content of a specific element</em></p>
<p><em>myContent = doc.getElementById('content').childNodes[0].data</em></p>
<p><em># Access the contents of every tag in a list</em></p>
<p><em>divs = doc.getElementsByTagName('div')</em></p>
<p><em>divContents = [div.childNodes[0].data for div in divs]</em></p>
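<p>The same calls can be tried offline on an inline document. This is a sketch in Python 3 under a few assumptions: the XML text is made up, and <code>text_of</code> and <code>find_by_id</code> are hypothetical helpers. As far as I can tell, minidom&#8217;s <code>getElementById</code> only finds attributes that a DTD declares as IDs, so the sketch scans for the <code>id</code> attribute instead.</p>

```python
import xml.dom.minidom

# Made-up document standing in for the downloaded XML
XML_TEXT = """<page>
  <title>Example</title>
  <div id="content">main text</div>
  <div id="footer">footer text</div>
</page>"""

def text_of(element):
    """Concatenate the text nodes directly under an element."""
    return "".join(node.data for node in element.childNodes
                   if node.nodeType == node.TEXT_NODE)

def find_by_id(doc, tag, wanted_id):
    """Hypothetical helper: scan elements of one tag for a matching id attribute."""
    for element in doc.getElementsByTagName(tag):
        if element.getAttribute("id") == wanted_id:
            return element
    return None

doc = xml.dom.minidom.parseString(XML_TEXT)
title = text_of(doc.getElementsByTagName("title")[0])
content = text_of(find_by_id(doc, "div", "content"))
div_contents = [text_of(div) for div in doc.getElementsByTagName("div")]
```

<p>Joining all text nodes instead of reading only <code>childNodes[0]</code> also avoids an IndexError when an element is empty.</p>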
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[This one converts a html table into a list of lists using BeautifulSoup]]></title>
<link>http://collincode.wordpress.com/2008/04/30/this-one-converts-a-html-table-into-a-list-of-lists-using-beautifulsoup/</link>
<pubDate>Wed, 30 Apr 2008 21:07:12 +0000</pubDate>
<dc:creator>Collin Anderson</dc:creator>
<guid>http://collincode.wordpress.com/2008/04/30/this-one-converts-a-html-table-into-a-list-of-lists-using-beautifulsoup/</guid>
<description><![CDATA[def removeextraspaces(string): while &#39; &#39; in string: string = string.replace(&#39; &#39;, ]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><div class="highlight">
<pre><span style="color:#007020;font-weight:bold;">def</span> <span style="color:#06287e;">removeextraspaces</span>(string):
    <span style="color:#007020;font-weight:bold;">while</span> <span style="color:#4070a0;">&#39;  &#39;</span> <span style="color:#007020;font-weight:bold;">in</span> string:
        string <span style="color:#666666;">=</span> string<span style="color:#666666;">.</span>replace(<span style="color:#4070a0;">&#39;  &#39;</span>, <span style="color:#4070a0;">&#39; &#39;</span>)
    <span style="color:#007020;font-weight:bold;">return</span> string<span style="color:#666666;">.</span>strip()

<span style="color:#007020;font-weight:bold;">def</span> <span style="color:#06287e;">html2text</span>(node):
    <span style="color:#007020;font-weight:bold;">if</span> <span style="color:#007020;font-weight:bold;">not</span> <span style="color:#007020;">hasattr</span>(node, <span style="color:#4070a0;">&#39;contents&#39;</span>):
        <span style="color:#007020;font-weight:bold;">return</span> node<span style="color:#666666;">.</span>replace(<span style="color:#4070a0;">&#39;</span><span style="color:#4070a0;font-weight:bold;">\n</span><span style="color:#4070a0;">&#39;</span>, <span style="color:#4070a0;">&#39; &#39;</span>)<span style="color:#666666;">.</span>replace(<span style="color:#4070a0;">&#39;&#38;nbsp;&#39;</span>, <span style="color:#4070a0;">&#39; &#39;</span>)
    <span style="color:#007020;font-weight:bold;">if</span> node<span style="color:#666666;">.</span>isSelfClosing:
        <span style="color:#007020;font-weight:bold;">return</span> <span style="color:#4070a0;">&#39; &#39;</span>
    <span style="color:#007020;font-weight:bold;">return</span> <span style="color:#4070a0;">&#39;&#39;</span><span style="color:#666666;">.</span>join([html2text(x) <span style="color:#007020;font-weight:bold;">for</span> x <span style="color:#007020;font-weight:bold;">in</span> node<span style="color:#666666;">.</span>contents])

<span style="color:#007020;font-weight:bold;">def</span> <span style="color:#06287e;">content</span>(array):
    <span style="color:#007020;font-weight:bold;">return</span> [removeextraspaces(html2text(x)) <span style="color:#007020;font-weight:bold;">for</span> x <span style="color:#007020;font-weight:bold;">in</span> array]

<span style="color:#007020;font-weight:bold;">def</span> <span style="color:#06287e;">table2list</span>(table):
    <span style="color:#007020;font-weight:bold;">return</span> [content(row<span style="color:#666666;">.</span>findChildren(<span style="color:#4070a0;">&#39;td&#39;</span>)) <span style="color:#007020;font-weight:bold;">for</span> row <span style="color:#007020;font-weight:bold;">in</span> table<span style="color:#666666;">.</span>findChildren(<span style="color:#4070a0;">&#39;tr&#39;</span>)]
</pre>
</div>
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[This one scrapes the information from UMN Lookup]]></title>
<link>http://collincode.wordpress.com/2008/04/18/this-one-scrapes-the-information-from-umn-lookup/</link>
<pubDate>Fri, 18 Apr 2008 15:49:48 +0000</pubDate>
<dc:creator>Collin Anderson</dc:creator>
<guid>http://collincode.wordpress.com/2008/04/18/this-one-scrapes-the-information-from-umn-lookup/</guid>
<description><![CDATA[Try it out for yourself at utilitymill. import BeautifulSoup import urllib def html2text(node): if n]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p><a href="http://utilitymill.com/utility/umnlookup">Try it out for yourself at utilitymill.</a></p>
<div class="highlight">
<pre><span style="color:#007020;font-weight:bold;">import</span> <span style="color:#0e84b5;font-weight:bold;">BeautifulSoup</span>
<span style="color:#007020;font-weight:bold;">import</span> <span style="color:#0e84b5;font-weight:bold;">urllib</span>

<span style="color:#007020;font-weight:bold;">def</span> <span style="color:#06287e;">html2text</span>(node):
    <span style="color:#007020;font-weight:bold;">if</span> <span style="color:#007020;font-weight:bold;">not</span> <span style="color:#007020;">hasattr</span>(node, <span style="color:#4070a0;">&#39;contents&#39;</span>):
        <span style="color:#007020;font-weight:bold;">return</span> node<span style="color:#666666;">.</span>replace(<span style="color:#4070a0;">&#39;</span><span style="color:#4070a0;font-weight:bold;">\n</span><span style="color:#4070a0;">&#39;</span>, <span style="color:#4070a0;">&#39; &#39;</span>)
    <span style="color:#007020;font-weight:bold;">if</span> node<span style="color:#666666;">.</span>isSelfClosing:
        <span style="color:#007020;font-weight:bold;">return</span> <span style="color:#4070a0;">&#39; &#39;</span>
    <span style="color:#007020;font-weight:bold;">return</span> <span style="color:#4070a0;">&#39;&#39;</span><span style="color:#666666;">.</span>join([html2text(x) <span style="color:#007020;font-weight:bold;">for</span> x <span style="color:#007020;font-weight:bold;">in</span> node<span style="color:#666666;">.</span>contents])

<span style="color:#007020;font-weight:bold;">def</span> <span style="color:#06287e;">lookup</span>(username):
    html <span style="color:#666666;">=</span> urllib<span style="color:#666666;">.</span>urlopen(<span style="color:#4070a0;">&#34;http://umn.edu/lookup?UID=&#34;</span> <span style="color:#666666;">+</span> username)<span style="color:#666666;">.</span>read()
    soup <span style="color:#666666;">=</span> BeautifulSoup<span style="color:#666666;">.</span>BeautifulSoup(html)
    data <span style="color:#666666;">=</span> {}
    <span style="color:#007020;font-weight:bold;">for</span> heading <span style="color:#007020;font-weight:bold;">in</span> soup(<span style="color:#4070a0;">&#39;th&#39;</span>):
        key <span style="color:#666666;">=</span> heading<span style="color:#666666;">.</span>contents[<span style="color:#40a070;">0</span>][:<span style="color:#666666;">-</span><span style="color:#40a070;">1</span>]
        val <span style="color:#666666;">=</span> html2text(heading<span style="color:#666666;">.</span>findNext())
        data[key] <span style="color:#666666;">=</span> val
    <span style="color:#007020;font-weight:bold;">return</span> data
</pre>
</div>
</div>]]></content:encoded>
</item>

</channel>
</rss>
