<?xml version="1.0" encoding="UTF-8"?><!-- generator="wordpress.com" -->
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	>

<channel>
	<title>couchdb « WordPress.com Tag Feed</title>
	<link>http://en.wordpress.com/tag/couchdb/</link>
	<description>Feed of posts on WordPress.com tagged "couchdb"</description>
	<pubDate>Sun, 29 Nov 2009 22:21:15 +0000</pubDate>

	<generator>http://en.wordpress.com/tags/</generator>
	<language>en</language>

<item>
<title><![CDATA[NoSQL: what is it?]]></title>
<link>http://nonsql.wordpress.com/2009/11/28/nosql-non-sql-definition/</link>
<pubDate>Sat, 28 Nov 2009 01:14:42 +0000</pubDate>
<dc:creator>cpierret</dc:creator>
<guid>http://nonsql.wordpress.com/2009/11/28/nosql-non-sql-definition/</guid>
<description><![CDATA[NoSQL is the English term commonly used for the emerging databases that are general]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p><a title="NoSQL on Wikipedia" href="http://fr.wikipedia.org/wiki/NoSQL">NoSQL</a> is the English term commonly used for the emerging databases that are generally non-relational and therefore do not use the <a title="SQL language" href="http://fr.wikipedia.org/wiki/SQL">SQL language</a> to query the database.</p>
<p>Today, relational databases are used mainly by companies to store critical business information.</p>
<p>I spent several years writing <a title="SAP BusinessObjects WebIntelligence" href="http://www.sap.com/france/solutions/sapbusinessobjects/large/intelligenceplatform/bi/qra/web_intelligence/index.epx">software that makes this data easier to access for non-experts</a>: ordinary users with no paranormal gift for interpreting relational algebra and SQL. I was already doing NoSQL, but only for the users&#8230; Under the hood, there was SQL galore.</p>
<p>I had the opportunity to attend the open source conference <a title="OSCON 2009" href="http://en.oreilly.com/oscon2009">O&#8217;Reilly&#8217;s OSCON 2009</a>, and NoSQL databases were a very hot topic there.</p>
<p>Did you know that <a title="Voldemort, the evil wizard" href="http://fr.wikipedia.org/wiki/Voldemort">Voldemort</a> is not only a Harry Potter character? That <a title="Cassandra, mythology" href="http://fr.wikipedia.org/wiki/Cassandre_%28mythologie%29">Cassandra</a> did more than foretell the future in Greece? That <a title="Hadoop" href="http://hadoop.apache.org/">Hadoop</a> (with <a title="Hadoop HBase" href="http://wiki.apache.org/hadoop/Hbase">HBase</a>) is not the sound made by a badly digested fizzy drink? That <a title="Relax with CouchDB" href="http://www.couchdb-fr.net/">CouchDB</a> lets you relax about the schema?</p>
<p>In the beginning was the Word, in the form of a landmark research paper published by Google:</p>
<p><a title="Bigtable by Google" href="http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.101.9822"><strong>Bigtable: A distributed storage system for structured data (2006)</strong></a> by Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, Robert E. Gruber</p>
<p>This paper describes the non-relational (hence NoSQL) database that Google uses to store the data behind its search engine, Google Earth and Google Finance. The scale of the data stored is dizzying: we are talking about petabytes (that is a lot of gigabytes!).</p>
<p>That gave plenty of ideas to others, and a plethora of NoSQL databases were born in its wake. I have it from a reliable source (a Yahoo architect in California) that Yahoo!&#8217;s open source release of Hadoop and HBase was driven by the idea of letting everyone do what Google does with BigTable and Map/Reduce&#8230;</p>
<p>I have not yet mentioned the <a title="CAP Theorem" href="http://camelcase.blogspot.com/2007/08/cap-theorem.html">CAP theorem</a>, which states that of Consistency, Availability and Partition tolerance, a system can only provide two. It is a bit like the quantum theory of computing&#8230;</p>
<p>You can watch a <a title="Werner Vogels, Amazon CTO on consistency and availability" href="http://www.infoq.com/presentations/availability-consistency">video, in English, in which Amazon&#8217;s CTO explains what is at stake when using these new databases</a>.</p>
<p>If you have made it to the end of this article and do not have a headache yet, it is time to read the Bigtable paper <img src='http://s.wordpress.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[CouchDB Replication Monitor]]></title>
<link>http://knuthellan.com/2009/11/24/couchdb-replication-monitor/</link>
<pubDate>Tue, 24 Nov 2009 11:52:44 +0000</pubDate>
<dc:creator>knuthellan</dc:creator>
<guid>http://knuthellan.com/2009/11/24/couchdb-replication-monitor/</guid>
<description><![CDATA[CouchDB does replication, but replication needs to be set up after each server restart. This means y]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p>CouchDB does replication, but replication needs to be set up after each server restart. This means you need to ensure that replication is restarted whenever the daemon restarts CouchDB. I have never seen replication stop working without a restart, but I prefer being safe to being sorry about replication. To be perfectly honest, I do not trust that my replication initiation after a soft CouchDB restart works properly either so I prefer to monitor the replication and have a safety mechanism in place to restart replication if needed.</p>
<p>There are several ways to monitor replication. You could fetch the status page of all servers and restart replication on servers with an empty page, but that is a kind of brute force approach in my world. A better solution is to use the replication itself to monitor that it works. </p>
<p>Each server updates its own timestamp in CouchDB, and this is in turn replicated to the other servers. This gets us a bit of the way, but not all the way. The server you are checking might have received updates from all the other servers, but you don&#8217;t know if it has pushed anything out to them. To solve this, you can add information about the other servers to the local server&#8217;s document as well. This will give you a matrix of server replication status.</p>
<p>For each server, you will see the timestamp replicated from the server and a list of timestamps replicated to that server. The latter are often a generation older than the former. Cron can be used to update this data. The cronjob reads all the server timestamps and updates this server&#8217;s timestamp, followed by the list of the other servers&#8217; timestamps.</p>
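<p>As a rough sketch of that bookkeeping (separate from the actual cron script): each status document holds the server&#8217;s own timestamp plus the timestamps it has seen from its peers. The helper name <code>update_status</code>, the host names and the numbers are assumptions made up for illustration.</p>

```ruby
# Hypothetical helper: merge the peers' replicated timestamps into this
# server's status document. Pure Ruby, no CouchDB access.
def update_status(own_doc, peer_docs, hostname, now)
  updated = own_doc.merge('_id' => hostname, 'time' => now)
  seen = (own_doc['servers'] || {}).dup
  peer_docs.each do |doc|
    next if doc['_id'] == hostname   # skip our own document
    seen[doc['_id']] = doc['time']   # newest timestamp replicated from that peer
  end
  updated['servers'] = seen
  updated
end

# Made-up sample: the documents server "a" can read locally after replication.
peers = [
  { '_id' => 'a', 'time' => 100 },
  { '_id' => 'b', 'time' => 160, 'servers' => { 'a' => 100 } },
  { '_id' => 'c', 'time' => 155, 'servers' => { 'a' => 40 } }
]
doc_a = update_status({ '_id' => 'a', 'time' => 100 }, peers, 'a', 170)
```

<p>In this sample, the copy of c&#8217;s document still lists a timestamp of 40 for a, which is the kind of lag that hints a&#8217;s updates are not reaching c.</p>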
<p>A mapper to get a server id to server status out of the db:</p>
<pre><code>
map: function(doc) {
  emit(doc._id, doc);
}
</code></pre>
<p>Our monitoring database is called server_status. The design document containing the mapper is called collections, and the view is called server_list.</p>
<p>A Ruby database checker that can run from cron:</p>
<pre><code>
require 'rubygems'
require 'couchrest'
require 'json'
require 'open-uri'

STATUS_DB = 'http://localhost:5984/server_status'
COLLECTIONS = 'collections'
SERVER_LIST = 'server_list'

# Host name of this server, passed on the command line
hostname = ARGV[0]

status_db = CouchRest.database!(STATUS_DB)
status_view = "#{STATUS_DB}/_design/#{COLLECTIONS}/_view/#{SERVER_LIST}"

# Get the current information about this server if available
server_status = begin
  status_db.get(hostname)
rescue RestClient::ResourceNotFound
  {'_id' =&#62; hostname}
end

server_status['time'] = Time.new.to_i
# A new document has no peer map yet, so create it before writing into it
server_status['servers'] &#124;&#124;= {}
# Get the current times of the other servers and update this server's view of them
# (URI.open is the open-uri entry point on Ruby 2.5 and later)
JSON(URI.open(status_view).read)['rows'].map do &#124;row&#124;
  {'server' =&#62; row['id'], 'status' =&#62; row['value']}
end.each do &#124;status&#124;
  server_status['servers'][status['server']] = status['status']['time'] unless status['server'] == hostname
end
status_db.save_doc(server_status)
</code></pre>
<p>Now you need to determine when to trigger a replication restart. This can be handled in the watchdog cronjob. If the newest timestamp recorded for a server is older than a threshold, restart replication.</p>
<p>The final loop triggers a restart when the age exceeds a threshold. The init_replication method just posts a continuous replication trigger to the db:</p>
<pre><code>
# THRESHOLD is the maximum tolerated age in seconds; it must be defined before this loop runs
JSON(URI.open(status_view).read)['rows'].map do &#124;row&#124;
  {'server' =&#62; row['id'], 'status' =&#62; row['value']}
end.each do &#124;status&#124;
  init_replication(status['server']) if server_status['time'] - status['status']['time'] &#62; THRESHOLD
  server_status['servers'][status['server']] = status['status']['time'] unless status['server'] == hostname
end
</code></pre>
<p>A rudimentary init_replication method:</p>
<pre><code>
require 'net/http'

# Ask the local CouchDB to push each database to the given server
def init_replication(server)
  target = "http://#{server}:5984"
  databases = ['server_status']
  databases.each do &#124;db&#124;
    config = {
      'source' =&#62; db,
      'target' =&#62; "#{target}/#{db}",
      'continuous' =&#62; true
    }
    payload = JSON.generate(config)
    result = Net::HTTP.new('127.0.0.1', 5984).post('/_replicate', payload, {'content-type' =&#62; 'application/json'})
    # result.code is a String, so comparing it with the Integer 200 never matches;
    # test the response class instead, which accepts any 2xx status
    warn "replication to #{target}/#{db} failed with #{result.code}" unless result.is_a?(Net::HTTPSuccess)
  end
end
</code></pre>
<p>We have a monitoring view of replication ages in our system. It shows the matrix of timestamps as age in seconds rather than the actual timestamp since the age is the important metric.<br />
<a href="http://knuthellan.wordpress.com/files/2009/11/server_status.jpg"><img src="http://knuthellan.wordpress.com/files/2009/11/server_status.jpg?w=300" alt="Server Status" title="Server Status" width="300" height="104" class="alignnone size-medium wp-image-250" /></a></p>
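<p>A minimal sketch of how such an age matrix could be derived from the status documents. This is not the code behind the screenshot; the helper name <code>age_matrix</code>, the sample documents and the 60-second limit are assumptions for illustration.</p>

```ruby
# Hypothetical helper: turn status documents into a matrix of ages in seconds.
# ages[observer][origin] is how old origin's timestamp is, as seen on observer.
def age_matrix(status_docs, now)
  status_docs.each_with_object({}) do |doc, ages|
    row = { doc['_id'] => now - doc['time'] }
    (doc['servers'] || {}).each do |origin, time|
      row[origin] = now - time
    end
    ages[doc['_id']] = row
  end
end

# Made-up sample documents for two servers.
docs = [
  { '_id' => 'a', 'time' => 170, 'servers' => { 'b' => 160 } },
  { '_id' => 'b', 'time' => 160, 'servers' => { 'a' => 100 } }
]
matrix = age_matrix(docs, 180)

# Collect the (observer, origin) pairs whose age exceeds an assumed limit of 60 seconds.
stale = []
matrix.each do |observer, row|
  row.each { |origin, age| stale << [observer, origin] if age > 60 }
end
```

<p>With these sample numbers, the only stale pair is b&#8217;s view of a, which would point at replication from a to b.</p>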
<p>A bonus of this replication monitoring system is that we can access the status page from a mobile phone and get an accurate picture of the replication status. Replication doesn&#8217;t worry me now, but it did when we first set it up. Now it&#8217;s just a part of our general monitoring view.</p>
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[Podcast #002: We don't need no steenking editing!]]></title>
<link>http://strictlyprofessional.wordpress.com/2009/11/23/podcast-002-we-dont-need-no-steenking-editing/</link>
<pubDate>Mon, 23 Nov 2009 19:08:13 +0000</pubDate>
<dc:creator>Chas Emerick</dc:creator>
<guid>http://strictlyprofessional.wordpress.com/2009/11/23/podcast-002-we-dont-need-no-steenking-editing/</guid>
<description><![CDATA[A second podcast! So, I reckon we&#8217;re already in the 75th percentile of podcasts, just by doing]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p>A second podcast! So, I reckon we&#8217;re already in the 75th percentile of podcasts, just by doing more than one &#8212; 90% of life is showing up, etc etc.</p>
<p>Recorded on 11/19/2009, this episode features myself, Chris Miles, Joe Brandt, Michael McIntosh, and Michael Klatsky (see links to people&#8217;s sites, etc. in the sidebar).  We talked about a smattering of things related to &#8220;the cloud&#8221;, IT management, <a href="http://aws.amazon.com/">Amazon AWS</a>, <a href="http://rackspace.com">Rackspace</a>, <a href="http://couchdb.apache.org/">CouchDB</a>, <a href="http://code.google.com/p/redis/">Redis</a>, and other bits.  A couple of topical highlights (mostly in order):</p>
<ul>
<li>M. Klatsky was kind enough to bring some of his homemade chocolate coffee stout, which we all enjoyed quite a lot.</li>
<li>Chas described Snowtide&#8217;s in-progress move away from any shade of in-house hosting to using Amazon AWS.</li>
<li>We talked quite a bit about Amazon&#8217;s (relatively) new <a href="http://aws.amazon.com/rds/">Relational Database Service</a>, and how the real value of &#8220;the cloud&#8221; may not be the outsourcing of infrastructure management, but the value-added services like RDS that take a ton of busybody work out of the mix for companies for whom the related &#8220;bare metal&#8221; services are simply cost sinks.</li>
<li>There was a fairly extended discussion of the mechanics of managing cloud nodes, specifically Amazon EC2 instances, along with the security issues surrounding AWS&#8217; specific authentication mechanisms (e.g. keys, certs, etc). <strong>Update:</strong> Here&#8217;s a <a href="http://clouddevelopertips.blogspot.com/2009/08/how-to-keep-your-aws-credentials-on-ec2.html">good rundown of EC2 key management strategies</a>, focusing on the security ramifications of each.</li>
<li>Some talk about <a href="http://www.google.com/search?q=cloud+federation">cloud federation</a>, cloud service lock-in, and maybe how things like <a href="http://open.eucalyptus.com/">Eucalyptus</a> (an open-source reimplementation of (parts of?) AWS&#8217; cloud APIs) might be a way to mitigate the consequences of cloud vendor lock-in.</li>
<li>Man, what happened to the Sun and IBM clouds?</li>
<li>Everyone seems to agree that Rackspace is somewhat out of touch w.r.t. &#8220;the cloud&#8221;, and has fallen behind Amazon AWS technologically.</li>
<li>How do you prepare for system failure and disaster recovery in the cloud?  Whoa, it&#8217;s a lot easier to test recovery in the cloud than in other environments.  Also, mobile self-contained server rooms, and the pain of depending on unresponsive commercial backup software vendors.</li>
<li>M. Klatsky reveals the origin of his <em>mapu</em> handle for all to know!</li>
<li>Miles has been checking out Redis as a zero-admin message queue.</li>
<li>Chas likes CouchDB because of its pleasant programming model and rock-solid real-time multi-master replication (in contrast to MySQL replication, which is touchy at best).</li>
<li>We all hate the <em>nosql</em> moniker &#8212; catchy, but too easily interpreted as confrontational.</li>
<li>&#8220;Functional programming lends itself for deploying in an elastic, distributed infrastructure&#8230;&#8221; &#8211; Miles</li>
<li>Single sign on options, including Joe&#8217;s shop rolling their own (&#8220;Ouch!&#8221; &#8211; Chas), OAuth, OpenID, and <a href="http://shibboleth.internet2.edu/">Shibboleth</a>, etc.</li>
<li>Miles wrote an <a href="http://github.com/cmiles74/Parakeet">emacs twitter client</a> (called Parakeet) <a href="http://twitch.posterous.com/my-first-blog-post-9191">some time ago</a>.  (TweetDeck, watch out! <img src='http://s.wordpress.com/wp-includes/images/smilies/icon_wink.gif' alt=';-)' class='wp-smiley' /> )</li>
<li>We&#8217;re all REST fanboys, apparently.  Although, if you have a toolchain that deeply supports it, SOAP <em>can</em> be pleasant.</li>
<li>There appears to be a <a href="http://tech.groups.yahoo.com/group/rest-discuss/message/6735">debate</a> out there between &#8220;the purists&#8221; that consider REST to be only what was described in the <a href="http://www.ics.uci.edu/~fielding/pubs/dissertation/rest_arch_style.htm">original paper that coined the term</a>, and most of the rest of the world that uses the term &#8220;REST&#8221; to describe any protocol that is stateless and uses HTTP for transport (e.g. not necessarily including URLs for additional operations on a resource, etc).</li>
<li><span style="text-decoration:line-through;">Finally, this episode includes the first Strictly Professional easter egg. Maybe it was funnier in person.</span></li>
</ul>
<p>BTW, it turns out that I was totally off-base w.r.t. a couple of facets of CouchDB:</p>
<ol>
<li>We&#8217;ve never had to think about CouchDB&#8217;s versioning at all, so my description of how it works was simply wrong.  Sorry, folks.  Check out the <a href="http://books.couchdb.org/relax/reference/conflict-management">chapter of the CouchDB book on versioning and conflict management</a> for the real skinny.</li>
<li>Further, I had some wires crossed in my head when I said that CouchDB had an option for using a binary protocol like STOMP.  CouchDB only supports JSON over HTTP, with binary attachments encoded as you might otherwise do for an HTTP request.</li>
</ol>
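<p>To make the &#8220;JSON over HTTP&#8221; point concrete, here is a small sketch that builds, without sending, the two kinds of request involved. The database name, document id and attachment name are made up, and <code>REV</code> is a placeholder for the document&#8217;s current revision.</p>

```ruby
require 'json'
require 'net/http'

# A document is stored by PUTting a JSON body to /database/document_id.
doc = { 'type' => 'episode', 'number' => 2 }
put_doc = Net::HTTP::Put.new('/podcasts/episode-002', 'Content-Type' => 'application/json')
put_doc.body = JSON.generate(doc)

# An attachment is just another HTTP body sent with its own content type.
put_audio = Net::HTTP::Put.new('/podcasts/episode-002/audio.mp3?rev=REV', 'Content-Type' => 'audio/mpeg')
put_audio.body = 'raw mp3 bytes' # placeholder, not real audio data

# To actually send a request, assuming CouchDB listens on its default port:
# Net::HTTP.start('localhost', 5984) { |http| http.request(put_doc) }
```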
<p>I hope people continue to enjoy the material.  If you have any comments, or questions for us, feel free to leave them, and we&#8217;ll see about addressing them in future episodes.</p>
<p><span style='text-align:left;display:block;'><p><object type='application/x-shockwave-flash' data='http://wordpress.com/wp-content/plugins/audio-player/player.swf' width='290' height='24' id='audioplayer1'><param name='movie' value='http://wordpress.com/wp-content/plugins/audio-player/player.swf' /><param name='FlashVars' value='&amp;bg=0xf8f8f8&amp;leftbg=0xeeeeee&amp;lefticon=0x666666&amp;rightbg=0xcccccc&amp;rightbghover=0x999999&amp;righticon=0x666666&amp;righticonhover=0xffffff&amp;text=0x666666&amp;slider=0x666666&amp;track=0xFFFFFF&amp;border=0x666666&amp;loader=0x9FFFB8&amp;soundFile=http%3A%2F%2Fs3.amazonaws.com%2Fstrictly-professional%2Fsp-podcast-002.mp3' /><param name='quality' value='high' /><param name='menu' value='false' /><param name='bgcolor' value='#FFFFFF' /></object></p></span></p>
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[Who needs a Relational Database?]]></title>
<link>http://aaronweiker.com/2009/11/22/who-needs-a-relational-database/</link>
<pubDate>Mon, 23 Nov 2009 02:53:30 +0000</pubDate>
<dc:creator>Aaron Weiker</dc:creator>
<guid>http://aaronweiker.com/2009/11/22/who-needs-a-relational-database/</guid>
<description><![CDATA[Ever since working on IMM I have enjoyed the fact that I haven&#8217;t needed to design a database o]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p>Ever since working on IMM I have enjoyed the fact that I haven&#8217;t needed to design a database or write any data access code. Having started working on a small side project I was in a situation where I needed to persist data. My first reaction was to use a simple relational database so I started modeling my domain objects in SQL. I soon realized that I was violating <a href="http://en.wikipedia.org/wiki/DRY">DRY</a>. I thought I could try using <a href="http://nhforge.org/">NHibernate</a> but that would still require building a database model. I instead decided to go with <a href="http://www.microsoft.com/windowsazure/windowsazure/">Windows Azure Table Storage</a> but ended up having trouble getting it to work. Instead I started looking for something with a nice REST interface. I came across <a href="http://aws.amazon.com/simpledb/">Amazon SimpleDB</a> but decided against it, for now. Instead I decided to go with <a href="http://couchdb.apache.org/">CouchDB</a>. I chose it because I don&#8217;t need to worry about defining a schema and access is very fast.</p>
<p>Now, being completely honest here, I didn&#8217;t stay on CouchDB. I was hoping to hand off the project to some other people at Microsoft, so I ported it to <a href="http://msdn.microsoft.com/en-us/library/aa697427(VS.80).aspx">Entity Framework 1.0</a>. After doing this port I found that I had to destroy my domain model and constantly fight for any level of purity. After a while of fighting this I decided to try out <a href="http://nhforge.org/">NHibernate</a> after watching <a href="http://tekpub.com/preview/nhibernate">Ayende teach Rob Conery how to use it</a>. Midway through the first episode I had it working.</p>
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[Open Source DBMSes are ready to go?]]></title>
<link>http://kadenzercourant.wordpress.com/2009/11/15/open-source-dbmses-are-ready-to-go/</link>
<pubDate>Sun, 15 Nov 2009 21:34:41 +0000</pubDate>
<dc:creator>demakelaar</dc:creator>
<guid>http://kadenzercourant.wordpress.com/2009/11/15/open-source-dbmses-are-ready-to-go/</guid>
<description><![CDATA[It is party time in open source database land. More and more alternatives have been emerging lately. My]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p><img class="alignright size-thumbnail wp-image-69" title="open source" src="http://kadenzercourant.wordpress.com/files/2009/11/open-source1.png?w=150" alt="open source" width="150" height="129" />It is party time in open source database land. More and more alternatives have been emerging lately. MySQL, PostgreSQL and Firebird have been around for a while, but new stars are appearing in the firmament. To name just a few: <a href="http://www.monetdb.nl/">MonetDB </a>(the column-based open source database, a fine Dutch product), <a href="http://couchdb.apache.org/">CouchDB </a>(document-oriented) and <a href="http://luciddb.org/">LucidDB </a>(aimed specifically at DWH and BI).</p>
<p>The interesting thing is that the open source market is even becoming profitable; apparently there is more and more money to be made from it. At least, that is the conclusion I draw from the companies that offer professional support for open source databases. A few examples:</p>
<p>MySQL AB has been doing this trick for years. MySQL now belongs to Sun and Sun to Oracle, but a new company has been spotted that will also offer professional support: <a href="http://askmonty.org/wiki/index.php/About_Us">Monty</a>.</p>
<p>Ingres itself offers <a href="http://www.ingres.com/services/operational-services.php">24&#215;7</a> support for its own open source database (albeit formerly closed source).</p>
<p>And today I <a href="http://www.b-eye-network.com/blogs/adrian/archives/2009/11/dynamodb_is_the.php">read </a>that LucidDB also seems to be coming of age, now that DynamoDB is going to deliver a commercial version.</p>
<p>Are these isolated incidents, or does it look like becoming structural? Is anyone already using open source databases for DWHs?</p>
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[CouchDB and your life]]></title>
<link>http://jugglingbits.wordpress.com/2009/11/12/couchdb-and-your-life/</link>
<pubDate>Thu, 12 Nov 2009 22:39:26 +0000</pubDate>
<dc:creator>thomas11</dc:creator>
<guid>http://jugglingbits.wordpress.com/2009/11/12/couchdb-and-your-life/</guid>
<description><![CDATA[I just watched a moving talk, Damien Katz&#8217;s CouchDB and Me. It&#8217;s about Erlang and financ]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p>I just watched a moving talk, <a href="http://damienkatz.net/">Damien Katz</a>&#8217;s <a href="http://www.infoq.com/presentations/katz-couchdb-and-me">CouchDB and Me</a>. It&#8217;s about Erlang and financial security, family and distributed databases, being a good programmer and being unemployed, dreams and bureaucrats at IBM.  </p>
<p>At the Rubyfringe conference, Katz gave a personal account of how <a href="http://couchdb.apache.org/">CouchDB</a> came to be. Programming and some technical aspects naturally play a role, but he mainly talks about the personal decisions he had to make. Giving up your steady income and social position to write cool open source software &#8211; why would you do that? He did it and made it from feeling like an unemployed loser to being paid to work on his project, on his own terms, by IBM. He comes across as being really open about his experiences and his decisions, and I certainly took away a couple of interesting insights. Find out what interests you and just start doing it, learning what you need. It&#8217;s ok to be driven by the idea of having an interesting story to tell; after all, the story of your life will be what you did and experienced. These are just two out of many things that stuck.</p>
<p>I came across this talk reading a <a href="http://news.ycombinator.com/item?id=937430">discussion on HN</a>. Like Katz&#8217;s talk, it&#8217;s kinda about programming, but really about life. Life as a programmer, maybe; but maybe about everyone. It makes me happy to see these discussions. I&#8217;m into programming not because I want to live in a cold world of machines, but because computers challenge your mind, because you can create something with them, even on your own, and maybe even make a living out of it. It&#8217;s about you in the end, not about the machine.</p>
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[RJSONIO to process CouchDB output]]></title>
<link>http://contourline.wordpress.com/2009/11/11/rjsonio-to-process-couchdb-output/</link>
<pubDate>Wed, 11 Nov 2009 22:23:03 +0000</pubDate>
<dc:creator>jmarca</dc:creator>
<guid>http://contourline.wordpress.com/2009/11/11/rjsonio-to-process-couchdb-output/</guid>
<description><![CDATA[I have an idea.  I am going to process the 5 minute aggregates of raw detector data I&#8217;ve store]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p>I have an idea.  I am going to process the 5 minute aggregates of raw detector data I&#8217;ve stored in monthly CouchDB databases using R via RCurl and RJSONIO.  So, even though my data is split into months physically, I can use RCurl to pull from each of the databases, and then use RJSONIO to parse the JSON, then use bootstrap methods to estimate the expected value and confidence bounds, and perhaps more importantly, try to estimate outliers and unusual events.</p>
<p>Update: this works great, except that it reveals that my JSON structure in CouchDB isn&#8217;t so great.  The problem is that I&#8217;m dumping one JSON object per line.  For example:</p>
<pre><strong><code> ["1201044", "00:00:00", "Fri", "12"]:{N:8,O:0.001782, Pct:1, lanes: 5, intrvls: 10}</code></strong></pre>
<p>While that looks great on paper, and logically makes sense if you think about pulling a single record, it doesn&#8217;t work so well when you process lots of records.  While RJSONIO is pretty darn good, it certainly isn&#8217;t a mind reader, and it cannot turn a list of such objects into a matrix or data frame without some help.  If you just throw the results of the RCurl fetch at RJSONIO, you get the following:</p>
<pre><code>
&#62; demo=fromJSON(data)
&#62; demo$rows[1]
[[1]]
[[1]]$key
[1] "1202024"  "17:35:00" "Fri"      "12"

[[1]]$value
[[1]]$value$N
[1] 427

[[1]]$value$O
[1] 0.04861833

[[1]]$value$Pct
[1] 1

[[1]]$value$lanes
[1] 6

[[1]]$value$intrvls
[1] 10
</code></pre>
<p>In words, what that means is that the CouchDB response of <code>{rows:[...]}</code> is parsed as a labeled list by R, so the response is a list with one element, <code>rows</code>, which contains <code>n</code> elements, each with an element <code>key</code>, which is a character vector, and another element <code>value</code>, which itself is a list containing several named elements <code>N, O, Pct, lanes, intrvls</code>.  I couldn&#8217;t figure out a quick way to make R figure out that I wanted a <code>data.frame</code> with named entries for each of the key terms and each of the value terms (9 columns by n rows).  Many more gray hairs later, I remembered <code>unlist</code> and got stuff sorted.  Here is my suboptimal R script for the next time I take a long break from using R and can&#8217;t remember the syntax anymore.</p>
<pre><code>
#parameters: month,id,fivemin
id=1202024  ## randomly chosen
fivemin="17:35"
# get every month in parallel.  RCurl is cool that way
month=c("01","02","03","04","05","06","07","08","09","10","11","12")
couchdb = "http://localhost:5984/"
db = paste("d12_2007_",month,"morehash/_design/summary/_view/fivemin?",sep="")
moreurl = paste("group=true&#38;startkey=[\"",id,"\",\"",fivemin,":00\"]&#38;endkey=[\"",id,"\",\"",fivemin,":01\"]",sep="")
uri=paste(couchdb,db,moreurl,sep="");  ## 12 different URIs to fetch
data = getURL(uri)
## make a list to store data temporarily on the first pass
d1=list()
for(i in 1:length(data)){
  ## parse each month in turn
  jsondata = fromJSON(data[[i]])
  ## unlist flattens the R object
  d1[[i]]=unlist(jsondata$rows)
}
## make the list of flattened R objects into a matrix
## by unlisting again, and specifying that I'm expecting 9 columns
dmatrix = matrix(data=unlist(d1),ncol=9,byrow=TRUE)
## finally, make a dataframe explicitly labeling each column as needed and converting to numeric from text
d2= data.frame(id=dmatrix[,1],
                      tod=dmatrix[,2],
                      dow=dmatrix[,3],
                      dom=as.numeric(dmatrix[,4]),
                      N=as.numeric(dmatrix[,5]),
                      O=as.numeric(dmatrix[,6]),
                      pct=as.numeric(dmatrix[,7]),
                      lanes=as.numeric(dmatrix[,8]),
                      intervals=as.numeric(dmatrix[,9]))
</code>
</pre>
<p>Next up is the actual bootstrapping of interesting statistics.</p>
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[Tokyo Tyrant Throwing a Tantrum]]></title>
<link>http://contourline.wordpress.com/2009/11/02/tokyo-tyrant-throwing-a-tantrum/</link>
<pubDate>Tue, 03 Nov 2009 05:32:04 +0000</pubDate>
<dc:creator>jmarca</dc:creator>
<guid>http://contourline.wordpress.com/2009/11/02/tokyo-tyrant-throwing-a-tantrum/</guid>
<description><![CDATA[Well, last Friday I posted &#8220;So, slotting 4 months of data away.  I’ll check it again on Monday]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p>Well, last Friday I posted &#8220;So, slotting 4 months of data away.  I’ll check it again on Monday and see if it worked.&#8221;</p>
<p>It didn&#8217;t.  Actually I checked later that same day and all of my jobs had died due to recv errors.  I&#8217;ve tried lots of hacky things but nothing seems to do the trick.  From some Google searching, it seems that perhaps it is a timeout issue, but I can&#8217;t see how to modify the Perl library to allow for a longer timeout.</p>
<p>So, I wrote a little hackity hack thing to stop writing for 5 seconds, make a new connection, and go on writing.  Now it only crashes out of the loop if that new connector also fails to write.  And I also don&#8217;t crash until I save my place in the CSV file, so I don&#8217;t repeat myself.  So  I&#8217;m not getting a complete failure, but it is still super slow.</p>
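<p>A minimal sketch of that pause, reconnect and checkpoint idea, written in Ruby rather than the Perl I actually used. <code>FlakyStore</code> is a made-up stand-in for a Tokyo Tyrant connection, not the real client API, and <code>load_rows</code> is a hypothetical name.</p>

```ruby
# Stand-in for a connection that starts failing after a number of writes.
class FlakyStore
  def initialize(fail_on, rows)
    @fail_on = fail_on # which put (1-based) raises; 0 means never
    @rows = rows
    @puts = 0
  end

  def put(row)
    @puts += 1
    raise IOError, 'recv error' if @puts == @fail_on
    @rows << row
  end
end

# Write rows starting at index `start`. On a failed write: pause, open a new
# connection and retry once. Returns the index of the next unwritten row
# (the checkpoint), so a later run can resume without repeating rows.
def load_rows(rows, start, connect, pause: 5)
  store = connect.call
  index = start
  while index < rows.size
    begin
      store.put(rows[index])
    rescue IOError
      sleep pause
      store = connect.call
      begin
        store.put(rows[index])
      rescue IOError
        return index # the new connection failed too: stop, but keep our place
      end
    end
    index += 1
  end
  index
end

# Example run: the first connection breaks on its third write.
written = []
connections = [FlakyStore.new(3, written), FlakyStore.new(0, written)]
checkpoint = load_rows(%w[r1 r2 r3 r4], 0, -> { connections.shift }, pause: 0)
```

<p>Returning the checkpoint instead of raising is what lets the caller save its place in the CSV file before giving up.</p>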
<p>While the documentation for Tokyo Tyrant and Tokyo Cabinet is super great overall, it seems thin on use cases and examples for stuffing a lot of data into the table db at once.</p>
<p>An interesting, probably unrelated fact: the crashing only started when I recomputed my target bnum and boosted it from 8 million to 480 million.</p>
<p>Anyway, I had time today to tweak the data load script, and also to finalize my CouchDB loading script.  Having started two jobs each, and with Tokyo Tyrant started first, it looks like CouchDB is going to finish first.  (The January CouchDB job is completing three days of data for every one day in the Tokyo Tyrant job; the March jobs are closer together, but that Tyrant job started about an hour before everything else.)</p>
<p>I guess there is still a way for Tokyo Tyrant to win this race.  I am planning to set up a map/reduce type of view on my CouchDB datastore to collect hourly summaries of the data.  It might be that computing that view is slow, and that computing similar summaries on the Tokyo Cabinet table is faster.  We&#8217;ll see.</p>
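<p>For illustration, an hourly-summary view could look roughly like the sketch below. The document fields (<code>sensor</code>, <code>ts</code>, <code>volume</code>) are assumptions, since the post does not show the real document layout, and <code>emit</code> plus the grouping helper are local stand-ins so the sketch runs outside CouchDB.</p>

```javascript
// Local stand-in for the emit() that the CouchDB view server provides.
const emitted = [];
function emit(key, value) {
  emitted.push({ key, value });
}

// Map: one row per reading, keyed by [sensor, hour].
// Assumed document shape: { sensor: "s1", ts: "2009-01-05T08:15:00Z", volume: 12 }
function mapReading(doc) {
  if (!doc.sensor || !doc.ts || typeof doc.volume !== "number") return;
  emit([doc.sensor, doc.ts.slice(0, 13)], doc.volume); // key hour: "2009-01-05T08"
}

// Reduce: total per key. Summing partial sums is still a sum, so the same
// code also serves the rereduce pass.
function reduceSum(keys, values, rereduce) {
  return values.reduce((total, v) => total + v, 0);
}

// Tiny local simulation of querying the view with group=true.
function hourlyTotals(docs) {
  emitted.length = 0;
  docs.forEach(mapReading);
  const groups = new Map();
  for (const { key, value } of emitted) {
    const k = JSON.stringify(key);
    if (!groups.has(k)) groups.set(k, []);
    groups.get(k).push(value);
  }
  const totals = {};
  for (const [k, values] of groups) totals[k] = reduceSum(null, values, false);
  return totals;
}
```

<p>In CouchDB itself only the bodies of <code>mapReading</code> and <code>reduceSum</code> would go into a design document; whether building that index is fast enough is exactly the open question here.</p>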
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[Sincerial Launched]]></title>
<link>http://knuthellan.com/2009/11/02/sincerial-launched/</link>
<pubDate>Mon, 02 Nov 2009 14:25:56 +0000</pubDate>
<dc:creator>knuthellan</dc:creator>
<guid>http://knuthellan.com/2009/11/02/sincerial-launched/</guid>
<description><![CDATA[Since we launched Sincerial with one connected online store, fundies.no on Thursday, I feel it]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p>Since we launched <a href="http://www.sincerial.com/">Sincerial</a> with one connected online store, <a href="http://www.fundies.no/">fundies.no</a>, on Thursday, I feel it&#8217;s time to go through the system and the choices we made on our way to launch.</p>
<p><strong>Hosting</strong><br />
The obvious and easy choice was <em>Amazon Web Services (AWS) Elastic Compute Cloud (EC2)</em>. A hosting service such as <em>EC2</em> allows a lot of flexibility in server solutions, including quick ramp-up if the luxury problem of needing more servers occurs. Freedom to choose the operating system and a virtual server resembling what you get from a traditional hosting service were also important for us. There were alternatives: traditional hosting services where you book physical servers would have incurred a larger fixed cost than we wanted, and at the same time we would have lost the flexibility. We currently fire up servers and test things in no time, then shut those servers down after the test. With a traditional hosting provider, we would need spare servers for testing or have to use the live servers. <em>Windows Azure</em> could have been an alternative, but it&#8217;s just a <em>.NET Windows</em> hosting environment and not flexible enough. I feel I should mention Google App Engine as well, but we didn&#8217;t really consider it since it&#8217;s too restricted and even more locked in than <em>Windows Azure</em>.</p>
<p><strong>Operating System</strong><br />
We decided to use <em>CentOS</em> for the servers. The reason for this was favorable experience with <em>Fedora</em> over the last year and <em>Ubuntu</em> getting more painful at the same time.<em> Red Hat Enterprise Linux (RHEL)</em> was a contender, but <em>CentOS</em> provides us with the part of <em>RHEL</em> that we need without all the parts that we would pay for, but not use. <em>Debian</em> would have been our choice a few years back, but the <em>Fedora</em> and <em>Ubuntu</em> experiences over the last year brought us down on the <em>Fedora</em>, <em>RHEL</em>, <em>CentOS</em> side and as mentioned, <em>CentOS</em> was the best fit among those three. <em>Windows</em> wasn&#8217;t considered. We had absolutely no need for anything <em>Windows</em> in there and remote management and configuration of <em>Linux</em> systems is so much easier while being extremely stable. The <em>Amazon Machine Image (AMI)</em> we use was created by <a href="http://www.rightscale.com/">RightScale</a>.</p>
<p><strong>Web Serving Environment</strong><br />
We chose <em>Apache</em> because that&#8217;s what we&#8217;ve used in the past and <em>Apache</em> is solid and dependable. The downside of <em>Apache</em> is that it&#8217;s a big beast that might be overkill for our needs. An alternative that we looked at briefly is <em>nginx</em>, but we didn&#8217;t want to spend time on that before launch. We will however look more closely at <em>nginx</em> in the future.</p>
<p><strong>Programming Language</strong><br />
I am a rubyist and we chose <em>Ruby</em>. The first time I tried <em>Ruby</em>, I wrote a few scripts that I would normally have written in <em>Perl</em>. These weren&#8217;t big scripts, but far from one-liners. They downloaded some content and analyzed it. Writing the scripts in <em>Ruby</em> took me a bit longer than it would have taken writing them in <em>Perl</em>, but they worked right away. In <em>Perl</em> there is always something wrong unless you have a one-liner. The strongest alternative was <em>Python</em> and between <em>Ruby</em> and <em>Python</em> it comes down to taste. The first readability I got as a Googler was in <em>Python</em> and I did write a lot of <em>Python</em>, but it still doesn&#8217;t feel right. I was very happy when I got a chance to sneak myself to a <em>Ruby</em> readability. <em>Java</em> was sort of not considered, but would have been the choice if the lightweight short time-to-market alternatives hadn&#8217;t been available. We use <em>Phusion Passenger</em> to serve a combination of pure <em>Ruby</em> Rack apps and <em>Sinatra</em> apps in <em>Apache</em>.</p>
<p><strong>Storage</strong><br />
<em>CouchDB</em> won this one. Key-value storage is what we need, so relational databases are a complication that we can happily forget about. Selling points of <em>CouchDB</em> were that it was easy to get started with, it has an <em>HTTP RESTful API</em>, and views are written as <em>JavaScript</em> map-reduce. Map-reduce is well suited to our calculation needs, so we get a lot more done inside the database than we would with, e.g., <em>SQL</em>. The strongest alternative was definitely <em>MongoDB</em>, which is very similar to <em>CouchDB</em> but uses a more classical query language. We also briefly looked at <em>Voldemort</em>, but our calculation needs are not suited to <em>Voldemort</em>&#8217;s simple key lookup scheme. <em>SimpleDB</em> ties us to <em>Amazon</em> and <em>Hadoop</em> is a bit too heavy for our needs. We didn&#8217;t consider <em>MySQL</em>, but that would have been our choice had we needed a relational database.</p>
<p><strong>Conclusion</strong><br />
Our <em>LARC</em> stack (<em>Linux</em>, <em>Apache</em>, <em>Ruby</em>, <em>CouchDB</em>) works very well for us, and that combination has enabled us to develop our service very quickly, with the result being a system that is easy to maintain. We haven&#8217;t really looked back at those choices except for a few moments after upgrading to <em>CouchDB</em> 0.10.0 and seeing that our map-reduce views that put a lot of work in the reducers had to be rewritten. One look at <em>MongoDB</em>&#8217;s query language stopped those regrets. As mentioned when discussing web serving environments, we do consider switching from <em>Apache</em> to <em>nginx</em>, but it&#8217;s not a high priority; we are happy with <em>Apache</em>, and the consideration comes more from curiosity than need. As for <em>CentOS</em> and <em>EC2</em>, they just work and get out of the way, which is exactly what they should do.</p>
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[Tokyo Tyrant is cool]]></title>
<link>http://contourline.wordpress.com/2009/10/30/tokyo-tyrant-is-cool/</link>
<pubDate>Sat, 31 Oct 2009 06:30:35 +0000</pubDate>
<dc:creator>jmarca</dc:creator>
<guid>http://contourline.wordpress.com/2009/10/30/tokyo-tyrant-is-cool/</guid>
<description><![CDATA[Just to have a recollection of this later, some notes. setting up tokyo tyrant instances, one per mo]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p>Just to have a recollection of this later, some notes.</p>
<p>Setting up Tokyo Tyrant instances, one per month.  I expect about 4 million records a day, so that is 120 million a month, so I set bnum to 480 million, which seems insane, but worth a shot.</p>
<p>One thing I noticed: in shifting from one-day tests to a one-month populate, and with the bump of bnum from 8 million (2 times 4 million) to 480 million, there is a significant speed drop in populating the data from four simultaneous processes (one for each of 4 months).</p>
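<p>The bnum arithmetic above, written out as a small JavaScript sketch. The function name and the 30-day month are assumptions for illustration; the headroom factors of 2 and 4 are the ones implied by the numbers in this post.</p>

```javascript
// bnum sizing as described in the post: expected record count times a headroom factor.
function bucketCount(recordsPerDay, days, factor) {
  return recordsPerDay * days * factor;
}

const oneDayTest = bucketCount(4e6, 1, 2); // 8 million buckets for the one-day tests
const oneMonth = bucketCount(4e6, 30, 4);  // 480 million buckets for a month of data
console.log(oneDayTest, oneMonth);
```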
<p>There is write delay of course, and that may be all of it, since the files are big now.</p>
<p>Perhaps there is a benefit from wider tables, rather than one row per data record?  Like one row per hour of data per sensor, or one row per 5 minutes, etc?</p>
<p>Also, as I wrapped up my initial one-day tests, I got some random crashes on my Perl script stuffing data in.  Not sure why.  Could be because I was tweaking parameters and stuff.</p>
<p>One final point: the size of one day of data in Tokyo Cabinet is about the same as the size of one day of data in CouchDB.  I was hoping to get a much bigger size advantage (smaller file).  The source data is about a 100 MB unzipped CSV file, and it balloons to 600 MB with bnum set at 8 million in a table database.  Of course, it isn&#8217;t strictly the same data&#8230; I am splitting the timestamp into parts so I can do more interesting queries without a lot of work (give me an average of data on Mondays in July; Tuesdays all year; 8 am to 9 am last Wednesday, etc.).</p>
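<p>Splitting a timestamp into queryable parts might look something like this JavaScript sketch. The field names and the choice of UTC are assumptions; the actual loader is a Perl script whose column layout is not shown here.</p>

```javascript
// Break an ISO 8601 timestamp into separate fields that can each be queried.
// Field names are hypothetical; all values are taken in UTC.
function splitTimestamp(iso) {
  const d = new Date(iso);
  return {
    year: d.getUTCFullYear(),
    month: d.getUTCMonth() + 1, // getUTCMonth() is zero-based
    day: d.getUTCDate(),
    weekday: d.getUTCDay(),     // 0 = Sunday, 1 = Monday, ...
    hour: d.getUTCHours(),
    minute: d.getUTCMinutes(),
  };
}

const parts = splitTimestamp("2009-07-06T08:30:00Z");
console.log(parts);
```

<p>With fields like these stored on every row, &#8220;Mondays in July&#8221; becomes a match on <code>weekday</code> and <code>month</code> instead of date arithmetic at query time, at the cost of a larger record.</p>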
<p>So, slotting 4 months of data away.  I&#8217;ll check it again on Monday and see if it worked.</p>
<p>And by the way, I&#8217;m sure I&#8217;m not the best at this because I haven&#8217;t used it much, but it is orders of magnitude faster to use the COPY command via DBIx::Class to load CSV data into PostgreSQL.  Of course, I don&#8217;t want to have all of that data sitting in my relational database, but I&#8217;m just saying&#8230;</p>
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[Putting stuff away]]></title>
<link>http://contourline.wordpress.com/2009/10/26/putting-stuff-away/</link>
<pubDate>Mon, 26 Oct 2009 16:34:04 +0000</pubDate>
<dc:creator>jmarca</dc:creator>
<guid>http://contourline.wordpress.com/2009/10/26/putting-stuff-away/</guid>
<description><![CDATA[Started testing out TokyoCabinet and TokyoTyrant last Friday, and got my initial test program runnin]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p>Started testing out <a href="http://1978th.net/tokyocabinet/" target="_blank">TokyoCabinet</a> and <a href="http://1978th.net/tokyotyrant/" target="_blank">TokyoTyrant</a> last Friday, and got my initial test program running this morning.  The documentation is pretty good, but I&#8217;m still floundering about a little bit.  Not sure what parameters to pass to the B+ tree database file to make it work well for my data; not sure how to set up multiple databases for sharding; etc etc.  On the plus side, my Perl code that loads the data is running at about 50% CPU, so it is doing something rather than waiting around for writes.  On the down side, now I have to write a small program to check on the progress of those writes to make sure that I am actually writing something!</p>
<p>Update.  I am comparing storing in TokyoTyrant with storing in CouchDB.  CouchDB, it turns out, is faster for me out of the box because of the way Erlang takes advantage of the multi-core processor.  The Tokyo Tyrant server just maxes out one core, and so my loading programs wait around for the server to process the data.  CouchDB, on the other hand, will use up lots more cores (I&#8217;ve seen the process go to about 400% in top).  So loading a year of data with one data reading process per month simultaneously, TokyoTyrant is only up to day 6 of each month, while my CouchDB loader programs are all up to about day 14 in each month.</p>
<p>I&#8217;m sure there is a way to set up TokyoTyrant to use multiple CPUs, but I can&#8217;t find it yet.</p>
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[Experiences with CouchDB]]></title>
<link>http://mauricioszabo.wordpress.com/2009/10/21/experiencias-com-o-couchdb/</link>
<pubDate>Thu, 22 Oct 2009 00:47:06 +0000</pubDate>
<dc:creator>Maurício Szabo</dc:creator>
<guid>http://mauricioszabo.wordpress.com/2009/10/21/experiencias-com-o-couchdb/</guid>
<description><![CDATA[Recently, after Rails Summit, I decided to try out a technology that I saw there and found extremely]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p>Recently, after Rails Summit, I decided to try out a technology that I saw there and found extremely interesting: CouchDB.</p>
<p>CouchDB, a project hosted in the Apache Incubator, is a non-relational database with some interesting characteristics: instead of organizing data by table, you organize it by documents, define a &#8220;type&#8221; in the document, and they all live there together. Then, to fetch the documents you want, you use a concept called a View: basically, a pair of JavaScript functions that build an index over your documents and expose them. Its native format is JSON, which means multi-valued attributes, hashes, dynamic typing and so on. A typical record:</p>
<pre class="brush: jscript;">
{
  tipo: &#34;Pessoa&#34;,
  nome: &#34;Maurício Szabo&#34;,
  idade: 21
}
</pre>
<p>To list only the people (indexing them by name, for example), you use a JavaScript function similar to the following:</p>
<pre class="brush: jscript;">
function(doc) {
  if(doc.tipo &#38;&#38; doc.tipo == &#34;Pessoa&#34;) {
    emit(doc.nome, doc)
  }
}
</pre>
<p>I thought to myself: &#8220;Wow, how wonderful! I can shape the data <strong>in the database</strong> the way I want! And it even indexes it for me!&#8221; So I decided to run an experiment: port the enrollment system we developed to CouchDB (after all, I wouldn&#8217;t have to worry about validations, screens, HTML, etc., because all of that is already done).</p>
<p><strong>The first problem</strong> came when modeling the data. CouchDB allows a document to be relatively large, but in the system we built it was possible to register course schedules concurrently! I couldn&#8217;t put all of that in a single Disciplina (course) document, because there could be conflicts (two people editing the same course, but editing different schedules). I ended up modeling it in a somewhat &#8220;relational&#8221; way, but I saw that this was a flaw in how I had originally thought about the system.</p>
<p><strong>The second problem</strong> came when looking up an enrollment period. CouchDB <em>simply</em> does not let me search using two fields! So I couldn&#8217;t, for example, search for <strong>periodo.inicio &#60; Date.today &#38;&#38; periodo.fim &#62; Date.today</strong>. I asked several people for a solution, and they all sent me the same answer: &#8220;Use Views!&#8221; How? Until finally, on the CouchDB mailing list, I got an answer: &#8220;Use Lucene (a Couch extension) or use Views, emitting one record for each day on which the enrollment period runs&#8221;. That is, if I had an enrollment period running from 10/01 to 15/01, I would need to emit, for each period in the database, 5 index entries. Well, I didn&#8217;t even try Lucene: if a database doesn&#8217;t do something this basic on its own, I really don&#8217;t even want to see the rest of it.</p>
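<p>The mailing-list suggestion of one emitted row per day could be sketched roughly as follows. The document shape is an assumption based on the <code>periodo.inicio</code> and <code>periodo.fim</code> fields mentioned above, both endpoints are treated as included, and <code>emit</code> is a local stand-in so the sketch runs in a modern JavaScript engine such as Node.js (the JavaScript engine bundled with older CouchDB releases may lack some of these Date helpers).</p>

```javascript
// Local stand-in for the emit() that the CouchDB view server provides.
const rows = [];
function emit(key, value) {
  rows.push({ key, value });
}

// Assumed document shape (field names follow the post's periodo.inicio / periodo.fim):
// { _id: "p1", tipo: "Periodo", inicio: "2009-10-01", fim: "2009-10-03" }
function mapPeriodo(doc) {
  if (doc.tipo !== "Periodo") return;
  const day = new Date(doc.inicio + "T00:00:00Z");
  const end = new Date(doc.fim + "T00:00:00Z");
  // Emit one row per calendar day, both endpoints included.
  while (end.getTime() >= day.getTime()) {
    emit(day.toISOString().slice(0, 10), doc._id);
    day.setUTCDate(day.getUTCDate() + 1);
  }
}

// Equivalent of querying the view with ?key="<date>".
function periodsActiveOn(date) {
  return rows.filter((row) => row.key === date).map((row) => row.value);
}
```

<p>The two-field range test then becomes a single exact-key lookup on today&#8217;s date, paid for with one index row per day of every period.</p>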
<p>Oh, and I had even thought about building an application (Wiki-like) using CouchDB, to see whether it fares better when you use something that looks more like a &#8220;document&#8221;. It turned out that if an article has several tags, I can&#8217;t fetch all the posts that have the tags &#8220;Ruby&#8221; and &#8220;Rails&#8221;, for example; I can only search for one at a time. I ended up giving up on CouchDB, and now I&#8217;m studying MongoDB.</p>
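<p>One possible workaround, not something tried in this post, is to emit one row per tag and intersect the results of two single-tag lookups on the client. The sketch below assumes a hypothetical document shape with a <code>tags</code> array and again uses a local stand-in for <code>emit</code>.</p>

```javascript
// Local stand-in for the emit() that the CouchDB view server provides.
const rows = [];
function emit(key, value) {
  rows.push({ key, value });
}

// Assumed document shape: { _id: "post-1", tags: ["Ruby", "Rails"] }
function mapTags(doc) {
  if (!Array.isArray(doc.tags)) return;
  doc.tags.forEach((tag) => emit(tag, doc._id));
}

// One view lookup per tag (?key="<tag>"), simulated locally.
function idsForTag(tag) {
  return rows.filter((row) => row.key === tag).map((row) => row.value);
}

// Client-side intersection: keep ids that appear under every requested tag.
function idsWithAllTags(tags) {
  if (tags.length === 0) return [];
  return tags
    .map((tag) => idsForTag(tag))
    .reduce((common, ids) => common.filter((id) => ids.includes(id)));
}
```

<p>This costs one request per tag plus work in the application, which is the kind of extra effort the post is complaining about.</p>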
<p>Yes, MongoDB lets me do these things. Its problem is the Object-Relational Mappers (which try, so to speak, to make MongoDB look like a relational database). I&#8217;m even building my own ORM for Ruby: MongoParadigm. News coming soon.</p>
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[Dev Days Revisited]]></title>
<link>http://filteryourinput.wordpress.com/2009/10/20/dev-days-revisited/</link>
<pubDate>Wed, 21 Oct 2009 04:23:33 +0000</pubDate>
<dc:creator>gravelpot</dc:creator>
<guid>http://filteryourinput.wordpress.com/2009/10/20/dev-days-revisited/</guid>
<description><![CDATA[Just a quick recap of the StackOverflow Dev Days conference that I attended last week&#8230; Overall]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p><a href="http://stackoverflow.com"><img align="right" class="alignright size-full wp-image-41" src="http://blogs.utexas.edu/pfg/files/2009/10/logo.png" alt="" width="250" height="61" /></a>Just a quick recap of the <a href="http://stackoverflow.carsonified.com/events/austin/">StackOverflow Dev Days</a> conference that I attended last week&#8230;</p>
<p>Overall, I thought it was a good experience. This is a one-day &#8220;roadshow&#8221; conference, where they are getting local speakers to talk on a group of topics in several different cities around the U.S. and Europe. The basic format was 30-55 minute talks on each topic, with a little time for Q&#38;A at the end. The Austin topics were:</p>
<ul>
<li>Keynote by Joel Spolsky</li>
<li>Python</li>
<li>iPhone development</li>
<li>FogBugz 7</li>
<li>ASP.NET MVC</li>
<li>jQuery</li>
<li>Erlang/CouchDB</li>
<li>Code Review Doesn&#8217;t Have to Suck</li>
</ul>
<p>I had to leave for a while in the middle of the day and missed ASP.NET (oh well) and jQuery (kind of a bummer, but hopefully not much I didn&#8217;t already know).</p>
<p>I enjoyed the rest of the sessions. Here are my brief impressions:</p>
<p><strong>Keynote</strong> &#8211; Joel gave an entertaining talk about the value of simplicity in developing software. Lots of clever examples of how developers ruin the user experience by presenting the user with too many options (dialogs, wizards, settings panels, etc.).</p>
<p><strong>Python</strong> &#8211; Eric Jones from <a href="http://www.enthought.com/">Enthought</a> gave a good introduction to the power of Python by doing a code walkthrough of Peter Norvig&#8217;s <a href="http://norvig.com/spell-correct.html">21-line spell-checker</a>. Very cool. He also showed some of the mathematical and scientific visualization packages that his company has developed in Python. Interestingly, since all of our activity around Python at UT involves Django and web applications, there was zero web content to this presentation.</p>
<p><strong>iPhone</strong> &#8211; Jon Johnson ran through putting together a quick demo iPhone app, as well as discussing the pros and cons of iPhone development. It seems like the level of control that Apple exerts over what goes into the App Store is unprecedented, and a bit unpredictable/inconsistent. He told a story about submitting several basically identical apps (for different conferences) to the store and having all but one accepted.</p>
<p><strong>FogBugz 7</strong> &#8211; Basically a sales demo for the new release of Fog Creek&#8217;s flagship product. I think we sat through this to subsidize the rest of the day, which was a worthwhile trade. It looks like a great product, and the new hosted Mercurial integration and code review tools look really cool, but we already have Jira, and I don&#8217;t see us moving to a commercial hosted product anytime soon.</p>
<p><strong>Erlang/CouchDB</strong> &#8211; Probably my favorite presentation of the day. I had read a bit about CouchDB (and key-value pair databases, generally), but still wasn&#8217;t quite sure how they worked, or what problem they were designed to solve. I&#8217;m still not an expert, but Damien Katz was an entertaining and engaging speaker, and I think I understand the product class better now. We didn&#8217;t get to see any actual Erlang code, though. <img src='http://s.wordpress.com/wp-includes/images/smilies/icon_sad.gif' alt=':-(' class='wp-smiley' /> </p>
<p><strong>Code Review Doesn&#8217;t Have To Suck</strong> &#8211; Jason Cohen gets two awards for the day &#8212; one for the most entertaining presenter, and the second for giving an hour-long presentation about code review without pushing the <a href="http://smartbear.com/codecollab.php">collaborative code review product</a> that he was there at the conference to sell (as a sponsor). And his talk was chock-full of informative bits of research-backed advice about how to do code review right (ex: Code review sessions should last one hour or less; after that time, the number of bugs found per lines of code reviewed drops off dramatically).</p>
<p>Some of the folks I&#8217;ve talked to didn&#8217;t actually enjoy the conference that much &#8212; they felt that there wasn&#8217;t a consistent focus, or that the presentations were basically pitches for particular products. And there&#8217;s some truth in both of those statements. But, I&#8217;m not going to any other conferences anytime soon, and it was good to experience the wider Austin development community. If they do it again next year, I&#8217;ll have to look hard at the topic mix to see if it seems like it will offer any new content, but for $99, I&#8217;d probably do it again.</p>
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[Install Apache couchdb on MacOSX]]></title>
<link>http://dboettger.wordpress.com/2009/10/18/install-apache-couchdb-on-macosx/</link>
<pubDate>Sun, 18 Oct 2009 17:24:36 +0000</pubDate>
<dc:creator>dboettger</dc:creator>
<guid>http://dboettger.wordpress.com/2009/10/18/install-apache-couchdb-on-macosx/</guid>
<description><![CDATA[Last year on the PHP Conference in Mainz it was the first time i heard about couchdb. I was quite im]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p>I first heard about CouchDB last year at the PHP Conference in Mainz. I was quite impressed, but had no use case for this database. In the near future I will need to improve document handling in our applications, so I want to work with CouchDB to improve versioning and searching.</p>
<p>You can find more information about CouchDB on http://couchdb.apache.org.</p>
<p>As everybody knows, the first step is to install CouchDB on the local system. As I am an Apple follower <img src='http://s.wordpress.com/wp-includes/images/smilies/icon_wink.gif' alt=';-)' class='wp-smiley' />, I looked for a guide to installing it on Snow Leopard.</p>
<p>Original information from http://blog.deadinkvinyl.com/2008/07/12/couchdb-on-macosx-leopard/</p>
<p>I made some corrections for copy-and-paste guys like me <img src='http://s.wordpress.com/wp-includes/images/smilies/icon_wink.gif' alt=';-)' class='wp-smiley' />, but the original post has much more information about the individual steps.</p>
<p># Install needed packages via macports<br />
sudo port install icu erlang spidermonkey</p>
<p># Download the latest couchdb file, then unpack and build it<br />
tar xvzf apache-couchdb-0.10.0.tar.gz<br />
cd apache-couchdb-0.10.0<br />
./configure<br />
make</p>
<p>sudo make install</p>
<p># Show all used userId&#8217;s<br />
dscl . -list /Users UniqueID &#124; awk &#39;{print $2}&#39; &#124; sort -n</p>
<p># Show all used groupId&#8217;s<br />
dscl . -list /Groups PrimaryGroupID &#124; awk &#39;{print $2}&#39; &#124; sort -n</p>
<p># We use groupid and userid 103 for the couchdbuser<br />
sudo dseditgroup -o create -i 103 -r &#34;CouchDB Users&#34; couchdb<br />
sudo dscl . -create /Users/couchdb<br />
sudo dscl . -create /Users/couchdb UniqueID 103<br />
sudo dscl . -create /Users/couchdb UserShell /bin/bash<br />
sudo dscl . -create /Users/couchdb RealName &#34;CouchDB Administrator&#34;<br />
sudo dscl . -create /Users/couchdb NFSHomeDirectory \<br />
/usr/local/var/lib/couchdb<br />
sudo dscl . -create /Users/couchdb PrimaryGroupID 103<br />
sudo dscl . -create /Users/couchdb Password &#39;*&#39;</p>
<p>sudo chown -R couchdb:couchdb /usr/local/var/lib/couchdb<br />
sudo chown -R couchdb:couchdb /usr/local/var/log/couchdb</p>
<p># Startup couchdb</p>
<p>sudo -u couchdb couchdb</p>
<p>Create a copy of the plist file and edit it.</p>
<p>cp /usr/local/Library/LaunchDaemons/org.apache.couchdb.plist \<br />
/var/tmp/org.apache.couchdb.plist<br />
open /var/tmp/org.apache.couchdb.plist</p>
<p>1. Open Root → EnvironmentVariables<br />
2. Click on Add Child<br />
3. Name: PATH<br />
4. Value: /bin:/sbin:/usr/bin:/usr/sbin:/usr/local/bin:/usr/local/sbin:/opt/local/bin:/opt/local/sbin<br />
5. File → Save<br />
6. Quit Property List Editor</p>
<p>sudo cp /var/tmp/org.apache.couchdb.plist \<br />
/usr/local/Library/LaunchDaemons/org.apache.couchdb.plist</p>
<p># To control the database, simply execute<br />
sudo launchctl load \<br />
/usr/local/Library/LaunchDaemons/org.apache.couchdb.plist<br />
# to start<br />
# or<br />
sudo launchctl unload \<br />
/usr/local/Library/LaunchDaemons/org.apache.couchdb.plist<br />
# to stop the database</p>
<p># Automatically launch<br />
sudo ln -s /usr/local/Library/LaunchDaemons/org.apache.couchdb.plist \<br />
/Library/LaunchDaemons/org.apache.couchdb.plist</p>
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[Double Shot #562]]></title>
<link>http://afreshcup.com/2009/10/15/double-shot-562/</link>
<pubDate>Thu, 15 Oct 2009 10:28:28 +0000</pubDate>
<dc:creator>Mike Gunderloy</dc:creator>
<guid>http://afreshcup.com/2009/10/15/double-shot-562/</guid>
<description><![CDATA[Today it would be nice to hit a few home runs. Browsera &#8211; Automated browser compatibility test]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p>Today it would be nice to hit a few home runs.</p>
<ul>
<li><strong><a href="http://www.browsera.com/">Browsera</a></strong> &#8211; Automated browser compatibility testing with a free trial available.</li>
<li><strong><a href="http://fit.rubyforge.org/">RubyFIT 1.2</a></strong> &#8211; Port of the FIT testing framework from Java to Ruby.</li>
<li><strong><a href="http://banisterfiend.wordpress.com/2009/10/14/the-devil-image-library-for-ruby/">The DevIL Image Library For Ruby</a></strong> &#8211; Another alternative for loading, saving, thumbnailing, scaling, rotations, and so on.</li>
<li><strong><a href="http://www.markrichman.com/2009/10/14/tools-of-the-trade/">Tools of the Trade</a></strong> &#8211; Mark Richman chimes in on the theme.</li>
<li><strong><a href="http://litanyagainstfear.com/blog/2009/10/14/gem-bundler-is-the-future/">Gem Bundler is the Future</a></strong> &#8211; A rundown on the new plan for handling dependency resolution in Rails 3. I&#8217;m not yet convinced.</li>
<li><strong><a href="http://mail-archives.apache.org/mod_mbox/couchdb-dev/200910.mbox/%3C4AD53996.3090104@canonical.com%3E">CouchDB in Ubuntu</a></strong> &#8211; Looks like Ubuntu 10.04 is going to set every user up with CouchDB for cloudish storage.</li>
</ul>
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[Logs with CouchLog]]></title>
<link>http://baldrailers.wordpress.com/2009/10/12/logs-with-couchlog/</link>
<pubDate>Mon, 12 Oct 2009 07:17:26 +0000</pubDate>
<dc:creator>baldrailers</dc:creator>
<guid>http://baldrailers.wordpress.com/2009/10/12/logs-with-couchlog/</guid>
<description><![CDATA[Ever wanted to have your Rails logs into a different location? Better yet&#8230; a database. I am ve]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p>Ever wanted to have your Rails logs in a different location? Better yet&#8230; a database? I have been lucky enough to play around with <a href="http://github.com/ads/couchlog">CouchLog</a>, an open source project from <a href="http://www.adsdevshop.com/">ADS</a> (<em>Atlantic Dominion Solutions</em>): &#8220;<span id="repository_description"><em>A Rails plugin that uses CouchDB to store app logs rather than the standard log files.</em>&#8221;</span></p>
<p>If you don&#8217;t have <a href="http://couchdb.apache.org">CouchDB</a> installed on your system, you can check my other <a href="http://baldrailers.wordpress.com/2009/10/12/relaxed-installation-of-couchdb">post</a> for installation instructions.</p>
<p>Here&#8217;s how you integrate it into your Rails application as a plugin:</p>
<pre style="white-space:pre-wrap;color:#63ff00;background-image:initial;background-repeat:initial;background-attachment:initial;background-color:#000000;font:normal normal normal 10px/normal 'bitstream vera sans mono', monaco, 'lucida console', 'courier new', courier, serif;background-position:initial initial;padding:5px;"><code style="font:normal normal normal 10px/normal 'bitstream vera sans mono', monaco, 'lucida console', 'courier new', courier, serif;">script/plugin install git://github.com/ads/couchlog.git</code></pre>
<p>Upon installation, the plugin automatically creates <strong>couchlog.rb</strong> in your <em>lib</em> folder and a sample <strong>couchlog.yml</strong> file in your <em>config</em> directory. The <strong>couchlog.yml</strong> file contains the information about your <a href="http://couchlog.apache.org">CouchDB</a> server.</p>
<p>Add the following lines to your environment.rb:</p>
<pre style="white-space:pre-wrap;color:#63ff00;background-image:initial;background-repeat:initial;background-attachment:initial;background-color:#000000;font:normal normal normal 10px/normal 'bitstream vera sans mono', monaco, 'lucida console', 'courier new', courier, serif;background-position:initial initial;padding:5px;"># config/environment.rb
require 'couch_log'</pre>
<pre style="white-space:pre-wrap;color:#63ff00;background-image:initial;background-repeat:initial;background-attachment:initial;background-color:#000000;font:normal normal normal 10px/normal 'bitstream vera sans mono', monaco, 'lucida console', 'courier new', courier, serif;background-position:initial initial;padding:5px;">config.log_level = ENV['RAILS_ENV']=='production' ?
ActiveSupport::BufferedLogger::Severity::INFO :
ActiveSupport::BufferedLogger::Severity::DEBUG

config.logger = CouchLog.new(config.log_path, config.log_level)</pre>
<p>CouchLog is still under development; feel free to fork it.</p>
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[Relaxed Installation of CouchDB]]></title>
<link>http://baldrailers.wordpress.com/2009/10/12/relaxed-installation-of-couchdb/</link>
<pubDate>Mon, 12 Oct 2009 06:55:59 +0000</pubDate>
<dc:creator>baldrailers</dc:creator>
<guid>http://baldrailers.wordpress.com/2009/10/12/relaxed-installation-of-couchdb/</guid>
<description><![CDATA[This is a very short guide about installation of Couchdb in Ubuntu. I would suspect that you already]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p>This is a very short guide to installing <a title="couchdb" href="http://couchdb.apache.org/">CouchDB</a> on Ubuntu. I assume that you have already updated your system; if not&#8230; you can do it first before moving forward with this guide.</p>
<pre style="white-space:pre-wrap;color:#63ff00;background-image:initial;background-repeat:initial;background-attachment:initial;background-color:#000000;font:normal normal normal 10px/normal 'bitstream vera sans mono', monaco, 'lucida console', 'courier new', courier, serif;background-position:initial initial;padding:5px;"><code style="font:normal normal normal 10px/normal 'bitstream vera sans mono', monaco, 'lucida console', 'courier new', courier, serif;">sudo apt-get update</code></pre>
<p>After that, we are good to go:</p>
<pre style="white-space:pre-wrap;color:#63ff00;background-image:initial;background-repeat:initial;background-attachment:initial;background-color:#000000;font:normal normal normal 10px/normal 'bitstream vera sans mono', monaco, 'lucida console', 'courier new', courier, serif;background-position:initial initial;padding:5px;">sudo apt-get install couchdb</pre>
<p>That&#8217;s it: a simple and relaxed installation. You can check it by pointing your browser to http://localhost:5984/_utils/</p>
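<p>If you would rather check from a script than from a browser: a running CouchDB answers a GET on its root URL with a small JSON welcome document. The following is a minimal Python sketch; the function name <code>looks_like_couchdb</code> is made up for this example, and the sample body is illustrative rather than captured output.</p>

```python
import json

def looks_like_couchdb(body):
    """Return True if `body` parses as CouchDB's JSON welcome document."""
    try:
        doc = json.loads(body)
    except ValueError:
        return False
    # The root URL of a running CouchDB answers with {"couchdb": "Welcome", ...}.
    return isinstance(doc, dict) and doc.get("couchdb") == "Welcome"

# Illustrative body; the version string is a placeholder, not real output.
sample = '{"couchdb": "Welcome", "version": "x.y.z"}'
print(looks_like_couchdb(sample))
```

<p>To run it against a live install, fetch the body with <code>urllib.request.urlopen("http://localhost:5984/").read().decode()</code> and pass it in.</p>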
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[Web Konferencia 2009]]></title>
<link>http://rubymood.wordpress.com/2009/10/08/web-konferencia-2009/</link>
<pubDate>Thu, 08 Oct 2009 12:48:00 +0000</pubDate>
<dc:creator>rubymood</dc:creator>
<guid>http://rubymood.wordpress.com/2009/10/08/web-konferencia-2009/</guid>
<description><![CDATA[This year the web conference was finally scheduled on a non-working day, so I went ahead and registered]]></description>
<content:encoded><![CDATA[This year the web conference was finally scheduled on a non-working day, so I went ahead and registered]]></content:encoded>
</item>
<item>
<title><![CDATA[[RT] MyCouch]]></title>
<link>http://lsimons.wordpress.com/2009/10/07/rt-mycouch/</link>
<pubDate>Tue, 06 Oct 2009 23:40:33 +0000</pubDate>
<dc:creator>Leo Simons</dc:creator>
<guid>http://lsimons.wordpress.com/2009/10/07/rt-mycouch/</guid>
<description><![CDATA[The below post is an edited version of a $work e-mail, re-posted here at request of some colleagues ]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p>The post below is an edited version of a <code>$work</code> e-mail, re-posted here at the request of some colleagues who wanted to forward the story. My apologies if some of the bits are unclear due to lack of context. In particular, let me make clear:</p>
<ul>
<li>we have had a production CouchDB setup for months that works well</li>
<li>we are planning to keep that production setup roughly intact for many more months and we are <em>not</em> currently planning to migrate away from CouchDB <em>at all</em></li>
<li>overall we are big fans of the CouchDB project and its community and we expect great things to come out of it</li>
</ul>
<p>Nevertheless, using pre-1.0 software based on an archaic language with rather crappy error handling can get frustrating <img src='http://s.wordpress.com/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' /> </p>
<pre>
Subject: [RT] MyCouch
From: Leo Simons
To: Forge Engineering
</pre>
<p>This particular RT gives one possible answer to the question &#8220;what would be a good way to make this KV debugging somewhat less frustrating?&#8221; <em>(we have been fighting erratic response times from CouchDB under high load while replicating and compacting)</em></p>
<p>That answer is &#8220;we could probably replace CouchDB with java+mysql, and it might even be easy to do so&#8221;. And, then, &#8220;if it really is easy, that&#8217;s extra cool (and _because of_ CouchDB)&#8221;.</p>
<h4>Why replace CouchDB?</h4>
<p>Things we really like about CouchDB (as the backend for our KV service):</p>
<ul>
<li>The architecture: HTTP/REST all the way down, MVCC, many-to-many replication, scales without bound, neat composable building blocks makes an evolvable platform.</li>
<li>Working system: It&#8217;s in production, it&#8217;s running, it&#8217;s running pretty well.</li>
<li>Community: open source, active project, know the developers, &#8220;cool&#8221;.</li>
<li>Integrity: it hasn&#8217;t corrupted or lost any data yet, and it probably won&#8217;t anytime soon.</li>
</ul>
<p>Things we like less:</p>
<ul>
<li>Debugging: cryptic error messages, erlang stack traces, process deaths.</li>
<li>Capacity planning: many unknown and changing performance characteristics.</li>
<li>Immaturity: pre-1.0.</li>
<li>Humanware: lack of erlang development skills, lack of DBA-like skills, lack of training material (or trainers) to gain skills.</li>
<li>Tool support: JProfiler for erlang? Eclipse for erlang? Etc.</li>
<li>Map/Reduce and views: alien concept to most developers, hard to audit and manage free-form javascript from tenants, hard to use for data migrations and aggregations.</li>
<li>JSON: leads to developers storing JSON which is horribly inefficient.</li>
</ul>
<p>Those things we don&#8217;t like about couch unfortunately aren&#8217;t going to change very quickly. For example, the effort required to train up a bunch of DBAs so they can juggle CouchDB namespaces and instances and on-disk data structures is probably rather non-trivial.</p>
<h4>The basic idea</h4>
<p>It is not easy to see what other document storage system out there would be a particularly good replacement. Tokyo Cabinet, Voldemort, Cassandra, &#8230; all of these are also young and immature systems with a variety of quirks. Besides, we really really like the CouchDB architecture.</p>
<p>So why don&#8217;t we replace CouchDB with a re-implemented CouchDB? We keep the architecture almost exactly the same, but re-implement the features we care about using technology that we know well and is in many ways much more boring. &#8220;HTTP all the way down&#8221; should mean this is possible.</p>
<p>We could use mysql underneath (but not use any of its built-in replication features). The java program on top would do the schema and index management, and most importantly implement the CouchDB replication and compaction functionality.</p>
<p>We could even keep the same deployment structure. Assuming one java server is paired with one mysql database instance, we&#8217;d end up with 4 tomcat instances on 4 ports (5984-5987) and 4 mysql services on 4 other ports (3306-3309). Use of mysqld_multi probably makes sense. Eventually we could perhaps optimize a bit more by having one tomcat process and one mysql process &#8211; it&#8217;ll make better use of memory.</p>
<p>Now, what is really really really cool about the CouchDB architecture and its complete HTTP-ness is that we should be able to do any actual migration one node at a time, without downtime. Moving the data across is as simple as running a replication. Combined with the fact that we&#8217;ve been carefully avoiding a lot of its features, CouchDB is probably one of the _easiest_ systems to replace <img src='http://s.wordpress.com/wp-includes/images/smilies/icon_biggrin.gif' alt=':-D' class='wp-smiley' /> </p>
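<p>To make &#8220;running a replication&#8221; concrete: CouchDB starts one when it receives a POST to <code>/_replicate</code> with a JSON body naming a source and a target. The Python sketch below only builds such a request and does not send it. The host names and the <code>kv</code> database name are hypothetical, and a java+mysql re-implementation would have to accept the same request for the node-at-a-time migration to work.</p>

```python
import json
import urllib.request

def build_replicate_request(node_url, source, target):
    """Build (but do not send) a POST /_replicate request for `node_url`."""
    body = json.dumps({"source": source, "target": target}).encode("utf-8")
    return urllib.request.Request(
        node_url.rstrip("/") + "/_replicate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Hypothetical hosts and database name, for illustration only.
req = build_replicate_request(
    "http://new-node:5984",
    source="http://old-node:5984/kv",
    target="kv",
)
print(req.get_method(), req.full_url)
```

<p>Passing <code>req</code> to <code>urllib.request.urlopen</code> would ask the new node to pull the data from the old one.</p>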
<h4>Database implementation sketch</h4>
<p>How would we implement the database? If we think of our KV data as having the form</p>
<pre>
  ns1:key1 [_rev=1-12345]: { ...}
  ns1:key2 [_rev=2-78901]: { subkey1: ..., }
  ns2:key3 [_rev=1-43210]: { subkey1: ..., subkey2: ...}
</pre>
<p>where the first integer part of the _rev is dubbed &#8220;v&#8221; and the remainder part as &#8220;src&#8221;, then a somewhat obvious database schema looks like <em>(disclaimer: schema has not been tested, do not use <img src='http://s.wordpress.com/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' /> )</em>:</p>
<pre>
CREATE TABLE namespace (
  id varchar(64) CHARACTER SET ascii COLLATE ascii_bin
      NOT NULL PRIMARY KEY,
  state enum('enabled','disabled','deleted') NOT NULL
) ENGINE=InnoDB;

-- `key` is a reserved word in MySQL, so it is quoted with backticks.
CREATE TABLE {namespace}_key (
  ns varchar(64) CHARACTER SET ascii COLLATE ascii_bin
      NOT NULL,
  `key` varchar(180) CHARACTER SET ascii COLLATE ascii_bin
      NOT NULL,
  v smallint UNSIGNED NOT NULL,
  src int UNSIGNED NOT NULL,

  PRIMARY KEY (ns, `key`, v, src),
  FOREIGN KEY (ns) REFERENCES namespace(id)
) ENGINE=InnoDB;

CREATE TABLE {namespace}_value (
  ns varchar(64) CHARACTER SET ascii COLLATE ascii_bin
      NOT NULL,
  `key` varchar(180) CHARACTER SET ascii COLLATE ascii_bin
      NOT NULL,
  v smallint UNSIGNED NOT NULL,
  src int UNSIGNED NOT NULL,
  subkey varchar(255) CHARACTER SET utf8 COLLATE utf8_general_ci
      NOT NULL,
  small_value varchar(512) CHARACTER SET utf8 COLLATE utf8_general_ci
      DEFAULT NULL
      COMMENT 'will contain the value if it fits',
  large_value mediumtext CHARACTER SET utf8 COLLATE utf8_general_ci
      DEFAULT NULL
      COMMENT 'will contain the value if it is big',

  PRIMARY KEY (ns, `key`, v, src, subkey),
  FOREIGN KEY (ns) REFERENCES namespace(id),
  FOREIGN KEY (ns, `key`, v, src)
      REFERENCES {namespace}_key(ns, `key`, v, src)
      ON DELETE CASCADE
) ENGINE=InnoDB;
</pre>
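<p>The split of <code>_rev</code> into &#8220;v&#8221; and &#8220;src&#8221; can be sketched in a few lines of Python. The helper names are invented for this example, and it assumes the part after the dash is a decimal number, as in the sample data above, so that it fits the <code>src int UNSIGNED</code> column.</p>

```python
def split_rev(rev):
    """Split a revision string such as '2-78901' into (v, src) integers."""
    v_text, sep, src_text = rev.partition("-")
    if not sep:
        raise ValueError("expected 'v-src', got %r" % rev)
    # Assumption: src is a decimal number, as in the sample data.
    return int(v_text), int(src_text)

def key_row(ns, key, rev):
    """Build the (ns, key, v, src) tuple for one row of the key table."""
    v, src = split_rev(rev)
    return (ns, key, v, src)

print(key_row("ns1", "key2", "2-78901"))  # ('ns1', 'key2', 2, 78901)
```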
<p>With obvious queries including</p>
<pre>
  SELECT id FROM namespace WHERE state = 'enabled';

  SELECT `key` FROM {namespace}_key WHERE ns = ?;
  SELECT `key`, v, src FROM {namespace}_key WHERE ns = ?;
  SELECT v, src FROM {namespace}_key WHERE ns = ?
      AND `key` = ?;
  SELECT v, src FROM {namespace}_key WHERE ns = ?
      AND `key` = ? ORDER BY v DESC LIMIT 1;
  SELECT subkey, small_value FROM {namespace}_value
      WHERE ns = ? AND `key` = ? AND v = ? AND src = ?;
  SELECT large_value FROM {namespace}_value
      WHERE ns = ? AND `key` = ? AND v = ? AND src = ?
      AND subkey = ?;

  -- note: MySQL commits implicitly on DDL, so BEGIN/COMMIT
  -- does not make the CREATE/DROP statements atomic
  BEGIN;
  CREATE TABLE {namespace}_key (...);
  CREATE TABLE {namespace}_value (...);
  INSERT INTO namespace(id, state) VALUES (?, 'enabled');
  COMMIT;

  UPDATE namespace SET state = 'disabled' WHERE id = ?;
  UPDATE namespace SET state = 'deleted' WHERE id = ?;

  BEGIN;
  DROP TABLE {namespace}_value;
  DROP TABLE {namespace}_key;
  DELETE FROM namespace WHERE id = ?;
  COMMIT;

  INSERT INTO {namespace}_key (ns,`key`,v,src)
      VALUES (?,?,?,?);
  INSERT INTO {namespace}_value (ns,`key`,v,src,subkey,small_value)
      VALUES (?,?,?,?,?,?),(?,?,?,?,?,?),(?,?,?,?,?,?),(?,?,?,?,?,?);
  INSERT INTO {namespace}_value (ns,`key`,v,src,subkey,large_value)
      VALUES (?,?,?,?,?,?);

  DELETE FROM {namespace}_key WHERE ns = ? AND `key` = ?;
  DELETE FROM {namespace}_key WHERE ns = ? AND `key` = ?
      AND v &#60; ?;
  DELETE FROM {namespace}_key WHERE ns = ? AND `key` = ?
      AND v = ? AND src = ?;
</pre>
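<p>The &#8220;latest revision&#8221; lookup is the query a plain GET would lean on most, so here is a small runnable sketch of it in Python. It uses an in-memory sqlite3 database purely as a stand-in for MySQL/InnoDB, with the schema&#8217;s column names (<code>ns</code>, <code>key</code>, <code>v</code>, <code>src</code>); the table name <code>demo_key</code> and the sample rows are made up.</p>

```python
import sqlite3

# sqlite3 stands in for MySQL/InnoDB; "demo_key" is a hypothetical table name.
conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE demo_key ('
    ' ns TEXT NOT NULL, "key" TEXT NOT NULL,'
    ' v INTEGER NOT NULL, src INTEGER NOT NULL,'
    ' PRIMARY KEY (ns, "key", v, src))'
)
conn.executemany(
    'INSERT INTO demo_key (ns, "key", v, src) VALUES (?, ?, ?, ?)',
    [("ns1", "key1", 1, 12345), ("ns1", "key2", 1, 11111), ("ns1", "key2", 2, 78901)],
)

def latest_revision(conn, ns, key):
    """Return the highest (v, src) stored for a key, or None if absent."""
    return conn.execute(
        'SELECT v, src FROM demo_key WHERE ns = ? AND "key" = ?'
        ' ORDER BY v DESC, src DESC LIMIT 1',
        (ns, key),
    ).fetchone()

print(latest_revision(conn, "ns1", "key2"))     # (2, 78901)
print(latest_revision(conn, "ns1", "missing"))  # None
```

<p>The two results correspond to the two requests that matter most: a GET that returns 200 and a GET that returns 404.</p>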
<p>The usefulness of <code>{namespace}_value</code> is debatable; it helps a lot when implementing CouchDB views or some equivalent functionality (&#8220;get me all the documents in this namespace where subkey1=&#8230;&#8221;), but if we decide not to care, then it&#8217;s redundant, and <code>{namespace}_key</code> can instead grow its own small_value column (which should then be big enough to contain a typical JSON document, i.e. maybe 1k) and large_value column.</p>
<p>Partitioning the tables by <code>{namespace}</code> manually isn&#8217;t needed if we use MySQL 5.1 or later; table partitions could be used instead.</p>
<p>I&#8217;m not sure if we should have a &#8217;state&#8217; on the keys and do soft-deletes; that might make actual DELETE calls faster; it could also reduce the impact of compactions.</p>
<h4>Webapp implementation notes</h4>
<p>The java &#8220;CouchDB&#8221; webapp also does not seem that complicated to build (famous last words?). I would probably build it roughly the same way as <em>[some existing internal webapps]</em>.</p>
<p>The basic GET/PUT/DELETE operations are straightforward mappings onto queries that are also rather straightforward.</p>
<p>The POST /_replicate and POST /_compact operations are of course a little bit more involved, but not that much. Assuming some kind of a pool of url fetchers and some periodic executors&#8230;</p>
<p><strong>Replication:</strong></p>
<ol>
<li>get last-seen revision number for source</li>
<li>get list of updates from source</li>
<li>for each update
<ul>
<li>INSERT key</li>
<li>if duplicate key error, ignore and don&#8217;t update values</li>
<li>INSERT OR REPLACE all the values</li>
</ul>
</li>
</ol>
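<p>The per-update part of the replication loop (step 3) can be sketched as follows, again in Python with in-memory sqlite3 tables standing in for the MySQL ones. The table names, the single <code>value</code> column and the function name are simplifications made for this example; fetching the last-seen revision and the list of updates from the source (steps 1 and 2) is not shown.</p>

```python
import sqlite3

def apply_update(conn, ns, key, v, src, values):
    """Apply one replicated update; return False if the revision was already known."""
    try:
        conn.execute(
            'INSERT INTO demo_key (ns, "key", v, src) VALUES (?, ?, ?, ?)',
            (ns, key, v, src),
        )
    except sqlite3.IntegrityError:
        # Duplicate key: this revision is already here, so leave its values alone.
        return False
    conn.executemany(
        'INSERT OR REPLACE INTO demo_value (ns, "key", v, src, subkey, value)'
        ' VALUES (?, ?, ?, ?, ?, ?)',
        [(ns, key, v, src, subkey, value) for subkey, value in values.items()],
    )
    return True

# Hypothetical in-memory tables standing in for the key and value tables.
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE demo_key (ns TEXT, "key" TEXT, v INTEGER, src INTEGER,'
             ' PRIMARY KEY (ns, "key", v, src))')
conn.execute('CREATE TABLE demo_value (ns TEXT, "key" TEXT, v INTEGER, src INTEGER,'
             ' subkey TEXT, value TEXT, PRIMARY KEY (ns, "key", v, src, subkey))')

print(apply_update(conn, "ns1", "key2", 2, 78901, {"subkey1": "a"}))  # True
print(apply_update(conn, "ns1", "key2", 2, 78901, {"subkey1": "b"}))  # False
```

<p>The second call is a replay of a revision that is already stored, so the stored value stays untouched.</p>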
<p><strong>Compaction:</strong></p>
<ol>
<li>get list of namespaces</li>
<li>for each namespace:
<ul>
<li><code>SELECT `key`, v, src FROM {namespace}_key WHERE ns = ? ORDER BY `key` ASC, v DESC, src DESC;</code></li>
<li>skip the first row for each key</li>
<li>if the second row for the key is the same v, conflict, don&#8217;t compact for this key</li>
<li><code>DELETE IGNORE FROM {namespace}_key WHERE ns = ? AND key = ? AND v = ? AND src =?;</code></li>
</ul>
</li>
</ol>
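<p>The row-skipping logic of the compaction loop is easy to get subtly wrong, so here it is as a pure Python function. It takes the rows in the order the SELECT returns them and yields the rows that the DELETE would be issued for. The function name and the sample rows are invented for this sketch.</p>

```python
from itertools import groupby

def rows_to_delete(rows):
    """Given (key, v, src) rows sorted by key ASC, v DESC, src DESC,
    return the old revisions that compaction may delete."""
    doomed = []
    for key, group in groupby(rows, key=lambda row: row[0]):
        revisions = list(group)
        # Conflict: the two newest rows share the same v, so keep everything.
        if len(revisions) > 1 and revisions[0][1] == revisions[1][1]:
            continue
        # Skip the first (newest) row; everything older can go.
        doomed.extend(revisions[1:])
    return doomed

rows = [
    ("key1", 3, 500), ("key1", 2, 400), ("key1", 1, 300),
    ("key2", 2, 900), ("key2", 2, 800), ("key2", 1, 700),
    ("key3", 1, 100),
]
print(rows_to_delete(rows))  # [('key1', 2, 400), ('key1', 1, 300)]
```

<p>Here <code>key2</code> has two rows at v=2, which is treated as a conflict and left alone, and <code>key3</code> has only one revision, so nothing is deleted for it.</p>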
<p>So we need some kind of a replication record; once we have mysql available using &#8220;documents&#8221; seems awkward; let&#8217;s use a database table. We might as well have one more MySQL database on each server with a full copy of a &#8216;kvconfig&#8217; database, which is replicated around (using mysql replication) to all the nodes. Might also want to migrate away from NAMESPACE_METADATA documents&#8230;though maybe not, it <em>is</em> nice and flexible that way.</p>
<h4>Performance notes</h4>
<p>In theory, the couchdb on-disk format should be much faster than innodb for writes. In practice, innodb has seen quite a few years of tuning. More importantly, in our tests on our servers raw mysql performance seems to be rather better than couchdb. Some of that is due to the extra fsyncs in couchdb, but not all of it.</p>
<p>In theory, the erlang OTP platform should scale out much better than something java-based. In practice, the http server inside couchdb is pretty much a standard fork design using blocking I/O. More importantly, raw tomcat can take &#62;100k req/s on our hardware, which is much much more than our disks can do.</p>
<p>In theory, having the entire engine inside one process should be more efficient than java talking to mysql over TCP. In practice, I doubt this will really show up if we run java and mysql on the same box. More importantly, if this does become an issue, longer-term we may be able to &#8220;flatten the stack&#8221; by pushing the java &#8220;CouchDB&#8221; up into the service layer and merging it with the KV service, at which point java-to-mysql will be rather more efficient than java-to-couch.</p>
<p>In theory and in practice innodb has better indexes for the most common SELECTs/GETs so it should be a bit faster. It also is better at making use of large chunks of memory. I suspect the two most common requests (GET that returns 200, GET that returns 404) will both be faster, which incidentally are the most important for us to optimize, too.</p>
<p>We might worry java is slow. That&#8217;s kind-of silly <img src='http://s.wordpress.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> . In theory and in practice garbage collection makes software go faster. We just need to avoid doing those things that make it slow.</p>
<p>The overhead of ACID guarantees might be a concern. Fortunately MySQL is not _really_ a proper relational database if you don&#8217;t want it to be. We can probably set the transaction isolation level to READ UNCOMMITTED safely, and the schema design / usage pattern is such that we don&#8217;t need transactions in most places. More importantly we are keeping the eventual consistency model, with MVCC and all, on a larger scale. Any over-ACID-ness will be local to the particular node only.</p>
<p>Most importantly, this innodb/mysql thing is mature/boring technology that powers a lot of the biggest websites in the world. As such, you can buy books and consultancy and read countless websites about mysql/innodb/tomcat tuning. Its performance characteristics are pretty well-known and pretty predictable, and lots of people (including here at $work) can make those predictions easily.</p>
<h4>So when are we doing this?</h4>
<p>No no, we&#8217;re not, that&#8217;s not the point, this is just a RT! I woke up (rather early) with this idea in my head so I wrote it down to make space for other thoughts. At a minimum, I hope the above helps propagate some ideas:</p>
<ul>
<li>just how well we applied REST and service-oriented architecture here and the benefits it&#8217;s giving us</li>
<li>in particular because we picked the right architecture we are not stuck with / tied to CouchDB, now or later</li>
<li>we can always re-engineer things (though we should have good enough reasons)</li>
<li>things like innodb and/or bdb (or any of the old dbs) are actually great tools with some great characteristics</li>
</ul>
<h4>Just like FriendFeed?</h4>
<p>Bret Taylor has a good <a href="http://bret.appspot.com/entry/how-friendfeed-uses-mysql">explanation of how FriendFeed built a non-relational database on top of a relational one</a>. The approach outlined above reminds me rather a lot of the solution they implemented, though there are also important differences.</p>
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[CouchDB]]></title>
<link>http://simposiotecnico.wordpress.com/2009/10/05/couchdb/</link>
<pubDate>Tue, 06 Oct 2009 01:57:15 +0000</pubDate>
<dc:creator>belzabub</dc:creator>
<guid>http://simposiotecnico.wordpress.com/2009/10/05/couchdb/</guid>
<description><![CDATA[Videos Google Tech Talk CouchDB and Me CouchDB From 10,000 Feet CouchDB in a Real-World Setting Docu]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p>Videos<br />
<a href="http://www.youtube.com/watch?v=ESDBM9-U804">Google Tech Talk</a><br />
<a href="http://www.infoq.com/presentations/katz-couchdb-and-me">CouchDB and Me</a><br />
<a href="http://www.infoq.com/presentations/couchDB-from-10K-feet">CouchDB From 10,000 Feet</a><br />
<a href="http://www.infoq.com/presentations/CouchDB-Real-World-Setting-Jan-Lehnardt">CouchDB in a Real-World Setting</a></p>
<p>Documentation<br />
<a href="http://www.slideshare.net/jchrisa/couchdb-local-web-platform">Slides</a><br />
<a href="http://books.couchdb.org/relax/">Book</a><br />
<a href="http://www.infoq.com/news/2007/11/the-rdbms-is-not-enough">The RDBMS is not enough</a></p>
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[The headache of mapping shards to servers]]></title>
<link>http://lsimons.wordpress.com/2009/10/01/the-headache-of-mapping-shards-to-servers/</link>
<pubDate>Thu, 01 Oct 2009 20:48:34 +0000</pubDate>
<dc:creator>Leo Simons</dc:creator>
<guid>http://lsimons.wordpress.com/2009/10/01/the-headache-of-mapping-shards-to-servers/</guid>
<description><![CDATA[At work we had a lot of headache figuring out how to reshard our CouchDB data. We have 2 data center]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p>At work we had a lot of headache figuring out how to reshard our CouchDB data. We have 2 data centers with 16 CouchDB instances each. One server holds 4 CouchDB nodes. At the moment each data center has one copy of the data. We want to improve resilience so we are changing this so that each data center has two copies of the data (on different nodes of course). Figuring out how to reshard was not so simple at all.</p>
<p>This inspired some thinking about how we would do the same to our MySQL instances. It&#8217;s not a challenge yet (we have much more KV data than relational data) but if we&#8217;re lucky we&#8217;ll get much more data pretty soon, and the issue will pop up in a few months.</p>
<p>I was thinking about what makes this so hard to think about (for someone with a small brain like me at least). It is probably about letting go of symmetry. Do you have the same trouble?</p>
<p>Imagine you have a database with information about users and perhaps data for/by those users. You might use horizontal partitioning with consistent hashing to distribute this data across two machines, which use master/slave replication between them for resilience. You might partition the data into four shards so you can scale out to 4 physical master servers later without repartitioning. It&#8217;d look like this:</p>
<div><img src="http://lsimons.wordpress.com/files/2009/10/mysql-replication.png" alt="diagram of shards on 2 mirrored servers" title="mysql-replication" width="225" height="227" /></div>
<p>Now imagine that you add a new data center and you need to split the data between the two data centers. Assuming your database only does master/slave (like MySQL) and you prefer to daisy-chain the replication it might look like this:</p>
<div><img src="http://lsimons.wordpress.com/files/2009/10/mysql-dual-site.png" alt="diagram of shards on 2 servers in 2 data centers" title="mysql-dual-site" width="464" height="236" /></div>
<p>Now imagine adding a third data center and an additional machine in each data center to provide extra capacity for reads. Maybe:</p>
<div><img src="http://lsimons.wordpress.com/files/2009/10/mysql-multi-site.png" alt="diagram of shards on 3 servers in 3 data centers" title="mysql-multi-site" width="636" height="370" /></div>
<p>You can see that the configuration has suddenly become a bit unbalanced and also somewhat non-obvious. Given this availability of hardware, the most resilient distribution of masters and slaves is not symmetric at all. When you lose symmetry the configuration becomes much more complicated to understand and manage.</p>
<p>Now imagine a bigger database. To highlight how to deal with asymmetry, imagine 4 data centers with 3 nodes per data center. Further imagine having 11 shards to distribute, wanting at least one slave in the same data center as its master. Further imagine wanting 3 slaves for each master. Further imagine wanting to have the data split as evenly as possible across all data centers, so that you can lose up to half of them without losing any data.</p>
<p>Can you work out how to do that configuration? Maybe:</p>
<div><img src="http://lsimons.wordpress.com/files/2009/10/circles-in-boxes1.png" alt="diagram showing many shards on many servers in many data centers" title="circles-in-boxes" width="592" height="578" /></div>
<p>Can you work out how to do it for, oh, 27 data centers, 400 shards with at least 3 copies of each shard, with 7 data centers having 15% beefier boxes and one data center having twice the number of boxes?</p>
<p>As the size of the problem grows, a diagram of a good solution looks more and more like a circle that quickly becomes hard to draw.</p>
<p>When you add in the need to be able to reconfigure server allocations, to fail over masters to one of their slaves, and more, it turns out you eventually need software assistance to solve the mapping of shards to servers. We haven&#8217;t written such software yet; I suspect we can steal it from Voldemort by the time we need it <img src='http://s.wordpress.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>
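<p>As a rough illustration of what such software would have to do, here is a naive greedy placement in Python for the 4-data-center, 11-shard example. It is a sketch under stated assumptions, not how Voldemort does it. It guarantees only that each shard&#8217;s copies sit on distinct nodes and that the first slave shares a data center with its master; it does not guarantee survival of losing half the data centers, and it assumes at least two nodes per data center and at least two copies per shard. All names are hypothetical.</p>

```python
def place_shards(num_dcs, nodes_per_dc, num_shards, copies):
    """Greedy sketch: put each shard's master and slaves on lightly loaded nodes.

    Returns {shard: [master_node, slave_node, ...]} with nodes as (dc, index).
    """
    nodes = [(dc, i) for dc in range(num_dcs) for i in range(nodes_per_dc)]
    load = {node: 0 for node in nodes}
    placement = {}

    def pick(candidates):
        # Least-loaded candidate; ties resolve to the earliest node in the list.
        node = min(candidates, key=lambda n: load[n])
        load[node] += 1
        return node

    for shard in range(num_shards):
        master = pick(nodes)
        # First slave: same data center as the master, on a different node.
        same_dc = [n for n in nodes if n[0] == master[0] and n != master]
        chosen = [master, pick(same_dc)]
        # Remaining slaves: prefer other data centers to spread the copies out.
        for _ in range(copies - 2):
            unused = [n for n in nodes if n not in chosen]
            elsewhere = [n for n in unused if n[0] != master[0]]
            chosen.append(pick(elsewhere or unused))
        placement[shard] = chosen
    return placement

layout = place_shards(num_dcs=4, nodes_per_dc=3, num_shards=11, copies=4)
for shard, chosen in layout.items():
    print(shard, chosen)
```

<p>Even this toy version shows where the complexity comes from: every extra requirement (rack awareness, beefier boxes, failover order) becomes another filter or weight inside <code>pick</code>.</p>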
<p>The tipping point is when you lose symmetry (the 3rd data centre, the 5th database node, etc).</p>
<p>Consistent hashing helps make it easier to do resharding and node rewiring, especially when you&#8217;re not (only) dependent on something like mysql replication. But figuring out what goes in which bucket is not as easy as figuring out which bucket goes where. Unless you have many hundreds of buckets, then maybe you can assume your distribution of buckets is even enough if you hash the bucket id to find the server id.</p>
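<p>The last point can be checked with a few lines of Python: hash each bucket id, take it modulo the number of servers, and look at how many buckets each server ends up with. This is plain hash-modulo rather than a consistent-hash ring, the choice of SHA-256 is arbitrary, and the bucket and server counts are made-up numbers.</p>

```python
import hashlib
from collections import Counter

def server_for_bucket(bucket_id, num_servers):
    """Map a bucket id to a server index by hashing the id."""
    digest = hashlib.sha256(str(bucket_id).encode("utf-8")).hexdigest()
    return int(digest, 16) % num_servers

def spread(num_buckets, num_servers):
    """Count how many buckets land on each server."""
    return Counter(server_for_bucket(b, num_servers) for b in range(num_buckets))

# Compare a handful of buckets with many hundreds (server count is arbitrary).
for num_buckets in (16, 800):
    counts = spread(num_buckets, num_servers=12)
    print(num_buckets, "buckets:", sorted(counts.values()))
```

<p>With few buckets, some servers typically get several and others none; with many hundreds, the relative spread between servers is usually much narrower.</p>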
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[Larry Ellison to SUN Customers – Poster]]></title>
<link>http://israelany.wordpress.com/2009/10/01/larry-ellison-to-sun-customers-%e2%80%93-poster/</link>
<pubDate>Thu, 01 Oct 2009 06:18:56 +0000</pubDate>
<dc:creator>israelagnouhyattara</dc:creator>
<guid>http://israelany.wordpress.com/2009/10/01/larry-ellison-to-sun-customers-%e2%80%93-poster/</guid>
<description><![CDATA[Larry Ellison is easily one of the greatest strategists in the software industry (ever). Let]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p><img class="alignnone size-full wp-image-251" title="Oracle Plans for SUN Customers after Acquisition" src="http://israelany.wordpress.com/files/2009/10/oracle.jpg" alt="Oracle Plans for SUN Customers after Acquisition" width="500" height="727" /></p>
<p>Larry Ellison is easily one of the greatest strategists in the software industry (ever). Let&#8217;s all hope that his role at Oracle comes in handy for MySQL and Solaris to receive continued support and deserved upgrades at their new home.</p>
<p>The NoSQL movement still has a lot more ground to cover before capturing the full attention of IT managers and developers over using RDBMS-based systems (e.g. MySQL). Oracle still has time to dress MySQL up, or trash it, in favor of a NoSQL-based data management suite.</p>
<p>Michael Stonebraker (co-creator of Ingres and Postgres), in the 70&#8217;s, criticized the usability, scalability, and performance of RDBMS-based systems. His reasons were that:</p>
<blockquote><p>In the data warehouse market, a column store beats a row store by approximately a factor of 50 on typical business intelligence queries. The reason is because column stores read only the columns of interest to the query and not all of them. In addition, compression is more effective in a column store. Since the legacy systems are all row stores, they are vulnerable to competition from the newer column stores.</p>
<p>… In the online transaction processing (OLTP) market, a lightweight main memory DBMS beats a row store by a factor of 50. Leveraging main memory and the fact that no DBMS application will send a message to a human user in the middle of a transaction, allows an OLTP DBMS to run transactions to completion with no resource contention or locking overhead.</p>
<p>… In the science DBMS market, users have never liked relational DBMSs and  want a non-relational model and query facility.</p>
<p>… Text applications have never used relational DBMSs. This was pointed out to me most clearly by Eric Brewer nearly 15 years ago in the early days of Inktomi. He wanted to use a relational DBMS to store the results of Web crawling, but found RDBMS to be two orders of magnitude slower than a home-brew system. All the major Web-search engines use home-brew text software to serve us search results. None use relational DBMSs.</p></blockquote>
<p>He predicted their demise and saw the dawn of a new era decades ago:<br />
<a title="InfoQ - NoSQL and End of RDBMS Era" href="http://www.infoq.com/news/2009/08/NoSQL-and-the-End-of-RDBMS-Era" target="_self">http://www.infoq.com/news/2009/08/NoSQL-and-the-End-of-RDBMS-Era</a></p>
<p>All is not so gravy yet for the NoSQL movement:<br />
<a title="NoSQL - If Only It Was That Easy" href="http://bjclark.me/2009/08/04/nosql-if-only-it-was-that-easy/" target="_self">http://bjclark.me/2009/08/04/nosql-if-only-it-was-that-easy/</a></p>
<p>I wouldn&#8217;t be surprised if Oracle jumped on the NoSQL train in order for it to compete with open-source packages in the enterprise.</p>
<p>DBAs need to start planning to migrate to, and learn the architectures of, already thriving open-source projects such as:</p>
<p><a title="CouchDB" href="http://couchdb.apache.org/" target="_self">http://couchdb.apache.org</a><a title="MongoDB" href="http://www.mongodb.org/" target="_self"><br />
http://www.mongodb.org</a><br />
<a title="Project Voldemort" href="http://project-voldemort.com" target="_self">http://project-voldemort.com</a></p>
<p>Or a hybrid solution such as HadoopDB:<br />
<a title="HadoopDB" href="http://db.cs.yale.edu/hadoopdb/hadoopdb.html" target="_self">http://db.cs.yale.edu/hadoopdb/hadoopdb.html</a></p>
<p>Switching gears back to Oracle&#8217;s acquisition of SUN: Business Week magazine came out with an article clarifying why the European Union will not stop Oracle from taking over SUN:</p>
<p><a title="Why Europe Won't Stop Oracle from taking over SUN" href="http://www.businessweek.com/technology/content/sep2009/tc2009093_421812.htm" target="_self">http://www.businessweek.com/technology/content/sep2009/tc2009093_421812.htm</a></p>
<p>Let&#8217;s all keep our eyes open for January 19, 2010, as the EU&#8217;s set date for making a final decision on the case which will determine if Oracle goes forward with MySQL (hopefully) as a hybrid solution &#8212; like HadoopDB.</p>
<p>I enjoy using open-source technologies, but Oracle pumps out extraordinary enterprise-ready solutions that far outweigh free alternatives by features, documentation, and security.</p>
<p>Take what I say with a grain of salt; I am no guru, of course. I am simply enthused to see competition which will hopefully benefit programmers and end-users.</p>
<p><strong>Reference</strong></p>
<p>[1] Oracle Poster on SUN Acquisition to Customers: <a title="TechCrunch Original Poster" href="http://cache0.techcrunch.com/wp-content/uploads/2009/09/oracle.jpg" target="_self">http://cache0.techcrunch.com/wp-content/uploads/2009/09/oracle.jpg</a></p>
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[Swinger is cool, Sammy looks cooler]]></title>
<link>http://contourline.wordpress.com/2009/09/30/swinger-is-cool-sammy-looks-cooler/</link>
<pubDate>Thu, 01 Oct 2009 06:02:57 +0000</pubDate>
<dc:creator>jmarca</dc:creator>
<guid>http://contourline.wordpress.com/2009/09/30/swinger-is-cool-sammy-looks-cooler/</guid>
<description><![CDATA[Just tried out swinger.  It is cool.  But I can&#8217;t get authorization to work right using the tr]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p>Just tried out <a href="http://github.com/quirkey/swinger" target="_blank">swinger</a>.  It is cool.  But I can&#8217;t get authorization to work right using the trunk checkout of couch (0.11.blahblah_git).  Something to hack on.</p>
<p>But I&#8217;m more interested in playing with <a href="http://github.com/quirkey/sammy" target="_blank">Sammy.js</a>.  The two application stack figures on the <a href="http://www.quirkey.com/blog/2009/09/15/sammy-js-couchdb-and-the-new-web-architecture/" target="_blank">blog page</a> (and in the Swinger slides) are interesting.  Take away the couchdb bit, add Sakai&#8217;s K2, and you&#8217;ve got a very similar picture.  Sure couchdb can serve the app with attachments to the _design doc, but that&#8217;s not the point.  The point is being able to stick documents into a db and then get them out again in interesting ways without having to bend over backwards on the server side.</p>
<p>But again, I have to play with it for a while and see what it can do.</p>
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[google tech talk: couchdb]]></title>
<link>http://xylld.wordpress.com/2009/09/30/google-tech-talk-couchdb/</link>
<pubDate>Wed, 30 Sep 2009 17:19:45 +0000</pubDate>
<dc:creator>xylld</dc:creator>
<guid>http://xylld.wordpress.com/2009/09/30/google-tech-talk-couchdb/</guid>
<description><![CDATA[If you haven&#8217;t heard of it and are interested in scaling databases of any sort or making fa]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p>If you haven&#8217;t heard of it and are interested in scaling databases of any sort or making fast web applications, I highly suggest you check out this video.  On a high level, it sounds like everything you&#8217;ve ever wanted out of a database.  I am absolutely thrilled about it and am excited to start using it.  It also gives me a good reason to actually get down to learning a functional language, erlang.</p>
<p><a href="http://www.youtube.com/watch?v=ESDBM9-U804">http://www.youtube.com/watch?v=ESDBM9-U804</a></p>
<p>Let me know what you think!</p>
</div>]]></content:encoded>
</item>

</channel>
</rss>
