<?xml version="1.0" encoding="UTF-8"?><!-- generator="wordpress.com" -->
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	>

<channel>
	<title>semi-supervised-learning &#171; WordPress.com Tag Feed</title>
	<link>http://en.wordpress.com/tag/semi-supervised-learning/</link>
	<description>Feed of posts on WordPress.com tagged "semi-supervised-learning"</description>
	<pubDate>Fri, 24 May 2013 04:48:59 +0000</pubDate>

	<generator>http://en.wordpress.com/tags/</generator>
	<language>en</language>

<item>
<title><![CDATA[A spectrum of graphs]]></title>
<link>http://magsol.wordpress.com/2012/08/21/a-spectrum-of-graphs/</link>
<pubDate>Tue, 21 Aug 2012 21:31:54 +0000</pubDate>
<dc:creator>magsol</dc:creator>
<guid>http://magsol.wordpress.com/2012/08/21/a-spectrum-of-graphs/</guid>
<description><![CDATA[Adequately describes how I&#8217;ve felt the last couple weeks. Millions of miles away, the Curiosit]]></description>
<content:encoded><![CDATA[<div id="attachment_1473" class="wp-caption aligncenter" style="width: 330px"><a href="http://magsol.files.wordpress.com/2012/08/ie_aqkohkug1cod76vogsw2.gif"><img class="size-full wp-image-1473" title="ie_aqkOhKUG1cod76voGSw2" src="http://magsol.files.wordpress.com/2012/08/ie_aqkohkug1cod76vogsw2.gif?w=320&#038;h=228" alt="" width="320" height="228" /></a><p class="wp-caption-text">Adequately describes how I&#8217;ve felt the last couple weeks.</p></div>
<p>Millions of miles away, the Curiosity rover is going through an <a href="http://blogs.discovermagazine.com/badastronomy/2012/08/21/curiosity-spins-its-wheels/" target="_blank">exhaustive battery of tests</a> to make sure everything&#8217;s kosher before embarking on what will almost surely be the most thorough and informative Mars exploration to date.</p>
<p>Meanwhile, we&#8217;re stuck on this planet with <a href="http://www.huffingtonpost.com/eve-ensler/todd-akin-rape_b_1812930.html" target="_blank">morons like this guy</a> (irony of ironies: he&#8217;s on the Congressional Science Committee. <a href="http://science.house.gov/about/membership" target="_blank">NO JOKE</a>).</p>
<p>In case I&#8217;ve been too subtle: yes, it&#8217;s been a ball-busting couple of weeks in the &#8216;burgh. Work has been extremely nose-to-the-grindstone, though I&#8217;m optimistic about how the impending fall semester will pan out. Nevertheless, here is a glimpse into the work that&#8217;s been keeping me up late:</p>
<p><a href="http://magsol.files.wordpress.com/2012/08/veggie-cat-food_1_thumb.jpg"><img class="aligncenter size-full wp-image-1474" title="veggie-cat-food_1_thumb" src="http://magsol.files.wordpress.com/2012/08/veggie-cat-food_1_thumb.jpg?w=96&#038;h=96" alt="" width="96" height="96" /></a></p>
<p>Yep!</p>
<p>Ok but really: I&#8217;ve been working on implementing a shiny new <a href="http://en.wikipedia.org/wiki/Semi-supervised_learning" target="_blank">semi-supervised</a> clustering algorithm, the basic idea being that we know what <em>some</em> of the data is, but the vast majority of it is of an unknown nature. In the case of the images I&#8217;m using to test it: can I separate the foreground from the background? The catch here is that I&#8217;m giving the algorithm the entire image, but only a handful of pixels are labeled as actual &#8220;foreground&#8221; or &#8220;background&#8221; pixels; it has to figure out the rest on its own.</p>
<p>What&#8217;s super cool about machine learning with images is that you get stuff like this:</p>
<table width="100%" border="0" cellspacing="0" cellpadding="5">
<tbody>
<tr>
<td><a href="http://magsol.files.wordpress.com/2012/08/eigenvector1.png"><img class="aligncenter size-medium wp-image-1475" title="eigenvector1" src="http://magsol.files.wordpress.com/2012/08/eigenvector1.png?w=300&#038;h=226" alt="" width="300" height="226" /></a></td>
<td><a href="http://magsol.files.wordpress.com/2012/08/eigenvector2.png"><img class="aligncenter size-medium wp-image-1476" title="eigenvector2" src="http://magsol.files.wordpress.com/2012/08/eigenvector2.png?w=300&#038;h=226" alt="" width="300" height="226" /></a></td>
</tr>
<tr>
<td><a href="http://magsol.files.wordpress.com/2012/08/eigenvector3.png"><img class="aligncenter size-medium wp-image-1477" title="eigenvector3" src="http://magsol.files.wordpress.com/2012/08/eigenvector3.png?w=300&#038;h=226" alt="" width="300" height="226" /></a></td>
<td><a href="http://magsol.files.wordpress.com/2012/08/eigenvector4.png"><img class="aligncenter size-medium wp-image-1478" title="eigenvector4" src="http://magsol.files.wordpress.com/2012/08/eigenvector4.png?w=300&#038;h=224" alt="" width="300" height="224" /></a></td>
</tr>
<tr>
<td colspan="2"><a href="http://magsol.files.wordpress.com/2012/08/eigenvector5.png"><img class="aligncenter size-medium wp-image-1479" title="eigenvector5" src="http://magsol.files.wordpress.com/2012/08/eigenvector5.png?w=300&#038;h=226" alt="" width="300" height="226" /></a></td>
</tr>
</tbody>
</table>
<p>Those familiar with this sort of thing will likely recognize the above images as eigenvectors of the original image. For any sort of spectral analysis where graphs are involved, finding eigenvectors is pretty much par for the course. It&#8217;s how you <em>use</em> the eigenvectors that makes each algorithm distinct. I&#8217;ve been working on implementing this one, ideally smoothing its implementation to the point where I can submit it to <a href="http://scikit-learn.org/" target="_blank">these guys</a>.</p>
<p>And then put it into Mahout. But that&#8217;s still a ways away (and I have other work to do there that&#8217;s more pressing than adding new algorithms).</p>
<p>Ultimately, I get this:</p>
<p><a href="http://magsol.files.wordpress.com/2012/08/clusters.png"><img class="aligncenter size-full wp-image-1480" title="clusters" src="http://magsol.files.wordpress.com/2012/08/clusters.png?w=640&#038;h=482" alt="" width="640" height="482" /></a></p>
<p>It&#8217;s not ideal, particularly since the few labeled pixels I handed off to the algorithm included the bowl as &#8220;background&#8221;, and yet the algorithm decided it knew better than I did and included it as foreground. So there are still some things to iron out.</p>
<p>But this is kinda neat! <img src='http://s0.wp.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>
<p>We&#8217;re hoping to get a paper out in the next couple weeks on this. Sadly, it won&#8217;t involve cats; rather, our goal is to deploy it on small molecules. While I&#8217;m not structural biology&#8217;s biggest fan (read: I leave the seminar when I learn it&#8217;s on structural biology), this particular application is really interesting to me.</p>
<p>And yes, one of my to-dos is to optimize the algorithm a bit more, so it can handle images that are bigger than little thumbnails. As in most spectral algorithms, one needs to create a graph matrix that is #pixels-by-#pixels. For 50&#215;50 images (like the little cat above), this is only 2500 total pixels, resulting in a graph matrix that&#8217;s 2500&#215;2500. That&#8217;s only about 6.25 million numbers, or roughly 50MB at 8 bytes apiece, which fits comfortably in memory. But for even slightly larger images&#8211;say, <a href="http://knithacker.com/html/knithacker/wp-content/uploads/2012/03/lolcat1.jpg" target="_blank">640&#215;480</a>&#8211;now the number of pixels is 307k, resulting in a graph matrix that holds more than 94 <em>billion</em> numbers, or about 755GB as dense 8-byte values. Most home computers right now only have 4 or 8GB of memory, and would grind to a halt (or simply fail) when attempting to create the dense graph matrix.</p>
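<p>A quick sanity check of that arithmetic, as a minimal sketch. It assumes a fully dense matrix of 8-byte floats, which is an assumption of this example and not necessarily how the real implementation stores the graph; the function name is made up for illustration.</p>

```python
def dense_graph_matrix_stats(width, height, bytes_per_entry=8):
    """Return (pixels, entries, gigabytes) for a dense pixels-by-pixels matrix."""
    pixels = width * height
    entries = pixels * pixels                     # one entry per pair of pixels
    gigabytes = entries * bytes_per_entry / 1e9   # decimal gigabytes
    return pixels, entries, gigabytes


for width, height in [(50, 50), (640, 480)]:
    pixels, entries, gigabytes = dense_graph_matrix_stats(width, height)
    print(f"{width}x{height}: {pixels:,} pixels, {entries:,} entries, {gigabytes:.2f} GB")
```

<p>The count grows with the square of the pixel count, which is why a sparse graph (each pixel connected only to a few neighbors) is the usual way out.</p>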
<p><a href="http://www.youtube.com/watch?v=BfGkhhm4vXw&#38;feature=player_detailpage#t=24s" target="_blank">HOORAY MATHS</a>!</p>
<p>Other exciting news? September is almost here, which meaaaaaans:</p>
<ul>
<li>Sept 1: <a href="http://runforyourlives.com/locations/pittsburgh-overview/" target="_blank">Zombie 5k</a></li>
<li>Sept 15: <a href="http://www.usafmarathon.com/" target="_blank">USAF half marathon</a></li>
<li>Sept 21-22: <a href="http://ragnarrelay.com/race/dc/" target="_blank">D.C. Ragnar</a></li>
<li>Sept 30: <a href="http://www.rungreatrace.com/" target="_blank">Great Race 10k</a></li>
</ul>
<p>I am trying to get back on a regular posting regimen. I&#8217;m thinking once a week on the same day each week. How do Tuesdays sound?</p>
<p>When the stress levels get too high, I just look at this picture. Enjoy!</p>
<p><a href="http://magsol.files.wordpress.com/2012/08/b00a3f04-4bdc-4732-9e32-4e012402cc7a.jpeg"><img class="aligncenter size-full wp-image-1481" title="b00a3f04-4bdc-4732-9e32-4e012402cc7a" src="http://magsol.files.wordpress.com/2012/08/b00a3f04-4bdc-4732-9e32-4e012402cc7a.jpeg?w=500&#038;h=375" alt="" width="500" height="375" /></a></p>
]]></content:encoded>
</item>
<item>
<title><![CDATA[Semi-Supervised Shape Classification with Manifold Regularization]]></title>
<link>http://snikolov.wordpress.com/2012/08/20/semi-supervised-shape-classification-with-manifold-regularization/</link>
<pubDate>Mon, 20 Aug 2012 04:59:18 +0000</pubDate>
<dc:creator>snikolov</dc:creator>
<guid>http://snikolov.wordpress.com/2012/08/20/semi-supervised-shape-classification-with-manifold-regularization/</guid>
<description><![CDATA[For my Statistical Learning Theory class I did a project on shape classification using manifold regu]]></description>
<content:encoded><![CDATA[<p>For my <a href="http://www.mit.edu/~9.520/">Statistical Learning Theory class</a> I did a project on shape classification using manifold regularization. You can read the abstract below. You can also find the paper <a href="https://github.com/snikolov/shape-manifold/blob/master/doc/shape.pdf">here</a> and the code <a href="https://github.com/snikolov/shape-manifold">here</a>.</p>
<blockquote><p>We approach the problem of semi-supervised shape classification by exploiting the geometric structure of shape data. We apply manifold regularization to learn a function from shapes to class labels. Central to manifold regularization algorithms is the use of a weighted graph to represent pairwise relationships between training points and capture their geometric structure. Under a regularized least squares formulation, each algorithm only involves solving a linear system of equations. We analyze the classification performance for different features, regularization parameters, percentages of labeled data, and levels of noise in the labels. We show that encouraging the smoothness of the classification function on the manifold improves classification performance beyond simply encouraging smoothness in the ambient space. We compare raw pixel features to Signed Distance Function (SDF) features and find that SDF features yield consistently higher accuracy. Finally, we compare manifold regularization to the much simpler k-Nearest-Neighbors and show that manifold regularization is consistently better.</p>
<p>Keywords: shape classification, manifold regularization, semi-supervised learning.</p></blockquote>
]]></content:encoded>
</item>
<item>
<title><![CDATA[CVPR2010: Semi-Supervised Learning in Vision]]></title>
<link>http://courze.wordpress.com/2011/05/16/cvpr2010-semi-supervised-learning-in-vision/</link>
<pubDate>Mon, 16 May 2011 04:30:07 +0000</pubDate>
<dc:creator>anonymous</dc:creator>
<guid>http://courze.wordpress.com/2011/05/16/cvpr2010-semi-supervised-learning-in-vision/</guid>
<description><![CDATA[slides on slideshare.net homepage Semi-Supervised Learning in Vision, CVPR 2010 Tutorial We are orga]]></description>
<content:encoded><![CDATA[<p>slides on slideshare.net</p>
<p><a href="http://www.icg.tugraz.at/Members/Saffari/ssl-cvpr2010" target="_blank">homepage</a></p>
<h3 id="parent-fieldname-title">Semi-Supervised Learning in Vision, CVPR 2010 Tutorial</h3>
<pre class="brush: plain; collapse: true; gutter: false; light: false; title: ; toolbar: true; notranslate" title="">We are organizing a tutorial at CVPR 2010 on &#34;Semi-Supervised Learning in Vision&#34;. The following is a brief introduction and a set of topics we will cover in the tutorial. Please give us your feedback regarding this tutorial, for example, if you would like to see a method or topic which is not covered by the following list, or if you would like to see a topic discussed in more details, please let us know.

We also plan to publish a set of open/closed source software available for semi-supervised learning. So if you would like to have your package included here, please send us a short description of your method and a link to where it can be obtained.</pre>
<h3>Introduction</h3>
<pre class="brush: plain; collapse: true; gutter: false; light: false; title: ; toolbar: true; notranslate" title="">Current supervised approaches obtain high recognition rates if enough labeled training data is available. However, for most practical problems there is simply not enough labeled data available, whereas hand-labeling is tedious and expensive, in some cases not even feasible. This is especially true for applications in computer vision like object recognition and categorization from images and videos, where the human effort is needed to determine the true contents of the media. Semi-supervised methods offer an interesting solution to this problem by learning from both labeled and unlabeled data. These methods try to give an answer to the question: “How to improve classification accuracy using unlabeled data together with the labeled data?”.

This course will cover an introduction to semi-supervised learning, its applications in computer vision, and the open problems and challenges facing future research in this field. Nowadays, the Internet offers a huge amount of data in the form of unlabeled data (or data labeled with a high degree of uncertainty), and learning from the Internet is becoming more and more widespread in computer vision. Therefore, we will have a special focus on issues dealing with large-scale and on-line semi-supervised learning tasks.

The first half of the course will address the basics of semi-supervised learning and its relations to other machine learning domains. It will cover the major works in this field from a unified point of view, and will discuss the advantages and disadvantages of these methods from theoretical and application perspectives. In the second part, we will focus on the applications of the semi-supervised learning in computer vision, and the open challenges and gaps in existing methods.</pre>
<h3>Slides</h3>
<ol>
<li>Introductions and motivations: Horst Bischof [<a href="http://www.ymer.org/papers/files/2010-cvpr-ssl-slides-intro.pdf">slides</a>] (the <a href="http://www.slideshare.net/zukun/cvpr2010-semisupervised-learning-in-vision-part-1-introduction" target="_blank">slides</a> on slideshare.net)</li>
<li>Theory of Semi-Supervised Learning: Amir Saffari [<a href="http://www.ymer.org/papers/files/2010-cvpr-ssl-slides-theory.tar.gz">slides</a>] (the <a href="http://www.slideshare.net/zukun/cvpr2010-semisupervised-learning-in-vision-part-2-theory" target="_blank">slides</a> on slideshare.net)</li>
<li>Algorithms and Applications: Christian Leistner [<a href="http://www.ymer.org/papers/files/2010-cvpr-ssl-slides-algorithms.tar.gz">slides</a>] (the <a href="http://www.slideshare.net/zukun/cvpr2010-semisupervised-learning-in-vision-part-3-algorithms-and-applications" target="_blank">slides</a> on slideshare.net)</li>
</ol>
<h3>Detailed Outline</h3>
<pre class="brush: plain; collapse: true; gutter: false; light: false; title: ; toolbar: true; notranslate" title="">
   1. Motivations.
   2. A unified perspective on learning from unlabeled data.
   3. Statistical models for semi-supervised learning.
   4. Semi-supervised learning with margins.
   5. Methods based on manifold learning.
   6. Co-training and learning from multiple-views.
   7. Semi-supervised learning for large-scale problems.
   8. The relations to other fields: unsupervised learning, one-shot learning, transfer learning, and multiple instance learning.
   9. Applications: object recognition and categorization.
  10. Applications: object segmentation.
  11. Applications: object tracking.
  12. Applications: activity recognition.
  13. Open problems and challenges.
</pre>
<h3>Target Audience</h3>
<pre class="brush: plain; collapse: true; gutter: false; light: false; title: ; toolbar: true; notranslate" title="">The course will be interesting for those who investigate or apply learning methods in computer vision. The course is directed towards the researchers, practitioners and PhD students working on related topics. The tutorial is self-contained in the sense that it requires a minimum knowledge of basic mathematical concepts, such as statistics and linear algebra.</pre>
<h3>Organizers</h3>
<ul>
<li><a href="http://www.ymer.org/amir/">Amir Saffari</a>, Graz University of Technology</li>
<li><a href="http://www.icg.tugraz.at/Members/leistner">Christian Leistner</a>, Graz University of Technology</li>
<li><a href="http://www.icg.tugraz.at/Members/author/bischof">Horst Bischof</a>, Graz University of Technology</li>
</ul>
]]></content:encoded>
</item>
<item>
<title><![CDATA[[ArXiv] classifying spectra]]></title>
<link>http://hyunsook.wordpress.com/2009/10/23/arxiv-classifying-spectra/</link>
<pubDate>Fri, 23 Oct 2009 17:36:43 +0000</pubDate>
<dc:creator>HLee</dc:creator>
<guid>http://hyunsook.wordpress.com/2009/10/23/arxiv-classifying-spectra/</guid>
<description><![CDATA[[arXiv:stat.ME:0910.2585] Variable Selection and Updating In Model-Based Discriminant Analysis for H]]></description>
<content:encoded><![CDATA[<blockquote><p>
<a href="http://arxiv.org/abs/0910.2585">[arXiv:stat.ME:0910.2585]</a><br />
Variable Selection and Updating In Model-Based Discriminant Analysis for High Dimensional Data with Food Authenticity Applications<br />
by <i>Murphy, Dean, and Raftery</i>
</p></blockquote>
<p>Classifying or clustering spectra (or applying semi-supervised learning to them) is a very challenging problem, from collecting statistical-analysis-ready data to reducing the dimensionality without sacrificing the complex information in each spectrum. The challenge lies not only in estimating spiky (non-differentiable) curves via statistically well-defined estimating-equation procedures, but also in transforming the data so that they satisfy the regularity conditions that statistical methods assume.<br />
<!--more--><br />
Another reason that classifying and clustering astrophysical spectroscopic data is more difficult is that the observed lines, and their intensities and FWHMs on top of the continuum, are tied to atomic databases and to latent variables/hyperparameters (distance, rotation, absorption, column density, temperature, metallicity, types, system properties, etc.). Frequently it becomes a very challenging mixture problem to separate lines from each other and from the continuum (boundary and identifiability issues). This complexity appears only in astronomical spectroscopic data because we only get indirect or uncontrolled data ruled by physics, as opposed to the meat species spectra in the paper. Such spectroscopic data outside astronomy are rather smooth, are observed in a controlled wavelength range, and need no correction for recession/radial velocity/redshift/extinction/lensing/etc.</p>
<p>Although the part most relevant to astronomers, i.e. spectroscopic data processing, is not discussed in this paper, the most important part, the application of statistical learning to complex curves (spectral data), is well described. Some astronomers with appropriate data may want to try the variable selection strategy and check out the classification methods in statistics. If it works out, it might save space for storing spectral data and time for collecting high-resolution spectra. Please keep in mind that it is not necessary to use the same variable selection strategy. Astronomers can create better-working versions for classification and clustering purposes, like hardness ratios, which are often used to reduce the dimensionality of spectral data since low total count spectra are not informative over the full energy (wavelength) range. Curse of dimensionality!</p>
]]></content:encoded>
</item>
<item>
<title><![CDATA[The redundancy of view-redundancy for co-training]]></title>
<link>http://mlstat.wordpress.com/2009/08/23/the-redundancy-of-view-redundancy-for-co-training/</link>
<pubDate>Sun, 23 Aug 2009 22:08:38 +0000</pubDate>
<dc:creator>mlstat</dc:creator>
<guid>http://mlstat.wordpress.com/2009/08/23/the-redundancy-of-view-redundancy-for-co-training/</guid>
<description><![CDATA[Blum and Mitchell&#8217;s co-training is a (very deservedly) popular semi-supervised learning algori]]></description>
<content:encoded><![CDATA[<p><a href="http://www.cs.cmu.edu/~avrim/Papers/cotrain.pdf" target="_blank">Blum and Mitchell&#8217;s </a><em><a href="http://www.cs.cmu.edu/~avrim/Papers/cotrain.pdf" target="_blank">co-training</a></em> is a (very deservedly) popular semi-supervised learning algorithm that relies on class-conditional feature independence, and view-redundancy (or view-agreement) for semi-supervised learning.</p>
<p>I will argue that the view-redundancy assumption is unnecessary, and along the way show how surrogate learning can be plugged into co-training  (which is not all that surprising considering that both are multi-view semi-sup algorithms that rely on class-conditional view-independence).</p>
<p>I&#8217;ll first explain co-training with an example.</p>
<p><strong> Co-training &#8211; The setup</strong></p>
<p>Consider a <img src='http://s0.wp.com/latex.php?latex=y+%5Cin+%5C%7B0%2C1%5C%7D&amp;bg=ffffff&amp;fg=000&amp;s=0' alt='y &#92;in &#92;{0,1&#92;}' title='y &#92;in &#92;{0,1&#92;}' class='latex' /> classification problem on the feature space <img src='http://s0.wp.com/latex.php?latex=%5Cmathcal%7BX%7D%3D%5Cmathcal%7BX%7D_1+%5Ctimes+%5Cmathcal%7BX%7D_2&amp;bg=ffffff&amp;fg=000&amp;s=0' alt='&#92;mathcal{X}=&#92;mathcal{X}_1 &#92;times &#92;mathcal{X}_2' title='&#92;mathcal{X}=&#92;mathcal{X}_1 &#92;times &#92;mathcal{X}_2' class='latex' />. I.e., a feature vector <img src='http://s0.wp.com/latex.php?latex=x&amp;bg=ffffff&amp;fg=000&amp;s=0' alt='x' title='x' class='latex' /> can be split into two as <img src='http://s0.wp.com/latex.php?latex=x+%3D+%5Bx_1%2C+x_2%5D&amp;bg=ffffff&amp;fg=000&amp;s=0' alt='x = [x_1, x_2]' title='x = [x_1, x_2]' class='latex' />.</p>
<p>We make the rather restrictive assumption that <img src='http://s0.wp.com/latex.php?latex=x_1&amp;bg=ffffff&amp;fg=000&amp;s=0' alt='x_1' title='x_1' class='latex' /> and <img src='http://s0.wp.com/latex.php?latex=x_2&amp;bg=ffffff&amp;fg=000&amp;s=0' alt='x_2' title='x_2' class='latex' /> are class-conditionally independent for both classes. I.e., <img src='http://s0.wp.com/latex.php?latex=P%28x_1%2C+x_2%26%23124%3By%29+%3D+P%28x_1%26%23124%3By%29+P%28x_2%26%23124%3By%29&amp;bg=ffffff&amp;fg=000&amp;s=0' alt='P(x_1, x_2&#124;y) = P(x_1&#124;y) P(x_2&#124;y)' title='P(x_1, x_2&#124;y) = P(x_1&#124;y) P(x_2&#124;y)' class='latex' /> for <img src='http://s0.wp.com/latex.php?latex=y+%5Cin+%5C%7B0%2C1%5C%7D&amp;bg=ffffff&amp;fg=000&amp;s=0' alt='y &#92;in &#92;{0,1&#92;}' title='y &#92;in &#92;{0,1&#92;}' class='latex' />.</p>
<p>(Note that unlike <a href="http://mlstat.wordpress.com/2009/08/07/surrogate-learning-with-mean-independence/">surrogate learning with mean-independence</a>, both <img src='http://s0.wp.com/latex.php?latex=%5Cmathcal%7BX%7D_1&amp;bg=ffffff&amp;fg=000&amp;s=0' alt='&#92;mathcal{X}_1' title='&#92;mathcal{X}_1' class='latex' />  and <img src='http://s0.wp.com/latex.php?latex=%5Cmathcal%7BX%7D_2&amp;bg=ffffff&amp;fg=000&amp;s=0' alt='&#92;mathcal{X}_2' title='&#92;mathcal{X}_2' class='latex' /> are allowed to be multi-dimensional.)</p>
<p>Co-training makes an additional assumption that either view is sufficient for classification. This <em>view-redundancy</em> assumption basically states that the probability mass in the region of the feature space, where the Bayes optimal classifiers on the two views disagree with each other, is zero.</p>
<p>(The original co-training paper actually relaxes this assumption in the epilogue, but it is unnecessary to begin with, and the assumption has proliferated in later manifestations of co-training.)</p>
<p>We are given some labeled data (or a weak classifier on one of the views) and a large supply of unlabeled data. We are now ready to proceed with co-training to construct a Bayes optimal classifier.</p>
<p><strong>Co-training &#8211; The algorithm</strong></p>
<p>The algorithm is very simple. We use our weak classifier, say <img src='http://s0.wp.com/latex.php?latex=h_1%28x_1%29&amp;bg=ffffff&amp;fg=000&amp;s=0' alt='h_1(x_1)' title='h_1(x_1)' class='latex' />, (which we were given, or which we constructed using the measly labeled data) on the one view (<img src='http://s0.wp.com/latex.php?latex=x_1&amp;bg=ffffff&amp;fg=000&amp;s=0' alt='x_1' title='x_1' class='latex' />) to classify all the unlabeled data.  We select the examples classified with high confidence, and use these as labeled examples (using the labels assigned by the weak classifier) to train a classifier <img src='http://s0.wp.com/latex.php?latex=h_2%28x_2%29&amp;bg=ffffff&amp;fg=000&amp;s=0' alt='h_2(x_2)' title='h_2(x_2)' class='latex' /> on the other view (<img src='http://s0.wp.com/latex.php?latex=x_2&amp;bg=ffffff&amp;fg=000&amp;s=0' alt='x_2' title='x_2' class='latex' />).</p>
<p>We now classify the unlabeled data with <img src='http://s0.wp.com/latex.php?latex=h_2%28x_2%29&amp;bg=ffffff&amp;fg=000&amp;s=0' alt='h_2(x_2)' title='h_2(x_2)' class='latex' /> to similarly generate labeled data to retrain <img src='http://s0.wp.com/latex.php?latex=h_1%28x_1%29&amp;bg=ffffff&amp;fg=000&amp;s=0' alt='h_1(x_1)' title='h_1(x_1)' class='latex' />. This back-and-forth procedure is repeated until exhaustion.</p>
<p>Under the above assumptions (and with &#8220;sufficient&#8221; unlabeled data) <img src='http://s0.wp.com/latex.php?latex=h_1&amp;bg=ffffff&amp;fg=000&amp;s=0' alt='h_1' title='h_1' class='latex' /> and <img src='http://s0.wp.com/latex.php?latex=h_2&amp;bg=ffffff&amp;fg=000&amp;s=0' alt='h_2' title='h_2' class='latex' /> converge to the Bayes optimal classifiers on the respective feature views. Since either view is enough for classification, we just pick one of the classifiers and release it into the wild.</p>
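<p>The back-and-forth above can be sketched in a few lines. This is only a toy illustration under stated assumptions, not the algorithm from the original paper: the data are two synthetic 1-dimensional Gaussian views, each classifier is a simple midpoint threshold, &#8220;high confidence&#8221; means &#8220;farther than a fixed margin from the boundary&#8221;, and the loop runs for a fixed number of rounds. All function names and parameter values are made up for the example.</p>

```python
import random
from statistics import mean


def make_data(n, seed=0):
    """Two 1-D views that are independent given the class (toy data, an assumption)."""
    rng = random.Random(seed)
    y = [rng.randint(0, 1) for _ in range(n)]
    x1 = [rng.gauss(4.0 * label, 1.0) for label in y]
    x2 = [rng.gauss(4.0 * label, 1.0) for label in y]
    return x1, x2, y


def fit_threshold(values, labels):
    """Stand-in classifier: midpoint between the two class means."""
    mean0 = mean(v for v, lab in zip(values, labels) if lab == 0)
    mean1 = mean(v for v, lab in zip(values, labels) if lab == 1)
    return (mean0 + mean1) / 2.0


def predict(threshold, value):
    """Class 1 lies above the threshold in this toy setup."""
    return 1 if value > threshold else 0


def pseudo_label(threshold, view, margin):
    """Indices and predicted labels of examples far enough from the boundary."""
    keep = [i for i, v in enumerate(view) if abs(v - threshold) > margin]
    return keep, [predict(threshold, view[i]) for i in keep]


def co_train(x1, x2, labeled, rounds=5, margin=1.0):
    """labeled maps example index to its known label; returns thresholds (t1, t2)."""
    idx = sorted(labeled)
    t1 = fit_threshold([x1[i] for i in idx], [labeled[i] for i in idx])
    t2 = None
    for _ in range(rounds):  # fixed round count instead of "until exhaustion"
        keep, labels = pseudo_label(t1, x1, margin)        # h1 labels confident examples
        t2 = fit_threshold([x2[i] for i in keep], labels)  # which are used to train h2
        keep, labels = pseudo_label(t2, x2, margin)        # h2 returns the favour
        t1 = fit_threshold([x1[i] for i in keep], labels)
    return t1, t2


def accuracy(threshold, values, labels):
    hits = sum(1 for v, lab in zip(values, labels) if predict(threshold, v) == lab)
    return hits / len(labels)


x1, x2, y = make_data(2000)
zeros = [i for i, lab in enumerate(y) if lab == 0][:3]  # three labeled examples per class
ones = [i for i, lab in enumerate(y) if lab == 1][:3]
t1, t2 = co_train(x1, x2, {i: y[i] for i in zeros + ones})
print(f"h1 accuracy: {accuracy(t1, x1, y):.3f}, h2 accuracy: {accuracy(t2, x2, y):.3f}")
```

<p>Only six labels are used to seed the first classifier; everything else comes from the unlabeled pool. The true labels are consulted again only at the end, to score the two classifiers.</p>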
<p><strong>Co-training &#8211; Why does it work?</strong></p>
<p>I&#8217;ll try to present an intuitive explanation of co-training using the example depicted in the following figure. Please focus on it intently.</p>
<p><img class="alignnone size-full wp-image-232" title="co-training" src="http://mlstat.files.wordpress.com/2009/08/co-training1.png?w=450&#038;h=299" alt="co-training" width="450" height="299" /></p>
<p>The feature vector <img src='http://s0.wp.com/latex.php?latex=x&amp;bg=ffffff&amp;fg=000&amp;s=0' alt='x' title='x' class='latex' /> in the example is 2-dimensional and both views <img src='http://s0.wp.com/latex.php?latex=x_1&amp;bg=ffffff&amp;fg=000&amp;s=0' alt='x_1' title='x_1' class='latex' /> and <img src='http://s0.wp.com/latex.php?latex=x_2&amp;bg=ffffff&amp;fg=000&amp;s=0' alt='x_2' title='x_2' class='latex' /> are 1-dimensional. The class-conditional distributions are uncorrelated and jointly Gaussian (which implies independence) and are depicted by their equiprobability contours in the figure. The marginal class-conditional distributions are shown along the two axes. Class <img src='http://s0.wp.com/latex.php?latex=y%3D0&amp;bg=ffffff&amp;fg=000&amp;s=0' alt='y=0' title='y=0' class='latex' /> is shown in red and class <img src='http://s0.wp.com/latex.php?latex=y%3D1&amp;bg=ffffff&amp;fg=000&amp;s=0' alt='y=1' title='y=1' class='latex' /> is shown in blue. The picture also shows some unlabeled examples.</p>
<p>Assume we have a weak classifier <img src='http://s0.wp.com/latex.php?latex=h_1%28x_1%29&amp;bg=ffffff&amp;fg=000&amp;s=0' alt='h_1(x_1)' title='h_1(x_1)' class='latex' /> on the first view. If we extend the classification boundary for this classifier to the entire space <img src='http://s0.wp.com/latex.php?latex=x&amp;bg=ffffff&amp;fg=000&amp;s=0' alt='x' title='x' class='latex' />, the boundary necessarily consists of lines parallel to the <img src='http://s0.wp.com/latex.php?latex=x_2&amp;bg=ffffff&amp;fg=000&amp;s=0' alt='x_2' title='x_2' class='latex' /> axis. Let&#8217;s say there is only one such line and all the examples below that line are assigned class <img src='http://s0.wp.com/latex.php?latex=y%3D1&amp;bg=ffffff&amp;fg=000&amp;s=0' alt='y=1' title='y=1' class='latex' /> and all the examples above are assigned class <img src='http://s0.wp.com/latex.php?latex=y%3D0&amp;bg=ffffff&amp;fg=000&amp;s=0' alt='y=0' title='y=0' class='latex' />.</p>
<p>We now ignore all the examples close to the classification boundary of <img src='http://s0.wp.com/latex.php?latex=h_1&amp;bg=ffffff&amp;fg=000&amp;s=0' alt='h_1' title='h_1' class='latex' /> (i.e., all the examples in the grey band) and project the rest of the points onto the <img src='http://s0.wp.com/latex.php?latex=x_2&amp;bg=ffffff&amp;fg=000&amp;s=0' alt='x_2' title='x_2' class='latex' /> axis.</p>
<p>How will these projected points be distributed along <img src='http://s0.wp.com/latex.php?latex=x_2&amp;bg=ffffff&amp;fg=000&amp;s=0' alt='x_2' title='x_2' class='latex' />?</p>
<p>Since the examples that were ignored (in the grey band) were selected based on their <img src='http://s0.wp.com/latex.php?latex=x_1&amp;bg=ffffff&amp;fg=000&amp;s=0' alt='x_1' title='x_1' class='latex' /> values, owing to class-conditional independence, the marginal distribution along <img src='http://s0.wp.com/latex.php?latex=x_2&amp;bg=ffffff&amp;fg=000&amp;s=0' alt='x_2' title='x_2' class='latex' /> for either class will be <em>exactly</em> the same as if none of the samples were ignored. This is the key reason for the conditional-independence assumption.</p>
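<p>This claim is easy to check numerically. The sketch below is a made-up simulation for a single class: it draws the two views independently, throws away everything inside a hypothetical grey band defined on the first view only, and compares the averages before and after. The boundary position, band width and sample size are arbitrary choices for illustration.</p>

```python
import random
from statistics import mean

rng = random.Random(1)
# One class only: x1 and x2 are drawn independently of each other,
# which is what class-conditional independence means for this class.
points = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(20000)]

boundary, half_width = 1.0, 0.75  # hypothetical h1 boundary and grey-band half-width
kept = [(a, b) for a, b in points if abs(a - boundary) > half_width]

mean_x1_all = mean(a for a, _ in points)
mean_x1_kept = mean(a for a, _ in kept)
mean_x2_all = mean(b for _, b in points)
mean_x2_kept = mean(b for _, b in kept)

print(f"x1 mean: all={mean_x1_all:.3f} kept={mean_x1_kept:.3f}")
print(f"x2 mean: all={mean_x2_all:.3f} kept={mean_x2_kept:.3f}")
```

<p>The selection visibly distorts the distribution of the first view, since that is the view the band was defined on, while the average of the second view stays essentially where it was.</p>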
<p>The procedure has two subtle, but largely innocuous, consequences.</p>
<p>First, since we don&#8217;t know how many class <img src='http://s0.wp.com/latex.php?latex=0&amp;bg=ffffff&amp;fg=000&amp;s=0' alt='0' title='0' class='latex' /> and class <img src='http://s0.wp.com/latex.php?latex=1&amp;bg=ffffff&amp;fg=000&amp;s=0' alt='1' title='1' class='latex' /> examples are in the grey band, the relative ratio of the examples of the two classes in the not-ignored set may not be the same as in the original full unlabeled sample set. If the class priors <img src='http://s0.wp.com/latex.php?latex=P%28y%29&amp;bg=ffffff&amp;fg=000&amp;s=0' alt='P(y)' title='P(y)' class='latex' /> are known, this can easily be corrected for when we learn <img src='http://s0.wp.com/latex.php?latex=h_2%28x_2%29&amp;bg=ffffff&amp;fg=000&amp;s=0' alt='h_2(x_2)' title='h_2(x_2)' class='latex' />. If the class priors are unknown, other assumptions on <img src='http://s0.wp.com/latex.php?latex=h_1%28x_1%29&amp;bg=ffffff&amp;fg=000&amp;s=0' alt='h_1(x_1)' title='h_1(x_1)' class='latex' /> are necessary.</p>
<p>Second, when we project the unlabeled examples on to <img src='http://s0.wp.com/latex.php?latex=x_2&amp;bg=ffffff&amp;fg=000&amp;s=0' alt='x_2' title='x_2' class='latex' /> we assign them the labels given to them by <img src='http://s0.wp.com/latex.php?latex=h_1&amp;bg=ffffff&amp;fg=000&amp;s=0' alt='h_1' title='h_1' class='latex' /> which can be erroneous. In the figure above, there will be examples in the region indicated by A that are actually class <img src='http://s0.wp.com/latex.php?latex=1&amp;bg=ffffff&amp;fg=000&amp;s=0' alt='1' title='1' class='latex' /> but have been assigned class <img src='http://s0.wp.com/latex.php?latex=0&amp;bg=ffffff&amp;fg=000&amp;s=0' alt='0' title='0' class='latex' />, and examples in region B that were from class <img src='http://s0.wp.com/latex.php?latex=0&amp;bg=ffffff&amp;fg=000&amp;s=0' alt='0' title='0' class='latex' /> but were called class <img src='http://s0.wp.com/latex.php?latex=1&amp;bg=ffffff&amp;fg=000&amp;s=0' alt='1' title='1' class='latex' />.</p>
<p>Again, because of the class-conditional independence assumption, these erroneously labeled examples will be distributed according to the marginal class-conditional <img src='http://s0.wp.com/latex.php?latex=x_2&amp;bg=ffffff&amp;fg=000&amp;s=0' alt='x_2' title='x_2' class='latex' /> distributions. I.e., in the figure above we imagine, along the <img src='http://s0.wp.com/latex.php?latex=x_2&amp;bg=ffffff&amp;fg=000&amp;s=0' alt='x_2' title='x_2' class='latex' /> axis, a very low amplitude blue distribution with the same shape and location as the red distribution, and a very low amplitude red distribution with the same shape and location as the blue distribution. (Note: this is the <img src='http://s0.wp.com/latex.php?latex=%28%5Calpha%2C+%5Cbeta%29&amp;bg=ffffff&amp;fg=000&amp;s=0' alt='(&#92;alpha, &#92;beta)' title='(&#92;alpha, &#92;beta)' class='latex' /> noise in the original co-training paper.)</p>
<p>This amounts to having a labeled training set with label errors, but with the errors being generated <em>independently</em> of the location in the space. That is, the number of errors in a region of the space is proportional to the number of examples in that region. These proportionally distributed errors are then washed out by the correctly labeled examples when we learn <img src='http://s0.wp.com/latex.php?latex=h_2&amp;bg=ffffff&amp;fg=000&amp;s=0' alt='h_2' title='h_2' class='latex' />.</p>
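<p>The washing-out effect can be checked with a small simulation. The sketch below is only an illustration under assumptions that are not in the post: a single Gaussian feature per class, equal class priors, symmetric flip rates, and a deliberately crude learner that places its threshold midway between the two noisy class means. The function name and the roles given to <code>alpha</code> and <code>beta</code> are hypothetical.</p>

```python
import random


def noisy_label_threshold(n=20000, alpha=0.2, beta=0.2, seed=0):
    """Learn a 1-D threshold on x_2 from labels flipped independently of x_2.

    alpha: probability that a true class-0 example is labeled 1 (assumed role)
    beta:  probability that a true class-1 example is labeled 0 (assumed role)
    """
    rng = random.Random(seed)
    xs, true_labels, noisy_labels = [], [], []
    for _ in range(n):
        y = rng.randint(0, 1)                        # equal class priors
        x = rng.gauss(1.0 if y == 1 else -1.0, 1.0)  # class-conditional x_2
        flip_rate = beta if y == 1 else alpha
        flipped = flip_rate > rng.random()           # never looks at x
        xs.append(x)
        true_labels.append(y)
        noisy_labels.append(1 - y if flipped else y)

    # "Training": put the threshold midway between the noisy class means.
    mean0 = sum(x for x, t in zip(xs, noisy_labels) if t == 0) / noisy_labels.count(0)
    mean1 = sum(x for x, t in zip(xs, noisy_labels) if t == 1) / noisy_labels.count(1)
    threshold = (mean0 + mean1) / 2.0

    # Evaluate against the true labels, which the training step never saw.
    correct = sum((x > threshold) == (y == 1) for x, y in zip(xs, true_labels))
    return threshold, correct / n


threshold, accuracy = noisy_label_threshold()
print(threshold, accuracy)
```

<p>In this toy setting the learned threshold should land close to zero, which is where a learner with clean labels would put it, even though a fifth of the training labels are wrong.</p>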
<p>To recap, co-training works because of the following fact. Starting from a weak classifier <img src='http://s0.wp.com/latex.php?latex=h_1&amp;bg=ffffff&amp;fg=000&amp;s=0' alt='h_1' title='h_1' class='latex' /> on <img src='http://s0.wp.com/latex.php?latex=x_1&amp;bg=ffffff&amp;fg=000&amp;s=0' alt='x_1' title='x_1' class='latex' />, we can generate very accurate and <em>unbiased</em> training data to train a classifier on <img src='http://s0.wp.com/latex.php?latex=x_2&amp;bg=ffffff&amp;fg=000&amp;s=0' alt='x_2' title='x_2' class='latex' />.</p>
<p><strong>No need for view-redundancy</strong></p>
<p>Notice that, in the above example, we made no appeal to any kind of view-redundancy (other than whatever we may get gratis from the independence assumption).</p>
<p>The vigilant reader may however level the following two objections against the above argument-by-example.</p>
<p>1. We build <img src='http://s0.wp.com/latex.php?latex=h_1%28x_1%29&amp;bg=ffffff&amp;fg=000&amp;s=0' alt='h_1(x_1)' title='h_1(x_1)' class='latex' /> and <img src='http://s0.wp.com/latex.php?latex=h_2%28x_2%29&amp;bg=ffffff&amp;fg=000&amp;s=0' alt='h_2(x_2)' title='h_2(x_2)' class='latex' /> separately. So when the training is done, without view redundancy, we have not shown a way to pick from the two to apply to new test data.</p>
<p>2. At every iteration we need to select unlabeled samples that were classified with <em>high</em> confidence by <img src='http://s0.wp.com/latex.php?latex=h_1&amp;bg=ffffff&amp;fg=000&amp;s=0' alt='h_1' title='h_1' class='latex' /> to feed to the trainer for <img src='http://s0.wp.com/latex.php?latex=h_2&amp;bg=ffffff&amp;fg=000&amp;s=0' alt='h_2' title='h_2' class='latex' />. Without view-redundancy, maybe <em>none</em> of the samples will be classified with high confidence.</p>
<p>The first objection is easy to respond to. We pick neither <img src='http://s0.wp.com/latex.php?latex=h_1&amp;bg=ffffff&amp;fg=000&amp;s=0' alt='h_1' title='h_1' class='latex' /> nor <img src='http://s0.wp.com/latex.php?latex=h_2&amp;bg=ffffff&amp;fg=000&amp;s=0' alt='h_2' title='h_2' class='latex' /> for new test data. Instead we combine them to obtain a classifier <img src='http://s0.wp.com/latex.php?latex=h%28x_1%2Cx_2%29&amp;bg=ffffff&amp;fg=000&amp;s=0' alt='h(x_1,x_2)' title='h(x_1,x_2)' class='latex' />. This is well justified because, under class-conditional independence, <img src='http://s0.wp.com/latex.php?latex=P%28y%26%23124%3Bx_1%2Cx_2%29+%5Cpropto+P%28y%26%23124%3Bx_1%29+P%28y%26%23124%3Bx_2%29+%2F+P%28y%29&amp;bg=ffffff&amp;fg=000&amp;s=0' alt='P(y&#124;x_1,x_2) &#92;propto P(y&#124;x_1) P(y&#124;x_2) / P(y)' title='P(y&#124;x_1,x_2) &#92;propto P(y&#124;x_1) P(y&#124;x_2) / P(y)' class='latex' /> (the division by the class prior can be dropped when the two classes are equally likely).</p>
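<p>As a minimal sketch of the combination step for two classes: by Bayes&#8217; rule and class-conditional independence, the joint posterior is proportional to the product of the two single-view posteriors divided by the class prior, and normalising over the two classes gives a probability. The function name <code>combine_posteriors</code> and its arguments are hypothetical, chosen only for illustration.</p>

```python
def combine_posteriors(p1, p2, prior):
    """Combine two single-view posteriors for class 1 into a joint posterior.

    p1:    P(y=1 | x_1), e.g. the output of h_1
    p2:    P(y=1 | x_2), e.g. the output of h_2
    prior: P(y=1)
    Assumes x_1 and x_2 are class-conditionally independent.
    """
    # Unnormalised scores P(y|x_1) * P(y|x_2) / P(y), one per class.
    score1 = p1 * p2 / prior
    score0 = (1.0 - p1) * (1.0 - p2) / (1.0 - prior)
    return score1 / (score1 + score0)


# Two views that each mildly favour class 1 reinforce one another.
print(combine_posteriors(0.8, 0.8, 0.5))
```

<p>A useful sanity check of the formula is that an uninformative view, one whose posterior equals the prior, leaves the other view&#8217;s posterior unchanged.</p>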
<p>We react to the second objection by dropping the requirement of classifying with high-confidence altogether.</p>
<p><strong>Dropping the high-confidence requirement by surrogate learning</strong></p>
<p>Instead of training <img src='http://s0.wp.com/latex.php?latex=h_2%28x_2%29&amp;bg=ffffff&amp;fg=000&amp;s=0' alt='h_2(x_2)' title='h_2(x_2)' class='latex' /> with examples that are classified with high confidence by <img src='http://s0.wp.com/latex.php?latex=h_1%28x_1%29&amp;bg=ffffff&amp;fg=000&amp;s=0' alt='h_1(x_1)' title='h_1(x_1)' class='latex' />, we train <img src='http://s0.wp.com/latex.php?latex=h_2%28x_2%29&amp;bg=ffffff&amp;fg=000&amp;s=0' alt='h_2(x_2)' title='h_2(x_2)' class='latex' /> with all the examples (using the scores assigned to them by <img src='http://s0.wp.com/latex.php?latex=h_1%28x_1%29&amp;bg=ffffff&amp;fg=000&amp;s=0' alt='h_1(x_1)' title='h_1(x_1)' class='latex' />).</p>
<p>At some iteration of co-training, define the random variable <img src='http://s0.wp.com/latex.php?latex=z_1+%3D+h_1%28x_1%29&amp;bg=ffffff&amp;fg=000&amp;s=0' alt='z_1 = h_1(x_1)' title='z_1 = h_1(x_1)' class='latex' />. Since <img src='http://s0.wp.com/latex.php?latex=x_1&amp;bg=ffffff&amp;fg=000&amp;s=0' alt='x_1' title='x_1' class='latex' /> and <img src='http://s0.wp.com/latex.php?latex=x_2&amp;bg=ffffff&amp;fg=000&amp;s=0' alt='x_2' title='x_2' class='latex' /> are class-conditionally independent, <img src='http://s0.wp.com/latex.php?latex=z_1&amp;bg=ffffff&amp;fg=000&amp;s=0' alt='z_1' title='z_1' class='latex' /> and <img src='http://s0.wp.com/latex.php?latex=x_2&amp;bg=ffffff&amp;fg=000&amp;s=0' alt='x_2' title='x_2' class='latex' /> are also class-conditionally independent. In particular <img src='http://s0.wp.com/latex.php?latex=z_1&amp;bg=ffffff&amp;fg=000&amp;s=0' alt='z_1' title='z_1' class='latex' />  is class-conditionally <em>mean-independent</em> of <img src='http://s0.wp.com/latex.php?latex=x_2&amp;bg=ffffff&amp;fg=000&amp;s=0' alt='x_2' title='x_2' class='latex' />. Furthermore if <img src='http://s0.wp.com/latex.php?latex=h_1&amp;bg=ffffff&amp;fg=000&amp;s=0' alt='h_1' title='h_1' class='latex' /> is even a weakly useful classifier, barring pathologies, it will satisfy <img src='http://s0.wp.com/latex.php?latex=E%5Bz_1%26%23124%3By%3D0%5D+%5Cneq+E%5Bz_1%26%23124%3By%3D1%5D&amp;bg=ffffff&amp;fg=000&amp;s=0' alt='E[z_1&#124;y=0] &#92;neq E[z_1&#124;y=1]' title='E[z_1&#124;y=0] &#92;neq E[z_1&#124;y=1]' class='latex' />.</p>
<p>We can therefore apply <a href="http://mlstat.wordpress.com/2009/08/07/surrogate-learning-with-mean-independence/" target="_self">surrogate learning under mean-independence</a> to learn the classifier on <img src='http://s0.wp.com/latex.php?latex=x_2&amp;bg=ffffff&amp;fg=000&amp;s=0' alt='x_2' title='x_2' class='latex' />. (This is essentially the same idea as <a href="http://www.informatik.uni-freiburg.de/cgnm/lehre/pm-05s/bib/multi-view/Nigam2000-effectiveness-and-applicability-of-cotraining.pdf" target="_blank">Co-EM</a>, which was introduced without much theoretical justification.)</p>
<p><strong>Discussion</strong></p>
<p>Hopefully the above argument has convinced the reader that the class-conditional view independence assumption obviates the view-redundancy requirement.</p>
<p>A natural question to ask is whether the reverse is true. That is, if we are given view-redundancy, can we completely eliminate the requirement of class-conditional independence? We can immediately see that the answer is no.</p>
<p>For example, we can duplicate all the features for any classification problem so that view-redundancy holds trivially between the two replicates. Moreover, the second replicate will be statistically fully dependent on the first.</p>
<p>Now if we are given a weak classifier on the first view (or replicate) and try to use its predictions on an unlabeled data set to obtain training data for the second, it would be equivalent to feeding back the predictions of a classifier to retrain itself (because the two views are duplicates of one another).</p>
<p>This type of procedure (which is <a href="http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.77.6032" target="_blank">an idea decades old</a>) has been called, among other things, self-learning, self-correction, self-training and decision-directed adaptation. The problem with these approaches is that the training set so generated is <em>biased</em> and other assumptions are necessary for the feedback procedure to improve over the original classifier.</p>
<p>Of course this does not mean that the complete statistical independence assumption cannot be relaxed. The above argument only shows that at least <em>some amount</em> of independence is necessary.</p>
]]></content:encoded>
</item>
<item>
<title><![CDATA[A surrogate learning mystery]]></title>
<link>http://mlstat.wordpress.com/2009/08/11/a-surrogate-learning-mystery/</link>
<pubDate>Wed, 12 Aug 2009 02:03:23 +0000</pubDate>
<dc:creator>mlstat</dc:creator>
<guid>http://mlstat.wordpress.com/2009/08/11/a-surrogate-learning-mystery/</guid>
<description><![CDATA[I&#8217;ll present an application of the surrogate learning idea in the previous post. It is mildly]]></description>
<content:encoded><![CDATA[<p>I&#8217;ll present an application of the surrogate learning idea in the <a href="http://mlstat.wordpress.com/2009/08/07/surrogate-learning-with-mean-independence/" target="_self">previous post</a>. It is mildly surprising at first blush, which I&#8217;ll contrive to make more mysterious.</p>
<p><strong>Murders she induced</strong></p>
<p>For readers who appreciate this sort of thing, here&#8217;s the grandmother of all murder mysteries. In this one Miss Marple solves a whole bunch of unrelated murders all at once.</p>
<p>Miss Marple, having finally grown wise to the unreliability of feminine intuition in solving murder mysteries, spent some time learning statistics. She then convinced one of her flat-footed friends at Scotland Yard to give her the files on all the unsolved murders that were just sitting around gathering dust, waiting for someone with her imagination and statistical wiles.</p>
<p>She came home with the massive pile of papers and sat down to study them.</p>
<p>The file on each murder carefully listed all the suspects, with their possible motives, their accessibility to the murder weapon, psychological characteristics, previous convictions and many other details. There was a large variety in the number of suspects for different murders.</p>
<p>Furthermore, for every case the investigating officer made a note that the murderer was very likely to be part of the suspect pool. Only, it was not possible to narrow down the pool to one, and none of the suspects confessed.</p>
<p>Thinking statistically, Miss Marple decided to encode the characteristics of each suspect by a feature vector. Drawing on her vast experience in these matters she assigned numeric values to features like motives, psychology, relationship to the victim and access to the weapon.</p>
<p>All that was left was to build a statistical model that predicted the probability of a suspect being the murderer from his feature vector. She would then rank all the suspects of a case by this probability, and have Scotland Yard try to get the most likely suspect to confess.</p>
<p>At this point Miss Marple paused for a moment to rue her oversight in not asking for the files on the solved murders as well. She could have used the suspects (and the known murderer) from the solved murders as labeled examples to train her statistical model. However, she soon realized &#8230;</p>
<p>I know you are tense with anticipation of the inevitable twist in the tale. Here it is.</p>
<p><em>&#8230; she soon realized that she could build her model by training a regressor to predict, from the feature vector of a suspect, the size of the suspect pool he belongs to.</em></p>
<p><strong>The clues</strong></p>
<p>Let&#8217;s re-examine the pertinent facts &#8212; 1. every suspect pool contains the murderer, 2.  the suspect pools are of varying sizes.</p>
<p>The one, perhaps controversial, assumption we&#8217;ll need to make is that the feature vectors for both the murderers and the non-murderers are mean-independent of the size of the suspect pool they come from. (However, if it is good enough for Miss Marple, it is good enough for me.)</p>
<p>In any case, the violation of this assumption would imply that, when whether or not a suspect is a murderer is known, knowing his feature vector allows us to better predict the size of the suspect pool he belongs to. Because this is far-fetched, we can believe that the assumption is true.</p>
<p>In fact it may seem that knowing whether or not a suspect is a murderer itself adds nothing to the predictability of the size of the pool from which the suspect is drawn. We will however show that the mean pool sizes for the murderers and the non-murderers are different.</p>
<p>Let us now see why Miss Marple&#8217;s method works.</p>
<p><strong>Marple math</strong></p>
<p>Let the suspect pools be denoted <img src='http://s0.wp.com/latex.php?latex=%5C%7BS_1%2C+S_2%2C%5Cldots%2CS_k%5C%7D&amp;bg=ffffff&amp;fg=000&amp;s=0' alt='&#92;{S_1, S_2,&#92;ldots,S_k&#92;}' title='&#92;{S_1, S_2,&#92;ldots,S_k&#92;}' class='latex' /> with sizes <img src='http://s0.wp.com/latex.php?latex=%26%23124%3BS_i%26%23124%3B+%3D+s_i&amp;bg=ffffff&amp;fg=000&amp;s=0' alt='&#124;S_i&#124; = s_i' title='&#124;S_i&#124; = s_i' class='latex' />. Let each suspect be described by the feature vector denoted <img src='http://s0.wp.com/latex.php?latex=x_2&amp;bg=ffffff&amp;fg=000&amp;s=0' alt='x_2' title='x_2' class='latex' />. We now define the class label <img src='http://s0.wp.com/latex.php?latex=y%3D1&amp;bg=ffffff&amp;fg=000&amp;s=0' alt='y=1' title='y=1' class='latex' /> if a particular suspect is a murderer and <img src='http://s0.wp.com/latex.php?latex=y%3D0&amp;bg=ffffff&amp;fg=000&amp;s=0' alt='y=0' title='y=0' class='latex' /> otherwise.</p>
<p>We append the feature <img src='http://s0.wp.com/latex.php?latex=x_1+%3D+-s_i&amp;bg=ffffff&amp;fg=000&amp;s=0' alt='x_1 = -s_i' title='x_1 = -s_i' class='latex' /> to the feature vector of a suspect from pool <img src='http://s0.wp.com/latex.php?latex=S_i&amp;bg=ffffff&amp;fg=000&amp;s=0' alt='S_i' title='S_i' class='latex' />. That is, the surrogate feature we use is the negative of the pool size. (The negative is for a technical reason to match the previous post. In practice I would recommend <img src='http://s0.wp.com/latex.php?latex=%5Cfrac%7B1%7D%7Bs_i%7D&amp;bg=ffffff&amp;fg=000&amp;s=0' alt='&#92;frac{1}{s_i}' title='&#92;frac{1}{s_i}' class='latex' />, but this makes the argument below a bit more involved.)</p>
<p>The first thing to note is that the assumption we made above translates to <img src='http://s0.wp.com/latex.php?latex=E%5Bx_1%26%23124%3By%2Cx_2%5D+%3D+E%5Bx_1%26%23124%3By%5D&amp;bg=ffffff&amp;fg=000&amp;s=0' alt='E[x_1&#124;y,x_2] = E[x_1&#124;y]' title='E[x_1&#124;y,x_2] = E[x_1&#124;y]' class='latex' /> for both <img src='http://s0.wp.com/latex.php?latex=y%3D0&amp;bg=ffffff&amp;fg=000&amp;s=0' alt='y=0' title='y=0' class='latex' /> and <img src='http://s0.wp.com/latex.php?latex=y%3D1&amp;bg=ffffff&amp;fg=000&amp;s=0' alt='y=1' title='y=1' class='latex' />.</p>
<p>All we need to show is that <img src='http://s0.wp.com/latex.php?latex=E%5Bx_1%26%23124%3By%3D1%5D+%26%2362%3B+E%5Bx_1%26%23124%3By%3D0%5D&amp;bg=ffffff&amp;fg=000&amp;s=0' alt='E[x_1&#124;y=1] &gt; E[x_1&#124;y=0]' title='E[x_1&#124;y=1] &gt; E[x_1&#124;y=0]' class='latex' /> and we are home free, because the argument in the previous post implies that <img src='http://s0.wp.com/latex.php?latex=E%5Bx_1%26%23124%3Bx_2%5D&amp;bg=ffffff&amp;fg=000&amp;s=0' alt='E[x_1&#124;x_2]' title='E[x_1&#124;x_2]' class='latex' /> can be used to rank suspects in a pool just as well as <img src='http://s0.wp.com/latex.php?latex=P%28y%26%23124%3Bx_2%29&amp;bg=ffffff&amp;fg=000&amp;s=0' alt='P(y&#124;x_2)' title='P(y&#124;x_2)' class='latex' />.</p>
<p>So all we need to do is to build a regressor to predict <img src='http://s0.wp.com/latex.php?latex=x_1+%3D+-s_i&amp;bg=ffffff&amp;fg=000&amp;s=0' alt='x_1 = -s_i' title='x_1 = -s_i' class='latex' /> from the feature vector <img src='http://s0.wp.com/latex.php?latex=x_2&amp;bg=ffffff&amp;fg=000&amp;s=0' alt='x_2' title='x_2' class='latex' /> and apply this regressor to the <img src='http://s0.wp.com/latex.php?latex=x_2&amp;bg=ffffff&amp;fg=000&amp;s=0' alt='x_2' title='x_2' class='latex' /> of all the suspects in each pool and rank them by its output.</p>
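<p>The whole procedure can be tried out on synthetic data. The sketch below is only an illustration under made-up assumptions: pool sizes are uniform between 2 and 10, each suspect has a single Gaussian feature whose mean is shifted for the murderer, and the regressor is one-dimensional least squares. The identity of the murderer is used only at the end, to score the ranking; the regressor never sees it. The function name is hypothetical.</p>

```python
import random


def simulate_marple(num_pools=5000, seed=0):
    """Rank suspects by a regressor trained to predict minus the pool size."""
    rng = random.Random(seed)
    pools = []
    for _ in range(num_pools):
        size = rng.randint(2, 10)
        murderer = rng.randrange(size)
        # One feature per suspect; its distribution given the class does not
        # depend on the pool size (the mean-independence assumption).
        feats = [rng.gauss(1.0 if j == murderer else 0.0, 1.0) for j in range(size)]
        pools.append((murderer, feats))

    # Unlabeled training pairs: feature x_2, surrogate target x_1 = -pool size.
    xs = [x for _, feats in pools for x in feats]
    ts = [-float(len(feats)) for _, feats in pools for _ in feats]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_t = sum(ts) / n
    slope = (sum((x - mean_x) * (t - mean_t) for x, t in zip(xs, ts))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_t - slope * mean_x

    # Rank each pool by the regressor output and check the top suspect.
    hits = 0
    for murderer, feats in pools:
        scores = [intercept + slope * x for x in feats]
        if scores.index(max(scores)) == murderer:
            hits += 1
    top1 = hits / num_pools
    chance = sum(1.0 / len(feats) for _, feats in pools) / num_pools
    return slope, top1, chance


slope, top1, chance = simulate_marple()
print(slope, top1, chance)
```

<p>With a single feature, all the regressor has to discover is the direction in which the feature points towards guilt, and it does so from pool sizes alone: the fitted slope should come out positive and the top-ranked suspect should be the murderer more often than a random pick would be.</p>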
<p>So why is <img src='http://s0.wp.com/latex.php?latex=E%5Bx_1%26%23124%3By%3D1%5D+%26%2362%3B+E%5Bx_1%26%23124%3By%3D0%5D&amp;bg=ffffff&amp;fg=000&amp;s=0' alt='E[x_1&#124;y=1] &gt; E[x_1&#124;y=0]' title='E[x_1&#124;y=1] &gt; E[x_1&#124;y=0]' class='latex' />?</p>
<p>First it is clear that <img src='http://s0.wp.com/latex.php?latex=E%5Bx_1%26%23124%3By%3D1%5D+%3D+-E%5Bs%5D&amp;bg=ffffff&amp;fg=000&amp;s=0' alt='E[x_1&#124;y=1] = -E[s]' title='E[x_1&#124;y=1] = -E[s]' class='latex' /> (where <img src='http://s0.wp.com/latex.php?latex=s&amp;bg=ffffff&amp;fg=000&amp;s=0' alt='s' title='s' class='latex' /> is the random variable describing the size of the suspect pool) because for each pool there is one murderer and we assign the negative of that pool size as <img src='http://s0.wp.com/latex.php?latex=x_1&amp;bg=ffffff&amp;fg=000&amp;s=0' alt='x_1' title='x_1' class='latex' /> for that murderer.</p>
<p>Now there are a total of <img src='http://s0.wp.com/latex.php?latex=s_1+-1+%2B+s_2+-1%2B%5Cldots%2Bs_k-1+%3D+%5Csum+s_i+-+k&amp;bg=ffffff&amp;fg=000&amp;s=0' alt='s_1 -1 + s_2 -1+&#92;ldots+s_k-1 = &#92;sum s_i - k' title='s_1 -1 + s_2 -1+&#92;ldots+s_k-1 = &#92;sum s_i - k' class='latex' /> non-murderers. For the non-murderers from the <img src='http://s0.wp.com/latex.php?latex=i%5E%7Bth%7D&amp;bg=ffffff&amp;fg=000&amp;s=0' alt='i^{th}' title='i^{th}' class='latex' /> pool the assigned <img src='http://s0.wp.com/latex.php?latex=x_1&amp;bg=ffffff&amp;fg=000&amp;s=0' alt='x_1' title='x_1' class='latex' /> is <img src='http://s0.wp.com/latex.php?latex=-s_i&amp;bg=ffffff&amp;fg=000&amp;s=0' alt='-s_i' title='-s_i' class='latex' />. Therefore the estimated expected value is <img src='http://s0.wp.com/latex.php?latex=-%5Cfrac%7B%5Csum+%28s_i-1%29.s_i%7D%7B%5Csum+s_i+-+k%7D&amp;bg=ffffff&amp;fg=000&amp;s=0' alt='-&#92;frac{&#92;sum (s_i-1).s_i}{&#92;sum s_i - k}' title='-&#92;frac{&#92;sum (s_i-1).s_i}{&#92;sum s_i - k}' class='latex' /></p>
<p>If we divide through by <img src='http://s0.wp.com/latex.php?latex=k&amp;bg=ffffff&amp;fg=000&amp;s=0' alt='k' title='k' class='latex' /> and let the number of pools go to infinity, the estimate converges to the true mean, which is</p>
<p><img src='http://s0.wp.com/latex.php?latex=E%5Bx_1%26%23124%3By%3D0%5D+%3D+-%5Cfrac%7BE%5Bs%5E2%5D+-+E%5Bs%5D%7D%7BE%5Bs%5D+-+1%7D&amp;bg=ffffff&amp;fg=000&amp;s=0' alt='E[x_1&#124;y=0] = -&#92;frac{E[s^2] - E[s]}{E[s] - 1}' title='E[x_1&#124;y=0] = -&#92;frac{E[s^2] - E[s]}{E[s] - 1}' class='latex' /></p>
<p>It is easy to show that this quantity is always lower than <img src='http://s0.wp.com/latex.php?latex=-E%5Bs%5D+%3D+E%5Bx_1%26%23124%3By%3D1%5D&amp;bg=ffffff&amp;fg=000&amp;s=0' alt='-E[s] = E[x_1&#124;y=1]' title='-E[s] = E[x_1&#124;y=1]' class='latex' /> if the variance of <img src='http://s0.wp.com/latex.php?latex=s&amp;bg=ffffff&amp;fg=000&amp;s=0' alt='s' title='s' class='latex' /> is non-zero (and if <img src='http://s0.wp.com/latex.php?latex=E%5Bs%5D+%26%2362%3B+1&amp;bg=ffffff&amp;fg=000&amp;s=0' alt='E[s] &gt; 1' title='E[s] &gt; 1' class='latex' />), which we were given by the fact that the pool sizes are variable across murders.</p>
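<p>The inequality is easy to check numerically on a finite collection of pools, using the finite-sample estimates from the argument above. This is a small sketch; the function name is hypothetical.</p>

```python
def surrogate_class_means(pool_sizes):
    """Estimate E[x_1 | y=1] and E[x_1 | y=0] for x_1 = -(pool size).

    Each pool is assumed to hold exactly one murderer and at least two suspects.
    """
    k = len(pool_sizes)
    total = sum(pool_sizes)
    # One murderer per pool, each assigned minus the size of his pool.
    mean_murderer = -total / k
    # Pool i contributes (s_i - 1) non-murderers, each assigned -s_i.
    mean_innocent = -sum((s - 1) * s for s in pool_sizes) / (total - k)
    return mean_murderer, mean_innocent


print(surrogate_class_means([2, 4, 6]))  # murderers: -4.0, non-murderers: -44/9
print(surrogate_class_means([4, 4, 4]))  # equal pool sizes: both means are -4.0
```

<p>The second call shows why the variability of the pool sizes matters: when every pool has the same size, the two means coincide and the surrogate carries no information.</p>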
<p>And there you have it.</p>
<p><strong>A less criminal example</strong></p>
<p>Let me try to demystify the approach even more.</p>
<p>Assume that we have a database of images of faces of people along with their last names. The problem is the following: given a face-name input pair, find the matching face and name in the database. Of course if all the names are unique this is trivial, but there are some common names like Smith and Lee.</p>
<p>Let us say that our procedure for this matching is to first obtain from the database the face images of the people with the same last name as the input, and then use our state-of-the-art face image comparison features to decide which of those is the right one. Do we need a training set of human-annotated input-database record pairs to train such a ranker? The above discussion suggests not.</p>
<p>Let us say we create a &#8220;fake&#8221; training set from a lot of unlabeled input image-name pairs by getting the set of records from the database with the matching name and assigning each feature vector a label which is the reciprocal of the size of the set. We then learn to predict this label from the image features.</p>
<p>The reason we would expect this predictor to behave similarly to the predictor of image match is that we can view our fake labeling as a probabilistic labeling of the feature vectors. For a given set our label is the a priori probability of a face image being the right match to the input image.</p>
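<p>Building the fake training set takes only a few lines. In the sketch below, each candidate set stands for the database records that share a last name with one input, already converted to comparison feature vectors; the function name and the data layout are assumptions made for illustration.</p>

```python
def reciprocal_size_training_set(candidate_sets):
    """Label every candidate with the reciprocal of its candidate-set size.

    candidate_sets: list of lists of feature vectors, one inner list per input.
    Returns a flat list of (features, target) pairs for a regressor.
    """
    examples = []
    for candidates in candidate_sets:
        if not candidates:
            continue  # no record shares the name, so there is nothing to label
        target = 1.0 / len(candidates)  # a priori chance of being the match
        for features in candidates:
            examples.append((features, target))
    return examples


sets = [[(0.1,), (0.9,)], [(0.5,)]]
print(reciprocal_size_training_set(sets))
```

<p>Any off-the-shelf regressor can then be fitted to these pairs, and its output used to rank the candidates for a new input.</p>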
<p>The independence assumption just makes sure that our probabilistic labeling is not biased.</p>
<p>I am curious about other possible applications.</p>
]]></content:encoded>
</item>
<item>
<title><![CDATA[Surrogate learning with mean independence]]></title>
<link>http://mlstat.wordpress.com/2009/08/07/surrogate-learning-with-mean-independence/</link>
<pubDate>Sat, 08 Aug 2009 00:41:03 +0000</pubDate>
<dc:creator>mlstat</dc:creator>
<guid>http://mlstat.wordpress.com/2009/08/07/surrogate-learning-with-mean-independence/</guid>
<description><![CDATA[In this paper we showed that if we had a feature that was class-conditionally statistically independ]]></description>
<content:encoded><![CDATA[<p>In <a href="http://www.aclweb.org/anthology/W/W09/W09-2202.pdf" target="_blank">this</a> paper we showed that if we had a feature <img src='http://s0.wp.com/latex.php?latex=x_1&amp;bg=ffffff&amp;fg=000&amp;s=0' alt='x_1' title='x_1' class='latex' /> that was class-conditionally statistically independent of the rest of the features, denoted <img src='http://s0.wp.com/latex.php?latex=x_2&amp;bg=ffffff&amp;fg=000&amp;s=0' alt='x_2' title='x_2' class='latex' />, learning a classifier between the two classes <img src='http://s0.wp.com/latex.php?latex=y%3D0&amp;bg=ffffff&amp;fg=000&amp;s=0' alt='y=0' title='y=0' class='latex' /> and <img src='http://s0.wp.com/latex.php?latex=y+%3D+1&amp;bg=ffffff&amp;fg=000&amp;s=0' alt='y = 1' title='y = 1' class='latex' /> can be transformed into learning a predictor of <img src='http://s0.wp.com/latex.php?latex=x_1&amp;bg=ffffff&amp;fg=000&amp;s=0' alt='x_1' title='x_1' class='latex' /> from <img src='http://s0.wp.com/latex.php?latex=x_2&amp;bg=ffffff&amp;fg=000&amp;s=0' alt='x_2' title='x_2' class='latex' /> and another of <img src='http://s0.wp.com/latex.php?latex=y&amp;bg=ffffff&amp;fg=000&amp;s=0' alt='y' title='y' class='latex' /> from <img src='http://s0.wp.com/latex.php?latex=x_1&amp;bg=ffffff&amp;fg=000&amp;s=0' alt='x_1' title='x_1' class='latex' />. Since the first predictor can be learned on unlabeled examples and the second is a classifier on a 1-D space, the learning problem becomes easy. In a sense <img src='http://s0.wp.com/latex.php?latex=x_1&amp;bg=ffffff&amp;fg=000&amp;s=0' alt='x_1' title='x_1' class='latex' /> acts as a <em>surrogate </em>for <img src='http://s0.wp.com/latex.php?latex=y&amp;bg=ffffff&amp;fg=000&amp;s=0' alt='y' title='y' class='latex' />.</p>
<p>Similar ideas can be found in <a href="http://www.machinelearning.org/proceedings/icml2007/papers/154.pdf" target="_blank">Ando and Zhang &#8217;07</a>, <a href="http://icml2008.cs.helsinki.fi/papers/337.pdf" target="_blank">Quadrianto et al. &#8217;08</a>, <a href="http://john.blitzer.com/papers/emnlp06.pdf">Blitzer et al. &#8217;06</a>, and others.</p>
<p><strong>Derivation from mean-independence</strong></p>
<p>I&#8217;ll now derive a similar surrogate learning algorithm from <em>mean independence </em>rather than full statistical independence. Recall that the random variable <img src='http://s0.wp.com/latex.php?latex=U&amp;bg=ffffff&amp;fg=000&amp;s=0' alt='U' title='U' class='latex' /> is mean-independent of  the r.v. <img src='http://s0.wp.com/latex.php?latex=V&amp;bg=ffffff&amp;fg=000&amp;s=0' alt='V' title='V' class='latex' /> if <img src='http://s0.wp.com/latex.php?latex=E%5BU%26%23124%3BV%5D+%3D+E%5BU%5D&amp;bg=ffffff&amp;fg=000&amp;s=0' alt='E[U&#124;V] = E[U]' title='E[U&#124;V] = E[U]' class='latex' />. Albeit weaker than full independence, mean-independence is still a pretty strong assumption. In particular it is stronger than the lack of correlation.</p>
<p>We assume that the feature space contains the single feature <img src='http://s0.wp.com/latex.php?latex=x_1&amp;bg=ffffff&amp;fg=000&amp;s=0' alt='x_1' title='x_1' class='latex' /> and the rest of the features <img src='http://s0.wp.com/latex.php?latex=x_2&amp;bg=ffffff&amp;fg=000&amp;s=0' alt='x_2' title='x_2' class='latex' />. We are still in a two-class situation, i.e., <img src='http://s0.wp.com/latex.php?latex=y+%5Cin+%5C%7B0%2C1%5C%7D&amp;bg=ffffff&amp;fg=000&amp;s=0' alt='y &#92;in &#92;{0,1&#92;}' title='y &#92;in &#92;{0,1&#92;}' class='latex' />. We further assume</p>
<p>1. <img src='http://s0.wp.com/latex.php?latex=x_1&amp;bg=ffffff&amp;fg=000&amp;s=0' alt='x_1' title='x_1' class='latex' /> is at least somewhat useful for classification, or in other words, <img src='http://s0.wp.com/latex.php?latex=E%5Bx_1%26%23124%3By%3D0%5D+%5Cneq+E%5Bx_1%26%23124%3By%3D1%5D&amp;bg=ffffff&amp;fg=000&amp;s=0' alt='E[x_1&#124;y=0] &#92;neq E[x_1&#124;y=1]' title='E[x_1&#124;y=0] &#92;neq E[x_1&#124;y=1]' class='latex' />.</p>
<p>2. <img src='http://s0.wp.com/latex.php?latex=x_1&amp;bg=ffffff&amp;fg=000&amp;s=0' alt='x_1' title='x_1' class='latex' /> is class-conditionally mean-independent of <img src='http://s0.wp.com/latex.php?latex=x_2&amp;bg=ffffff&amp;fg=000&amp;s=0' alt='x_2' title='x_2' class='latex' />, i.e., <img src='http://s0.wp.com/latex.php?latex=E%5Bx_1%26%23124%3By%2Cx_2%5D+%3D+E%5Bx_1%26%23124%3By%5D&amp;bg=ffffff&amp;fg=000&amp;s=0' alt='E[x_1&#124;y,x_2] = E[x_1&#124;y]' title='E[x_1&#124;y,x_2] = E[x_1&#124;y]' class='latex' /> for <img src='http://s0.wp.com/latex.php?latex=y+%5Cin+%5C%7B0%2C1%5C%7D&amp;bg=ffffff&amp;fg=000&amp;s=0' alt='y &#92;in &#92;{0,1&#92;}' title='y &#92;in &#92;{0,1&#92;}' class='latex' />.</p>
<p>Now let us consider the quantity <img src='http://s0.wp.com/latex.php?latex=E%5Bx_1%26%23124%3Bx_2%5D&amp;bg=ffffff&amp;fg=000&amp;s=0' alt='E[x_1&#124;x_2]' title='E[x_1&#124;x_2]' class='latex' />. We have</p>
<p style="line-height:19px;font:12px Monaco;margin:0 0 13px;"><img src='http://s0.wp.com/latex.php?latex=E%5Bx_1%26%23124%3Bx_2%5D%3D&amp;bg=ffffff&amp;fg=000&amp;s=0' alt='E[x_1&#124;x_2]=' title='E[x_1&#124;x_2]=' class='latex' /></p>
<p style="line-height:19px;font:12px Monaco;margin:0 0 13px;"><img src='http://s0.wp.com/latex.php?latex=%3DE%5Bx_1%26%23124%3Bx_2%2Cy%3D0%5D+P%28y%3D0%26%23124%3Bx_2%29+%2B+E%5Bx_1%26%23124%3Bx_2%2Cy%3D1%5D+P%28y%3D1%26%23124%3Bx_2%29+&amp;bg=ffffff&amp;fg=000&amp;s=0' alt='=E[x_1&#124;x_2,y=0] P(y=0&#124;x_2) + E[x_1&#124;x_2,y=1] P(y=1&#124;x_2) ' title='=E[x_1&#124;x_2,y=0] P(y=0&#124;x_2) + E[x_1&#124;x_2,y=1] P(y=1&#124;x_2) ' class='latex' /></p>
<p style="line-height:19px;font:normal normal normal 12px/normal Monaco;margin:0 0 13px;"><img src='http://s0.wp.com/latex.php?latex=%3DE%5Bx_1%26%23124%3By%3D0%5D+P%28y%3D0%26%23124%3Bx_2%29+%2B+E%5Bx_1%26%23124%3By%3D1%5D+P%28y%3D1%26%23124%3Bx_2%29+&amp;bg=ffffff&amp;fg=000&amp;s=0' alt='=E[x_1&#124;y=0] P(y=0&#124;x_2) + E[x_1&#124;y=1] P(y=1&#124;x_2) ' title='=E[x_1&#124;y=0] P(y=0&#124;x_2) + E[x_1&#124;y=1] P(y=1&#124;x_2) ' class='latex' /></p>
<p>Notice that <img src='http://s0.wp.com/latex.php?latex=E%5Bx_1%26%23124%3Bx_2%5D&amp;bg=ffffff&amp;fg=000&amp;s=0' alt='E[x_1&#124;x_2]' title='E[x_1&#124;x_2]' class='latex' /> is a convex combination of <img src='http://s0.wp.com/latex.php?latex=E%5Bx_1%26%23124%3By%3D0%5D&amp;bg=ffffff&amp;fg=000&amp;s=0' alt='E[x_1&#124;y=0]' title='E[x_1&#124;y=0]' class='latex' /> and <img src='http://s0.wp.com/latex.php?latex=E%5Bx_1%26%23124%3By%3D1%5D&amp;bg=ffffff&amp;fg=000&amp;s=0' alt='E[x_1&#124;y=1]' title='E[x_1&#124;y=1]' class='latex' />.</p>
<p>Now using the fact that <img src='http://s0.wp.com/latex.php?latex=P%28y%3D0%26%23124%3Bx_2%29+%2B+P%28y%3D1%26%23124%3Bx_2%29+%3D+1&amp;bg=ffffff&amp;fg=000&amp;s=0' alt='P(y=0&#124;x_2) + P(y=1&#124;x_2) = 1' title='P(y=0&#124;x_2) + P(y=1&#124;x_2) = 1' class='latex' />  we can show after some algebra that</p>
<p><img src='http://s0.wp.com/latex.php?latex=P%28y%3D1%26%23124%3Bx_2%29%3D%5Cfrac%7BE%5Bx_1%26%23124%3Bx_2%5D-E%5Bx_1%26%23124%3By%3D0%5D%7D%7BE%5Bx_1%26%23124%3By%3D1%5D-E%5Bx_1%26%23124%3By%3D0%5D%7D%5C%3B%5C%3B%5C%3B%5C%3B%5C%3B%5C%3B%281%29%26%2338%3Bs%3D1&amp;bg=ffffff&amp;fg=000&amp;s=0' alt='P(y=1&#124;x_2)=&#92;frac{E[x_1&#124;x_2]-E[x_1&#124;y=0]}{E[x_1&#124;y=1]-E[x_1&#124;y=0]}&#92;;&#92;;&#92;;&#92;;&#92;;&#92;;(1)&amp;s=1' title='P(y=1&#124;x_2)=&#92;frac{E[x_1&#124;x_2]-E[x_1&#124;y=0]}{E[x_1&#124;y=1]-E[x_1&#124;y=0]}&#92;;&#92;;&#92;;&#92;;&#92;;&#92;;(1)&amp;s=1' class='latex' /></p>
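<p>For completeness, the algebra is short. The shorthand below is introduced only for this derivation: p stands for P(y=1|x_2), and m_0 and m_1 stand for the class-conditional means E[x_1|y=0] and E[x_1|y=1].</p>

```latex
\begin{align*}
E[x_1 \mid x_2] &= m_0\,(1 - p) + m_1\,p \;=\; m_0 + (m_1 - m_0)\,p \\
\Rightarrow\quad p &= \frac{E[x_1 \mid x_2] - m_0}{m_1 - m_0}
\end{align*}
```

<p>The division is legitimate because the first assumption guarantees that the two class-conditional means differ.</p>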
<p>We have succeeded in decoupling <img src='http://s0.wp.com/latex.php?latex=y&amp;bg=ffffff&amp;fg=000&amp;s=0' alt='y' title='y' class='latex' /> and <img src='http://s0.wp.com/latex.php?latex=x_2&amp;bg=ffffff&amp;fg=000&amp;s=0' alt='x_2' title='x_2' class='latex' /> on the right hand side, which results in a simple semi-supervised classification method. We just need the class-conditional means of <img src='http://s0.wp.com/latex.php?latex=x_1&amp;bg=ffffff&amp;fg=000&amp;s=0' alt='x_1' title='x_1' class='latex' /> and a regressor (which can be learned on unlabeled data) to compute <img src='http://s0.wp.com/latex.php?latex=E%5Bx_1%26%23124%3Bx_2%5D&amp;bg=ffffff&amp;fg=000&amp;s=0' alt='E[x_1&#124;x_2]' title='E[x_1&#124;x_2]' class='latex' />. Again <img src='http://s0.wp.com/latex.php?latex=x_1&amp;bg=ffffff&amp;fg=000&amp;s=0' alt='x_1' title='x_1' class='latex' />, acting as a surrogate for <img src='http://s0.wp.com/latex.php?latex=y&amp;bg=ffffff&amp;fg=000&amp;s=0' alt='y' title='y' class='latex' />, is predicted from <img src='http://s0.wp.com/latex.php?latex=x_2&amp;bg=ffffff&amp;fg=000&amp;s=0' alt='x_2' title='x_2' class='latex' />.</p>
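<p>Here is a minimal sketch of the method on synthetic data, under assumptions chosen to keep it short: the remaining features are collapsed into a single discrete value, so the regressor is just a per-value average of the surrogate over unlabeled examples, and the two class-conditional means of the surrogate are passed in as known numbers (in practice they would be estimated from a handful of labeled examples). All names and distributions are made up for illustration.</p>

```python
import random
from collections import defaultdict


def surrogate_posteriors(pairs, mean0, mean1):
    """Estimate P(y=1 | x_2) from unlabeled (x_1, x_2) pairs via Equation (1).

    pairs: iterable of (x_1, x_2) with x_2 discrete
    mean0: E[x_1 | y=0];  mean1: E[x_1 | y=1]
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for x1, x2 in pairs:
        sums[x2] += x1
        counts[x2] += 1
    posteriors = {}
    for value in counts:
        regress = sums[value] / counts[value]        # estimate of E[x_1 | x_2]
        p = (regress - mean0) / (mean1 - mean0)      # Equation (1)
        posteriors[value] = min(1.0, max(0.0, p))    # clip estimation noise
    return posteriors


def make_unlabeled(n=30000, seed=0):
    """Draw (x_1, x_2) pairs; the class label is generated but then discarded."""
    rng = random.Random(seed)
    p_x2 = {0: [0.6, 0.3, 0.1], 1: [0.1, 0.3, 0.6]}  # P(x_2 | y), made up
    data = []
    for _ in range(n):
        y = rng.randint(0, 1)
        x2 = rng.choices([0, 1, 2], weights=p_x2[y])[0]
        x1 = rng.gauss(2.0 * y, 1.0)  # depends on y only, not on x_2
        data.append((x1, x2))
    return data


print(surrogate_posteriors(make_unlabeled(), mean0=0.0, mean1=2.0))
```

<p>For this generator Bayes&#8217; rule gives the true posteriors 1/7, 1/2 and 6/7 for the three values of the discrete feature, so the printed estimates can be compared against them directly.</p>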
<p>As opposed to the formulation in the paper this formulation easily accommodates continuous valued <img src='http://s0.wp.com/latex.php?latex=x_1&amp;bg=ffffff&amp;fg=000&amp;s=0' alt='x_1' title='x_1' class='latex' />.</p>
<p><strong>Discussion</strong></p>
<p>1. The first thing to note is that we are only able to write an expression for <img src='http://s0.wp.com/latex.php?latex=P%28y%26%23124%3Bx_2%29&amp;bg=ffffff&amp;fg=000&amp;s=0' alt='P(y&#124;x_2)' title='P(y&#124;x_2)' class='latex' /> but not <img src='http://s0.wp.com/latex.php?latex=P%28y%26%23124%3Bx_1%2C+x_2%29&amp;bg=ffffff&amp;fg=000&amp;s=0' alt='P(y&#124;x_1, x_2)' title='P(y&#124;x_1, x_2)' class='latex' />. That is, we are able to weaken the independence to mean-independence at the expense of &#8220;wasting&#8221; feature <img src='http://s0.wp.com/latex.php?latex=x_1&amp;bg=ffffff&amp;fg=000&amp;s=0' alt='x_1' title='x_1' class='latex' />.</p>
<p>Of course if we have full statistical independence we can use <img src='http://s0.wp.com/latex.php?latex=x_1&amp;bg=ffffff&amp;fg=000&amp;s=0' alt='x_1' title='x_1' class='latex' />, by using Equation (1) and the fact that, under independence, we have</p>
<p><img src='http://s0.wp.com/latex.php?latex=P%28y%26%23124%3Bx_1%2C+x_2%29+%5Cpropto+P%28y%26%23124%3Bx_2%29+P%28x_1%26%23124%3By%29&amp;bg=ffffff&amp;fg=000&amp;s=0' alt='P(y&#124;x_1, x_2) &#92;propto P(y&#124;x_2) P(x_1&#124;y)' title='P(y&#124;x_1, x_2) &#92;propto P(y&#124;x_2) P(x_1&#124;y)' class='latex' />.</p>
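<p>For completeness, the proportionality above follows in a few lines, assuming that &#8220;independence&#8221; here means conditional independence of x_1 and x_2 given y. Proportionality is with respect to y, with x_1 and x_2 held fixed:</p>

```latex
\begin{gather*}
P(y \mid x_1, x_2) \propto P(x_1, x_2 \mid y)\, P(y) = P(x_1 \mid y)\, P(x_2 \mid y)\, P(y) \\
P(x_2 \mid y)\, P(y) = P(y \mid x_2)\, P(x_2) \propto P(y \mid x_2) \\
\Rightarrow \quad P(y \mid x_1, x_2) \propto P(y \mid x_2)\, P(x_1 \mid y)
\end{gather*}
```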
<p>2. Assume, without loss of generality, that <img src='http://s0.wp.com/latex.php?latex=E%5Bx_1%26%23124%3By%3D1%5D+%26%2362%3B+E%5Bx_1%26%23124%3By%3D0%5D&amp;bg=ffffff&amp;fg=000&amp;s=0' alt='E[x_1&#124;y=1] &gt; E[x_1&#124;y=0]' title='E[x_1&#124;y=1] &gt; E[x_1&#124;y=0]' class='latex' />. Because <img src='http://s0.wp.com/latex.php?latex=E%5Bx_1%26%23124%3Bx_2%5D&amp;bg=ffffff&amp;fg=000&amp;s=0' alt='E[x_1&#124;x_2]' title='E[x_1&#124;x_2]' class='latex' /> lies somewhere between <img src='http://s0.wp.com/latex.php?latex=E%5Bx_1%26%23124%3By%3D0%5D&amp;bg=ffffff&amp;fg=000&amp;s=0' alt='E[x_1&#124;y=0]' title='E[x_1&#124;y=0]' class='latex' /> and <img src='http://s0.wp.com/latex.php?latex=E%5Bx_1%26%23124%3By%3D1%5D&amp;bg=ffffff&amp;fg=000&amp;s=0' alt='E[x_1&#124;y=1]' title='E[x_1&#124;y=1]' class='latex' />, Equation (1) says that <img src='http://s0.wp.com/latex.php?latex=P%28y%3D1%26%23124%3Bx_2%29&amp;bg=ffffff&amp;fg=000&amp;s=0' alt='P(y=1&#124;x_2)' title='P(y=1&#124;x_2)' class='latex' /> is a monotonically increasing function of <img src='http://s0.wp.com/latex.php?latex=E%5Bx_1%26%23124%3Bx_2%5D&amp;bg=ffffff&amp;fg=000&amp;s=0' alt='E[x_1&#124;x_2]' title='E[x_1&#124;x_2]' class='latex' />.</p>
<p>This means that <img src='http://s0.wp.com/latex.php?latex=E%5Bx_1%26%23124%3Bx_2%5D&amp;bg=ffffff&amp;fg=000&amp;s=0' alt='E[x_1&#124;x_2]' title='E[x_1&#124;x_2]' class='latex' /> itself can be used as the classifier, and labeled examples are needed only to determine a threshold for trading off precision vs. recall. The classifier (or perhaps we should call it a ranker) is therefore built <em>entirely</em> from unlabeled samples.</p>
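<p>As an illustration of that last step, here is a small Python sketch of picking the threshold from a few labeled examples. The helper name <code>pick_threshold</code>, the precision target, and the toy scores are all hypothetical; the scores stand for regressor outputs E[x_1|x_2] on the labeled examples.</p>

```python
def pick_threshold(scores, labels, min_precision):
    """Lowest cutoff such that items scoring >= cutoff meet min_precision.

    labels are 0/1. Returns (cutoff, precision, recall), or None if the
    target precision is never reached. Assumes distinct scores and at
    least one positive label.
    """
    ranked = sorted(zip(scores, labels), reverse=True)  # best score first
    total_pos = sum(labels)
    best = None
    hits = 0
    for k, (score, label) in enumerate(ranked, start=1):
        hits += label
        precision = hits / k  # precision if we cut right after this item
        if precision >= min_precision:
            best = (score, precision, hits / total_pos)
    return best


# Toy labeled examples: regressor score and true label.
scores = [4.8, 4.1, 3.6, 3.0, 2.2, 1.4]
labels = [1, 1, 0, 1, 0, 0]
result = pick_threshold(scores, labels, min_precision=0.75)
print(result)
```

<p>Keeping the lowest qualifying cutoff maximizes recall subject to the precision target; a different trade-off would only change the selection rule, not the ranking.</p>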
<p>I&#8217;ll post a neat little application of this method soon.</p>
]]></content:encoded>
</item>
<item>
<title><![CDATA[Classification and Clustering]]></title>
<link>http://hyunsook.wordpress.com/2008/09/18/classification-and-clustering/</link>
<pubDate>Thu, 18 Sep 2008 23:48:47 +0000</pubDate>
<dc:creator>HLee</dc:creator>
<guid>http://hyunsook.wordpress.com/2008/09/18/classification-and-clustering/</guid>
<description><![CDATA[Another deduced conclusion from reading preprints listed in arxiv/astro-ph is that astronomers tend]]></description>
<content:encoded><![CDATA[<p>Another conclusion I have drawn from reading preprints listed on arXiv/astro-ph is that astronomers tend to confuse <b>classification and clustering</b> and to mix up their methodologies. They tend to think that any algorithm from classification or clustering analysis will serve their purpose, since both kinds of algorithm, no matter what, look like a <b>black box</b>. I mean a black box as in a neural network, which is one kind of classification algorithm. <!--more--></p>
<p>Simply put, classification is a regression problem and clustering is a mixture problem with an unknown number of components. Defining a classifier, that is, a regression model, is the objective of classification, and determining the number of clusters is the objective of clustering. In classification, predefined classes exist, such as galaxy types and star types, and one wishes to know which predictor variables, in which functional form, allow one to separate quasars from stars without individual spectroscopic observations, relying only on a handful of variables from photometric data. In clustering analysis, there is no predefined class, but some plots suggest multiple populations, and one wishes to determine the number of clusters mathematically rather than concluding subjectively that the plot shows two clusters after some subjective data cleaning. A good example is that, as photons from gamma-ray bursts (GRBs) accumulate, extracting features like F_{90} and F_{50} enables scatter plots of many GRBs, which eventually led people to believe that there are multiple populations of GRBs. Clustering algorithms back the hypothesis in a more objective manner, as opposed to the subjective manner of scatter plots with non-statistical outlier elimination. </p>
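<p>To make &#8220;determining the number of clusters mathematically&#8221; concrete, here is a small Python sketch that fits one-dimensional Gaussian mixtures by EM and compares them with the Bayesian information criterion (BIC). BIC is just one common choice, not a method prescribed in this post, and the data are made-up toy numbers, not GRB measurements; the function names and the variance floor are likewise illustrative assumptions.</p>

```python
import math


def gauss_pdf(x, mu, var):
    return math.exp(-((x - mu) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def fit_mixture(xs, k, iters=100, var_floor=0.05):
    """EM for a 1-D Gaussian mixture with k components; returns log-likelihood."""
    xs = sorted(xs)
    n = len(xs)
    center = sum(xs) / n
    # Deterministic start: means at evenly spaced order statistics.
    mus = [xs[(2 * j + 1) * n // (2 * k)] for j in range(k)]
    spread = max(var_floor, sum((x - center) ** 2 for x in xs) / n)
    variances = [spread] * k
    weights = [1.0 / k] * k
    for _ in range(iters):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in xs:
            row = [w * gauss_pdf(x, m, v) for w, m, v in zip(weights, mus, variances)]
            total = sum(row)
            resp.append([r / total for r in row])
        # M-step: re-estimate weights, means, and variances.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / n
            mus[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var = sum(r[j] * (x - mus[j]) ** 2 for r, x in zip(resp, xs)) / nj
            variances[j] = max(var_floor, var)  # guard against collapse
    return sum(
        math.log(sum(w * gauss_pdf(x, m, v) for w, m, v in zip(weights, mus, variances)))
        for x in xs
    )


def bic(xs, k):
    """Bayesian information criterion; lower is better."""
    n_params = 3 * k - 1  # k means, k variances, k - 1 free weights
    return n_params * math.log(len(xs)) - 2 * fit_mixture(xs, k)


# Toy data: two well-separated groups.
group_a = [-1.0, -0.5, -0.2, 0.0, 0.3, 0.5, 1.0]
group_b = [9.0, 9.5, 9.8, 10.0, 10.3, 10.5, 11.0]
data = group_a + group_b
bic_by_k = {k: bic(data, k) for k in (1, 2, 3)}
best_k = min(bic_by_k, key=bic_by_k.get)
print(bic_by_k, best_k)
```

<p>The point of the sketch is only that the number of components is chosen by a stated criterion rather than by eye; real survey data would need multivariate models and a more careful treatment of outliers.</p>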
<p>However, it is challenging to draw a clear line between classification and clustering, both in statistics and in astronomy. In statistics, missing data is the phrase people use to describe this challenge. Fortunately, there is a field called <b>semi-supervised learning</b> that tackles it. (Supervised learning is equivalent to classification, and unsupervised learning to clustering.) Semi-supervised learning algorithms are applicable to data in which a portion has known class types and the rest are missing; astronomical catalogs with unidentified objects are good candidates for applying semi-supervised learning algorithms.</p>
<p>From the astronomy side, the fact that classes are not well defined, or are subjective, is the main cause of this confusion between classification and clustering, and also the origin of this challenge. For example, will astronomers A and B produce the same results when classifying galaxies according to Hubble&#8217;s tuning fork? (Check out the <a href="http://www.galaxyzoo.org">GALAXY ZOO</a> project.) We are not testing individual cognitive skills. Is there a consensus on where to make the cut between F9 stars and G0 stars? What makes a star F9.5 instead of G0? In the presence of error bars, how can one be sure that a star is F9 and not G0? I don&#8217;t see any decision-theoretic explanation in survey papers when those stellar spectral classes are presented. Classification is generally for data with categorical responses, but astronomers tend to turn something that used to be categorical into something continuous while still applying the same old classification algorithms designed for categorical responses. </p>
<p>From a clustering analysis perspective, this challenge is caused by outliers, or peculiar objects that do not belong to the majority. These peculiar objects may be numerous enough to make up a new, unprecedented class. Or their number may be so small that a strong belief prevails that these data points should be discarded as observational mistakes. How much can we trim data with unavoidable and uncontrollable contamination (remember, we cannot control astronomical data, as opposed to earthly kinds)? What is the primary factor that defines the number of clusters? Physics, statistics, astronomers&#8217; experience in processing and cleaning data, &#8230;</p>
<p>Once the ambiguity between classification and clustering and the complexity of the data sets are resolved, another challenge still awaits. <b>Which black box?</b> For most classification algorithms, <a href="http://www.groundtruth.info/AstroStat/slog/2008/prml">Pattern Recognition and Machine Learning</a> by C. Bishop offers a broad spectrum of black boxes. Yet the book does not include the various clustering algorithms that statisticians have developed, or outlier detection. To be more rigorous in selecting a black box for clustering analysis and outlier detection, I recommend checking: </p>
<ul>
<li><a href="http://www.amazon.com/Cluster-Analysis-Brian-S-Everitt/dp/0340761199/ref=pd_bbs_1/105-1140178-6406047?ie=UTF8&#38;s=books&#38;qid=1187972525&#38;sr=8-1">Cluster Analysis</a> by Everitt, Landau, and Leese
</li>
<li><a href="http://www.amazon.com/Data-Clustering-Algorithms-Applications-Probability/dp/0898716233/ref=pd_cp_b_3?pf_rd_p=413864201&#38;pf_rd_s=center-41&#38;pf_rd_t=201&#38;pf_rd_i=0340761199&#38;pf_rd_m=ATVPDKIKX0DER&#38;pf_rd_r=1PCTT1ZQF1QKF39JNMX1">Data Clustering: Theory, Algorithms, and Applications</a> by Gan, Ma, and Wu
</li>
<li>A collection of articles and presentation files on nonparametric multivariate analysis by <a href="http://www.utdallas.edu/~serfling/">Robert Serfling</a> (yes, the author of the classic book <i>Approximation Theorems of Mathematical Statistics</i>), particularly on data depth and outlier detection, and
</li>
<li> <a href="http://www.groundtruth.info/AstroStat/slog/2008/a-conversation-with-peter-huber">robust statistics</a> by Peter Huber
</li>
</ul>
<p>To me, astronomers seem to be in haste, owing to the pressure to publish results immediately after a data release, and so overlook methodologies suitable for their survey data. It seems that there is no time to consult machine learning specialists to verify the approaches they adopt. My personal prayer is that this haste does not settle in as a trend in astronomical surveys and large-scale data analysis.</p>
]]></content:encoded>
</item>

</channel>
</rss>
