<?xml version="1.0" encoding="UTF-8"?><!-- generator="wordpress.com" -->
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	>

<channel>
	<title>machine-learning &#171; WordPress.com Tag Feed</title>
	<link>http://en.wordpress.com/tag/machine-learning/</link>
	<description>Feed of posts on WordPress.com tagged "machine-learning"</description>
	<pubDate>Sat, 28 Nov 2009 09:22:09 +0000</pubDate>

	<generator>http://en.wordpress.com/tags/</generator>
	<language>en</language>

<item>
<title><![CDATA[What if Chuck Norris was available?]]></title>
<link>http://statsravingmad.wordpress.com/2009/11/23/what-if-chuck-norris-was-available/</link>
<pubDate>Sun, 22 Nov 2009 22:15:30 +0000</pubDate>
<dc:creator>Manos Parzakonis</dc:creator>
<guid>http://statsravingmad.wordpress.com/2009/11/23/what-if-chuck-norris-was-available/</guid>
<description><![CDATA[Ooh la la! Basic statistics is more useful than advanced machine learning. I can’t tell you how many]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p>Ooh la la!</p>
<blockquote><p><em>Basic statistics is more useful than advanced machine learning.</em></p>
<p><em> I can’t tell you how many interviews I’ve had where someone has a really cool project on their resume. Support vector machines, topic analysis on CiteSeer, or whatever… But what it boils down to is someone took toy data set A and plugged it in to machine learning library B and took the output and was like, “sweet.”</em></p>
<p><em> People with “machine learning” on their resume fall from the sky these days, it seems to be a very sexy discipline. The problem is if I ask them explain a t-test, those same people can’t tell me what that is.</em></p>
<p><em> If I had a MacGyver of data analysis and all he had was a t-test and regression, he would probably be able to do 99.9% of the analyses that we do that are actually useful.</em> (<a href="http://jeffhammerbacher.com/">Jeff Hammerbacher</a>)</p></blockquote>
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[Fortune's Formula (Part 2)]]></title>
<link>http://chobau.wordpress.com/2009/11/19/cong-thuc-cua-van-may-phan-2/</link>
<pubDate>Thu, 19 Nov 2009 07:06:58 +0000</pubDate>
<dc:creator>chobau</dc:creator>
<guid>http://chobau.wordpress.com/2009/11/19/cong-thuc-cua-van-may-phan-2/</guid>
<description><![CDATA[“Fortune's Formula” is one of the most engaging books by William Poundstone (author of]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p><em>“Fortune's Formula”</em> is one of the most engaging books by <strong>William Poundstone</strong> (author of <em>“How Would You Move Mount Fuji?”</em>, a bestseller in <strong>Vietnam</strong>). The author reveals the secret of a mathematical formula, known as the “<em>Kelly formula</em>”, that helps you seize your luck at the casino and on the stock market. The book combines gambling, horse racing, stock investing and mathematical precision; it is a handbook for anyone who wants to apply the Kelly formula to get rich.<br />
<strong><!--more-->Project X</strong></p>
<p><strong>IT WAS CALLED PROJECT X</strong>. The project was not declassified until 1976. It was a joint effort of Bell Labs and Britain's Government Code and Cypher School at Bletchley Park, north of London. Project X bore comparison with the Manhattan Project: it was staffed by a team of British and American scientists that included not only Shannon but also Alan Turing. They were building a system called SIGSALY. The name was not an acronym for anything, just a random string of letters meant to confuse the Germans should they ever look into it.</p>
<p>SIGSALY was the first digitally scrambled radiotelephone. Each SIGSALY terminal was a room-sized computer weighing 55 tons, with a separate room for the user and an air-conditioning system to keep its vacuum tubes from melting down. The machine let the Allied leaders talk to one another freely without fear of enemy eavesdropping. The Allies installed one SIGSALY at the Pentagon for Roosevelt and another in the basement of a Selfridges department store for Churchill. Two more were placed in North Africa for Field Marshal Montgomery and on Guam for General MacArthur.</p>
<p>SIGSALY used the only cryptosystem known to be unbreakable, the <em>“one-time pad”</em>. In a one-time pad, the <em>“key”</em> used to scramble and unscramble a message is random. The key consists of a sequence of letters or digits arranged with no pattern at all, so the key itself is random and offers no regularity that a codebreaker could exploit. The trouble with the one-time pad is that the key must be delivered by courier to everyone using the system, a real challenge in wartime.</p>
<p>SIGSALY encrypted speech rather than text. Its key was a vinyl record of random <em>“white noise”</em>. Adding this <em>“white noise”</em> to Roosevelt's voice turned it into an unintelligible hiss. The only way to recover what Roosevelt had said was to play an identical record of the noise at the receiving end and <em>“subtract”</em> it. Once exactly the number of key records required had been pressed, the master recording was destroyed, and the LP copies were carried by trusted couriers to the SIGSALY sites. It was crucial that the SIGSALY turntables run at the same speed with absolute precision. If one drifted even slightly, the output was abruptly replaced by noise.</p>
<p>Alan Turing broke the Germans' <em>“Enigma”</em> cipher, which let the Allies eavesdrop on German command messages. The purpose of SIGSALY was to make sure the Germans could not do the same to the Allies. Part of Shannon's job was to prove that the system really was impregnable to anyone who did not hold the key. Without that mathematical assurance, the Allied commanders could not have spoken freely to one another. SIGSALY was the first system to put several of Shannon's ideas into practice, among them ideas related to pulse code modulation (a method for converting an analog input signal into a corresponding digital signal that resists noise). AT&#38;T patented and commercialized many of Shannon's ideas in the postwar years.</p>
<p>Shannon later said that studying how to conceal information with random noise spurred the development of information theory. He said: <em>“A secrecy system is almost identical with a noisy communication system.”</em> The two lines of research <em>“were so closely connected that you could not separate them.”</em></p>
<p>In 1943 Alan Turing visited Bell Labs in New York. Every day Turing and Shannon talked in the cafeteria at work. Shannon told Turing that he was looking for a way to measure information. He used a unit called the “bit” and credited the name to John Tukey, another mathematician at Bell Labs. “<em>Bit</em>” is short for <em>“binary digit”</em>. As Shannon defined it, a bit is the amount of information needed to distinguish between two equally likely outcomes.</p>
<p>Turing told Shannon that he had come up with a unit called the “<em>ban</em>”, the amount of evidence that makes a guess ten times more likely to be correct. The British cryptographer had drawn the idea in part from the work of breaking the Germans' Enigma cipher. “<em>Ban</em>” came from “<em>Banbury</em>”, the town that produced the sheets of paper the codebreaking team used.</p>
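<p>The two units can be compared with a short sketch. It assumes SWI-Prolog, and the predicate names bits/2 and bans/2 are made up for this illustration; they do not come from the book.</p>
<pre>
% bits(+N, -Bits): information, in bits, needed to pick out one of
% N equally likely outcomes (logarithm of N to base 2).
bits(N, Bits) :-
    Bits is log(N) / log(2).

% bans(+Factor, -Bans): weight of evidence, in bans, that makes a
% guess Factor times more likely (logarithm of Factor to base 10).
bans(Factor, Bans) :-
    Bans is log(Factor) / log(10).

% ?- bits(2, B).    % two equally likely outcomes: B = 1.0
% ?- bans(10, B).   % a tenfold improvement:       B = 1.0
</pre>
<p>Because both units are logarithms that differ only in their base, one ban corresponds to log2(10) bits, roughly 3.32.</p>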
<p>It was the “<em>bit</em>”, not the “<em>ban</em>”, that changed the world, starting in 1948. After the war Shannon continued to work at Bell Labs. One day he noticed a strange object on a colleague's desk and asked what it was. The researcher, William Shockley, answered:</p>
<p><strong><em>“It's a solid-state amplifier.”</em></strong></p>
<p>It was the world's first <strong>transistor</strong>. Shockley told Shannon that this amplifier could do anything a vacuum tube could do.</p>
<p>It was tiny. Shannon learned that the new device worked by bringing different materials into contact with one another. It could be made as small as desired, so long as the materials still touched.</p>
<p>The transistor was the tool that would turn many applications of Shannon's theory into reality. This chance encounter took place in late 1947 or early 1948, before Bell Labs announced the invention of the transistor on June 30, and only shortly before Shannon's classic paper on information theory appeared.</p>
<p>There is a minor scandal attached to that paper. Shannon published <em>“A Mathematical Theory of Communication”</em> in the <strong>Bell System Technical Journal</strong> in 1948. He was 32. Most of the work on information theory had been completed years earlier, between 1939 and 1943. Shannon told only a few people what he was working on. By habit he worked alone, with his office door always shut. When they learned of the work, people at Bell Labs were astonished that Shannon had achieved such an important result, and they wanted to be involved. To them it was nothing less than a scientific discovery, and they urged him to publish it. Shannon remembered writing up the paper as a nightmare. He insisted that he had developed the theory out of pure curiosity, not out of a desire to advance technology or his own career.</p>
<p>The year 1948 also marked a turning point in Shannon's private life. Shannon often dropped by John Pierce's office to chat. Pierce was working on radar and was known as an avid science fiction fan. On these visits Shannon met Pierce's assistant, Mary Elizabeth Moore. “Betty” Moore did the computing for the mathematics group, working on an old-style desktop calculator. Moore was lively and capable in a “<em>Rosie-the-Riveter</em>” way, able to use the drill and the lathe in the lab's machine shop. She was attractive, and one of only three women working there. (“<em>One was married and the other was 50 years old,</em>” Betty recalled.) She and Claude had their first date in December 1948. They married on March 27, 1949.</p>
<p>Shannon began teaching at MIT in the spring semester of 1956. At first the appointment was only temporary, and at least one friend at Bell Labs (John Riordan) understood that there was an unspoken reason behind it. It was assumed that Shannon was moving to MIT in order to have more free time to write a book on information theory.</p>
<p>In a letter to Hendrik Bode, his boss at Bell Labs, Shannon wrote: “<em>I am having a very enjoyable time at MIT. The course is going very well, but the workload is heavy. At first I hoped for a small group of eight to ten good students, yet on the very first day forty people came to register, among them faculty members from MIT, Harvard…”</em></p>
<p>After only a few months at MIT, Shannon sent Bode his resignation from Bell Labs and moved to MIT to teach. He found that he and Betty liked the intellectual and cultural life of Cambridge. “<em>Foreign visitors often spend a day at Bell Labs but spend six months at MIT. This provides opportunities for a real exchange of experience and ideas. Weighing all the advantages and disadvantages, I find Bell Labs and the high-level professional environment here equally important, but after fifteen years at Bell Labs I feel I have become somewhat stale and less productive. I think a change of research environment and new colleagues will stimulate me to work better.</em>” Shannon explained to Bode.</p>
<p>Shannon approached MIT about a stable, long-term position. Money was not the issue. Bell Labs offered a very attractive salary, but Shannon turned it down (he went on collaborating with Bell Labs until 1972). His starting salary at MIT was $17,000 a year.</p>
<p>Shannon enjoyed the stimulation of MIT only up to a point. He usually did his best work alone. He may have underestimated how much of a nuisance his reputation as a “living legend” would be at this large school in the middle of the city. Shannon “began showing up less and less, as if he were vanishing,” Robert Fano recalled.</p>
<p>Shannon took on a few doctoral students. They usually had to meet him at his home for guidance. One student, William Sutherland, remembers arriving at Shannon's house more than once while he was practicing the oboe. Betty said: “<em>He slept whenever he felt sleepy and often sat for hours at the dining table, thinking.”</em></p>
<p>Shannon stopped publishing his research. The book he had talked about was never finished. His papers at the Library of Congress contain nothing of it but a few handwritten notes related to the plan.</p>
<p>Artificial intelligence pioneer Marvin Minsky believed that Shannon stopped working on information theory because he felt he had already proved everything that needed proving. Nobody could match Shannon's independence in his work. Fano's point was that this was something out of the ordinary. In the rare cases when an information theorist raised some error with Shannon, <em>(a) he already knew about it</em> and <em>(b) he had already corrected it, but had no intention of announcing the fact</em>.</p>
<p><em>“I just followed and developed my various interests. Because life is always changing, you have to change your direction too.”</em> Shannon said of his freewheeling way of working.</p>
<p>One of these interests was artificial intelligence. Shannon organized the first specialist conference on the subject, at Dartmouth in 1956. Shannon's fame was one reason people paid so much attention to the topic. Several devices that Shannon built, including the first chess-playing computer and another intelligent machine, were early building blocks in the history of machine learning. Shannon was an articulate advocate, with enough vision to see what wonders might become real and enough realism to understand that they would not arrive in his lifetime. He had a gift for deflecting the clumsy questions that so often came up:</p>
<p><strong>Q:</strong> Do you think robots will be intelligent enough to be friends with people?</p>
<p><strong>A:</strong> I think so. But that future is still quite far off.</p>
<p><strong>Q: </strong>Can you imagine what a robot President of the United States would be like?</p>
<p><strong>A:</strong> Yes, I can imagine it. But by then I think you would no longer be speaking of the United States. It would be an entirely different thing.</p>
<p>Letters, papers and phone calls from world-famous scientists poured into Shannon's office. They wanted him to review a paper or write one for them, to give a talk, an opinion or a piece of advice. Shannon turned all of these requests down. As his name became widely known to the public, he began getting letters from schools setting up science projects for children and from cranks pursuing wild ideas about science, computers and the phone company (“Dear Sir,” one letter began, <em>“Your Bel robot, an idol (Daniel 14) in the Bible, is a monster machine,… You are creating a traitor, aiding the President of the United States and the FBI by letting the robot deceive you. I am afraid I will have to sue the New York Telephone company, and I will if you do not wake up”</em>).</p>
<p>From time to time the CIA and other agencies still turned to Shannon when they had trouble breaking a cipher. Shannon politely reminded them that he was retired. A 1983 letter from a CIA employee, Philip H. McCallum, reads in part: <em>“We did not pick you at random. We needed a superb mind with original ideas, and we have to admit that you are always the first person we think of… Although we understand that money does not interest you, we will still pay you a fee.”</em></p>
<p>Shannon did not like to answer a letter until he had composed the perfect reply. That took time, so he sorted his letters into bins with labels such as “Letters unanswered for too long”. These letters are now carefully preserved with Shannon's other papers at the Library of Congress, and many of them are still waiting for a reply.</p>
<p>Shannon was only 40 when he went into early, unofficial retirement. From then on he was MIT's own Bartleby, the character whose trademark reply was “I would prefer not to”: in other words, the clerk of the Dead Letter Office.</p>
<p><strong>Excerpt</strong> from “<em>Công thức của vận may</em>” (Fortune's Formula), <strong>NXB Trẻ</strong>, translated by <em>Hoàng Trung</em> and <em>Hồng Vân</em></p>
<p>(To be continued)</p>
<p><em>Next installment</em>: <strong>The racetrack tipster in the ugly suit</strong></p>
<p style="text-align:right;"><strong>Tuấn Phong</strong></p>
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[Apache Mahout 0.2 Released - Now classify, cluster and generate recommendations!]]></title>
<link>http://techdigger.wordpress.com/2009/11/18/apache-mahout-0-2-released-now-classify-cluster-and-generate-recommendations/</link>
<pubDate>Wed, 18 Nov 2009 13:48:32 +0000</pubDate>
<dc:creator>TechDigger</dc:creator>
<guid>http://techdigger.wordpress.com/2009/11/18/apache-mahout-0-2-released-now-classify-cluster-and-generate-recommendations/</guid>
<description><![CDATA[Apache Mahout For the past two years, I have been working with this amazing bunch of people whilst, ]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><div class="wp-caption alignright" style="width: 92px"><a href="http://lucene.apache.org/mahout"><img src="http://lucene.apache.org/mahout/images/Mahout-logo-82x100.png" alt="Apache Mahout" width="82" height="100" /></a><p class="wp-caption-text">Apache Mahout</p></div>
<p align="justify">
For the past two years I have been working with an amazing bunch of people on a project called <a href="http://lucene.apache.org/mahout">Mahout</a>, while being paid by Google through its Summer of Code program. As the name suggests, it is trying to tame the young beast known as <a href="http://hadoop.apache.org">Hadoop</a>. I have received a lot from the community. Being part of the project has given me real exposure to Java, data mining and machine learning, and hands-on experience with distributed systems like <a href="http://hadoop.apache.org">Hadoop</a>, <a href="http://hadoop.apache.org/hbase">HBase</a> and <a href="http://hadoop.apache.org/pig">Pig</a>. The project is still in its infancy, but its ambitions are sky-high. I am happy to announce the second release of the project, and proud to be a part of it. I hope people will adopt it in their projects and that it becomes the de facto standard machine learning library, the way Lucene and Hadoop have in their respective areas.
</p>
<p>If you are already excited and want to take it for a ride, read Grant&#8217;s article on IBM developerWorks <a href="https://www.ibm.com/developerworks/java/library/j-mahout/index.html">here</a>.<br />
The release announcement follows.</p>
<div align="justify" style="font-size:90%;border:1px dashed #337733;padding:10px;">
<p>Apache Mahout 0.2 has been released and is now available for public download at <a href="http://www.apache.org/dyn/closer.cgi/lucene/mahout">http://www.apache.org/dyn/closer.cgi/lucene/mahout</a></p>
<p>Up-to-date Maven artifacts can be found in the Apache repository at<br />
<a href="https://repository.apache.org/content/repositories/releases/org/apache/mahout/">https://repository.apache.org/content/repositories/releases/org/apache/mahout/</a></p>
<p>Apache Mahout is a subproject of Apache Lucene with the goal of delivering scalable machine learning algorithm implementations under the Apache license. http://www.apache.org/licenses/LICENSE-2.0</p>
<p>Mahout is a machine learning library meant to scale: Scale in terms of community to support anyone interested in using machine learning. Scale in terms of business by providing the library under a commercially friendly, free software license. Scale in terms of computation to the size of data we manage today.</p>
<p>Built on top of the powerful map/reduce paradigm of the Apache Hadoop project, Mahout lets you solve popular machine learning problems such as clustering, collaborative filtering and classification<br />
over terabytes of data across thousands of computers.</p>
<p>Implemented with scalability in mind, the latest release brings many performance optimizations, so the library performs well even in a single-node setup.</p>
<p>The complete changelist can be found here:</p>
<p><a href="http://issues.apache.org/jira/browse/MAHOUT/fixforversion/12313278">http://issues.apache.org/jira/browse/MAHOUT/fixforversion/12313278</a></p>
<p>New Mahout 0.2 features include</p>
<ul>
<li>Major performance enhancements in Collaborative Filtering, Classification and Clustering</li>
<li>New: Latent Dirichlet Allocation (LDA) implementation for topic modelling</li>
<li>New: Frequent Itemset Mining for mining top-k patterns from a list of transactions</li>
<li>New: Decision Forests implementation for Decision Tree classification (In Memory &#38; Partial Data)</li>
<li>New: HBase storage support for Naive Bayes model building and classification</li>
<li>New: Generation of vectors from Text documents for use with Mahout Algorithms</li>
<li>Performance improvements in various Vector implementations</li>
<li>Tons of bug fixes and code cleanup</li>
</ul>
<p>Getting started: New to Mahout?</p>
<ul>
<li> Download Mahout at <a href="http://www.apache.org/dyn/closer.cgi/lucene/mahout">http://www.apache.org/dyn/closer.cgi/lucene/mahout</a></li>
<li> Check out the Quick start: <a href="http://cwiki.apache.org/MAHOUT/quickstart.html">http://cwiki.apache.org/MAHOUT/quickstart.html</a></li>
<li> Read the Mahout Wiki: <a href="http://cwiki.apache.org/MAHOUT">http://cwiki.apache.org/MAHOUT</a></li>
<li> Join the community by subscribing to mahout-user@lucene.apache.org</li>
<li> Give back: <a href="http://www.apache.org/foundation/getinvolved.html">http://www.apache.org/foundation/getinvolved.html</a></li>
<li> Consider adding yourself to the Powered By wiki page: <a href="http://cwiki.apache.org/MAHOUT/poweredby.html">http://cwiki.apache.org/MAHOUT/poweredby.html</a></li>
</ul>
<p>For more information on Apache Mahout, see <a href="http://lucene.apache.org/mahout">http://lucene.apache.org/mahout</a></p>
</div>
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[What is and Who needs "Machine Learning"?]]></title>
<link>http://someoneatsomewhere.wordpress.com/2009/11/18/what-is-and-who-need-machine-learning/</link>
<pubDate>Wed, 18 Nov 2009 08:54:08 +0000</pubDate>
<dc:creator>someoneatsomewhere</dc:creator>
<guid>http://someoneatsomewhere.wordpress.com/2009/11/18/what-is-and-who-need-machine-learning/</guid>
<description><![CDATA[Machine Learning, which nowadays also merged with Pattern Recognition, has many successful applicati]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p>Machine Learning, which has nowadays merged with Pattern Recognition, has many successful applications. Nevertheless, the two questions above are always difficult to explain to others. My probability of giving laymen good answers to these two fundamental questions is very low.</p>
<p>Fortunately, I discovered Prof. Nello Cristianini&#8217;s talks on these topics. Here they are:</p>
<p><a href="http://videolectures.net/aop05_cristianini_ap/">The Analysis of Patterns</a></p>
<p><a href="http://videolectures.net/aop07_cristianini_wnd/">Who needs Patterns?</a></p>
<p>These two are also very interesting.</p>
<p><a href="http://videolectures.net/ecmlpkdd09_christianini_awty/">Are We There Yet?</a></p>
<p><a href="http://videolectures.net/aop05_chaitin_sin/">Nello Cristianini asking Gregory Chaitin about &#8220;Pattern&#8221;</a></p>
<p>Have Fun!</p>
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[List of Useful Books]]></title>
<link>http://someoneatsomewhere.wordpress.com/2009/11/17/list-of-useful-books/</link>
<pubDate>Tue, 17 Nov 2009 08:08:26 +0000</pubDate>
<dc:creator>someoneatsomewhere</dc:creator>
<guid>http://someoneatsomewhere.wordpress.com/2009/11/17/list-of-useful-books/</guid>
<description><![CDATA[Someone created a list of machine learning and related books which are very useful: http://www.amazo]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p>Someone created a list of machine learning and related books which are very useful:<br />
<a href="http://www.amazon.com/Pattern-Analysis-Foundations-and-Mathematics/lm/R2L8XSDG2DT9MR/ref=cm_lm_byauthor_title_full">http://www.amazon.com/Pattern-Analysis-Foundations-and-Mathematics/lm/R2L8XSDG2DT9MR/ref=cm_lm_byauthor_title_full</a></p>
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[Agent Framework]]></title>
<link>http://aiguy.wordpress.com/2009/11/16/agent-framework/</link>
<pubDate>Tue, 17 Nov 2009 05:45:03 +0000</pubDate>
<dc:creator>aiguy</dc:creator>
<guid>http://aiguy.wordpress.com/2009/11/16/agent-framework/</guid>
<description><![CDATA[As part of my ongoing agent training, I re-read the agent environment algorithm in Chapter 2 of AIMA]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p>As part of my ongoing agent training, I re-read the agent environment algorithm in Chapter 2 of AIMA (Russell and Norvig, 1995).  The key challenge is how to convert an algorithm with repeat and for loops into viable Prolog code.  At first inspection, the implementation is best suited for a language with control loops built into the programming language (e.g., LISP).  My primary Prolog books [Bratko 1990] and [Sterling and Shapiro 1994] did not provide the required help.  However, upon rereading Prolog Programming In Depth (Covington, Nute, and Vellino 1997), I was able to figure out how to write loops.</p>
<p><!--more-->Hence, I wrote a draft of the Prolog code on paper. Next, I prepared my Linux test environment and put the code into files using a text editor. I then wrote a test file to exercise the individual predicates. After much debugging, revising, and rewriting, I had a working environment for an agent framework.</p>
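<p>For readers facing the same problem, here is a rough sketch of the usual ways to express loops in Prolog. It is an illustration only, not code taken from the books cited; it assumes SWI-Prolog, and print_all/1, print_range/2, loop_until/4, dec/2 and is_zero/1 are hypothetical names chosen for the example.</p>
<pre>
% "For each" loop by recursion over a list: visit the items in order.
print_all([]).
print_all([Item|Rest]) :-
    format('item: ~w~n', [Item]),
    print_all(Rest).

% Counting loop written as a failure-driven loop: between/3 enumerates
% I on backtracking and forall/2 runs the body once for each value.
print_range(Low, High) :-
    forall(between(Low, High, I),
           format('i = ~w~n', [I])).

% "Repeat until" loop: Done is checked before each step; while it
% fails, Step maps the current state to the next one.
loop_until(Done, _Step, State, State) :-
    call(Done, State), !.
loop_until(Done, Step, State, Final) :-
    call(Step, State, Next),
    loop_until(Done, Step, Next, Final).

% Helpers used to try loop_until/4.
dec(N, M) :- M is N - 1.
is_zero(0).

% ?- loop_until(is_zero, dec, 3, Final).
% Final = 0.
</pre>
<p>Passing the step and the stopping test as arguments and invoking them with call/N is the same idea the environment code relies on for its update function and termination check.</p>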
<p>Enclosed is my code (only displayed for educational, noncommercial purposes):</p>
<pre>
/*
 * blockenv.pl
 */

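% run_environment(+State, +Updatefn, +Agents, +Termination)
% One cycle: gather a percept per agent, ask each agent for an action,
% apply the update function, then stop or recurse on the new state.
run_environment(State, Updatefn, Agents, Termination) :-
   get_percepts(Agents, State, Percepts),
   get_actions(Percepts, Actions),
   call(Updatefn, Actions, State, NewState),
   !,
   (call(Termination, NewState) -&#62; true ;
   run_environment(NewState, Updatefn, Agents, Termination)).

% get_percepts(+Agents, +State, -Percepts)
% Every agent sees the whole state, wrapped as percept(Agent, State).
get_percepts([], _State, []).

get_percepts([Agent&#124;Agents], State, Percepts) :-
    get_percepts(Agents, State, Percepts1),
    append([percept(Agent, State)], Percepts1, Percepts).

% get_actions(+Percepts, -Actions)
% Calls each agent as Agent(AgentPercept, Action) and collects
% the results as action(Agent, Action) terms.
get_actions([], []).

get_actions([Percept&#124;Percepts], Actions) :-
    get_actions(Percepts, Actions1),
    arg(1, Percept, Agent),
    arg(2, Percept, AgentPercept),
    call(Agent, AgentPercept, Action),
    append([action(Agent, Action)], Actions1, Actions).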
</pre>
<pre>
/*
 * blktest.pl
 */

:- consult('../environment/blockenv.pl').

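% Check get_percepts/3 on its own with one agent and a sample state.
test_percepts(Percepts) :-
   get_percepts([planner], [handempty, clear(a), ontable(a), counter(1)], Percepts).

% Check get_actions/2 on the percepts produced above.
test_actions(Actions) :-
   test_percepts(Percepts),
   get_actions(Percepts, Actions).

% planner(+Percept, -Action): a stub agent that always does nothing.
planner(_, noaction).

% update_counter(+State, -NewState): copy the state, decrementing the
% counter(N) term by one.
update_counter([], []).

update_counter([H&#124;Rest], NewState) :-
   update_counter(Rest, NewState1),
   ( counter(X) = H -&#62; Y is X - 1 ,
     append([counter(Y)], NewState1, NewState)
   ;
     append([H], NewState1, NewState)
   ).

% update_state(+Actions, +State, -NewState): the update function passed
% to run_environment/4; the actions are ignored in this test.
update_state(_, State, NewState) :-
   update_counter(State, NewState).

% termination(+State): stop once the counter has reached zero.
termination(State) :-
   member(counter(0), State).

% Run the whole loop: the counter starts at 2, so it takes two cycles.
test_environment :-
   run_environment([handempty, counter(2)], update_state, [planner],
           termination).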
</pre>
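<p>To try the test file, queries like the ones below can be typed at the Prolog prompt. This is only a sketch: it assumes SWI-Prolog is started in the directory that contains blktest.pl and that blockenv.pl sits at the relative path used in the consult directive. The answers are shown as a recent SWI-Prolog prints them; older releases word success differently.</p>
<pre>
?- consult('blktest.pl').

?- test_percepts(P).
P = [percept(planner, [handempty, clear(a), ontable(a), counter(1)])].

?- test_actions(A).
A = [action(planner, noaction)].

?- test_environment.
true.
</pre>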
<p>Enclosed is a test trace.</p>
<pre>
[trace] 5 ?- test_environment.
 T Call: (7) test_environment
   Call: (7) test_environment ? creep
   Call: (8) run_environment([handempty, counter(2)], update_state, [planner], termination) ? creep
   Call: (9) get_percepts([planner], [handempty, counter(2)], _L214) ? creep
   Call: (10) get_percepts([], [handempty, counter(2)], _L238) ? creep
   Exit: (10) get_percepts([], [handempty, counter(2)], []) ? creep
   Call: (10) lists:append([percept(planner, [handempty, counter(2)])], [], _L214) ? creep
   Exit: (10) lists:append([percept(planner, [handempty, counter(2)])], [], [percept(planner, [handempty, counter(2)])]) ? creep
   Exit: (9) get_percepts([planner], [handempty, counter(2)], [percept(planner, [handempty, counter(2)])]) ? creep
   Call: (9) get_actions([percept(planner, [handempty, counter(2)])], _L215) ? creep
   Call: (10) get_actions([], _L237) ? creep
   Exit: (10) get_actions([], []) ? creep
   Call: (10) arg(1, percept(planner, [handempty, counter(2)]), _L238) ? creep
   Exit: (10) arg(1, percept(planner, [handempty, counter(2)]), planner) ? creep
   Call: (10) arg(2, percept(planner, [handempty, counter(2)]), _L239) ? creep
   Exit: (10) arg(2, percept(planner, [handempty, counter(2)]), [handempty, counter(2)]) ? creep
   Call: (10) planner([handempty, counter(2)], _L240) ? creep
   Exit: (10) planner([handempty, counter(2)], noaction) ? creep
   Call: (10) lists:append([action(planner, noaction)], [], _L215) ? creep
   Exit: (10) lists:append([action(planner, noaction)], [], [action(planner, noaction)]) ? creep
   Exit: (9) get_actions([percept(planner, [handempty, counter(2)])], [action(planner, noaction)]) ? creep
   Call: (9) update_state([action(planner, noaction)], [handempty, counter(2)], _L216) ? creep
   Call: (10) update_counter([handempty, counter(2)], _L216) ? creep
   Call: (11) update_counter([counter(2)], _L255) ? creep
   Call: (12) update_counter([], _L278) ? creep
   Exit: (12) update_counter([], []) ? creep
   Call: (12) counter(_G536)=counter(2) ? creep
   Exit: (12) counter(2)=counter(2) ? creep
^  Call: (12) _L280 is 2-1 ? creep
^  Exit: (12) 1 is 2-1 ? creep
   Call: (12) lists:append([counter(1)], [], _L255) ? creep
   Exit: (12) lists:append([counter(1)], [], [counter(1)]) ? creep
   Exit: (11) update_counter([counter(2)], [counter(1)]) ? creep
   Call: (11) counter(_G549)=handempty ? creep
   Fail: (11) counter(_G549)=handempty ? creep
   Call: (11) lists:append([handempty], [counter(1)], _L216) ? creep
   Exit: (11) lists:append([handempty], [counter(1)], [handempty, counter(1)]) ? creep
   Exit: (10) update_counter([handempty, counter(2)], [handempty, counter(1)]) ? creep
   Exit: (9) update_state([action(planner, noaction)], [handempty, counter(2)], [handempty, counter(1)]) ? creep
   Call: (9) termination([handempty, counter(1)]) ? creep
   Call: (10) lists:member(counter(0), [handempty, counter(1)]) ? creep
   Fail: (10) lists:member(counter(0), [handempty, counter(1)]) ? creep
   Fail: (9) termination([handempty, counter(1)]) ? creep
   Call: (9) run_environment([handempty, counter(1)], update_state, [planner], termination) ? creep
   Call: (10) get_percepts([planner], [handempty, counter(1)], _L237) ? creep
   Call: (11) get_percepts([], [handempty, counter(1)], _L261) ? creep
   Exit: (11) get_percepts([], [handempty, counter(1)], []) ? creep
   Call: (11) lists:append([percept(planner, [handempty, counter(1)])], [], _L237) ? creep
   Exit: (11) lists:append([percept(planner, [handempty, counter(1)])], [], [percept(planner, [handempty, counter(1)])]) ? creep
   Exit: (10) get_percepts([planner], [handempty, counter(1)], [percept(planner, [handempty, counter(1)])]) ? creep
   Call: (10) get_actions([percept(planner, [handempty, counter(1)])], _L238) ? creep
   Call: (11) get_actions([], _L260) ? creep
   Exit: (11) get_actions([], []) ? creep
   Call: (11) arg(1, percept(planner, [handempty, counter(1)]), _L261) ? creep
   Exit: (11) arg(1, percept(planner, [handempty, counter(1)]), planner) ? creep
   Call: (11) arg(2, percept(planner, [handempty, counter(1)]), _L262) ? creep
   Exit: (11) arg(2, percept(planner, [handempty, counter(1)]), [handempty, counter(1)]) ? creep
   Call: (11) planner([handempty, counter(1)], _L263) ? creep
   Exit: (11) planner([handempty, counter(1)], noaction) ? creep
   Call: (11) lists:append([action(planner, noaction)], [], _L238) ? creep
   Exit: (11) lists:append([action(planner, noaction)], [], [action(planner, noaction)]) ? creep
   Exit: (10) get_actions([percept(planner, [handempty, counter(1)])], [action(planner, noaction)]) ? creep
   Call: (10) update_state([action(planner, noaction)], [handempty, counter(1)], _L239) ? creep
   Call: (11) update_counter([handempty, counter(1)], _L239) ? creep
   Call: (12) update_counter([counter(1)], _L278) ? creep
   Call: (13) update_counter([], _L301) ? creep
   Exit: (13) update_counter([], []) ? creep
   Call: (13) counter(_G573)=counter(1) ? creep
   Exit: (13) counter(1)=counter(1) ? creep
^  Call: (13) _L303 is 1-1 ? creep
^  Exit: (13) 0 is 1-1 ? creep
   Call: (13) lists:append([counter(0)], [], _L278) ? creep
   Exit: (13) lists:append([counter(0)], [], [counter(0)]) ? creep
   Exit: (12) update_counter([counter(1)], [counter(0)]) ? creep
   Call: (12) counter(_G586)=handempty ? creep
   Fail: (12) counter(_G586)=handempty ? creep
   Call: (12) lists:append([handempty], [counter(0)], _L239) ? creep
   Exit: (12) lists:append([handempty], [counter(0)], [handempty, counter(0)]) ? creep
   Exit: (11) update_counter([handempty, counter(1)], [handempty, counter(0)]) ? creep
   Exit: (10) update_state([action(planner, noaction)], [handempty, counter(1)], [handempty, counter(0)]) ? creep
   Call: (10) termination([handempty, counter(0)]) ? creep
   Call: (11) lists:member(counter(0), [handempty, counter(0)]) ? creep
   Exit: (11) lists:member(counter(0), [handempty, counter(0)]) ? creep
   Exit: (10) termination([handempty, counter(0)]) ? creep
   Call: (10) true ? creep
   Exit: (10) true ? creep
   Exit: (9) run_environment([handempty, counter(1)], update_state, [planner], termination) ? creep
   Exit: (8) run_environment([handempty, counter(2)], update_state, [planner], termination) ? creep
 T Exit: (7) test_environment
   Exit: (7) test_environment ? creep
true.
</pre>
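<p>The trace above repeats one cycle per counter value: collect percepts, ask each agent for an action, update the state, then test for termination. Here is a minimal Python sketch of that cycle. It is a reconstruction from the trace only, not the original Prolog source; the tuple encoding of <code>counter(N)</code> and the dictionary of agents are assumptions made for illustration.</p>

```python
def planner(percept):
    # In the trace the planner always answers `noaction`.
    return "noaction"

def get_percepts(agents, state):
    # Every agent perceives the full state, as in the trace.
    return [(name, list(state)) for name in agents]

def get_actions(percepts, agents):
    # Ask each agent (a plain function here) for its action.
    return [(name, agents[name](percept)) for name, percept in percepts]

def update_state(actions, state):
    # Only the counter changes: counter(N) becomes counter(N - 1).
    new_state = []
    for fact in state:
        if isinstance(fact, tuple) and fact[0] == "counter":
            new_state.append(("counter", fact[1] - 1))
        else:
            new_state.append(fact)
    return new_state

def termination(state):
    # Stop once counter(0) is a member of the state.
    return ("counter", 0) in state

def run_environment(state, update, agents, done):
    # Hypothetical loop form of the recursive run_environment/4 seen in the trace.
    history = [state]
    while True:
        percepts = get_percepts(agents, state)
        actions = get_actions(percepts, agents)
        state = update(actions, state)
        history.append(state)
        if done(state):
            return history

history = run_environment(["handempty", ("counter", 2)],
                          update_state, {"planner": planner}, termination)
print(history[-1])
```

<p>Starting from <code>counter(2)</code>, the loop runs twice and stops at <code>counter(0)</code>, matching the two update cycles visible in the trace.</p>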
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[New Music Search Engine Strays From the Herd]]></title>
<link>http://laurenrugani.wordpress.com/2009/11/16/herd_it/</link>
<pubDate>Mon, 16 Nov 2009 18:39:50 +0000</pubDate>
<dc:creator>Lauren Rugani</dc:creator>
<guid>http://laurenrugani.wordpress.com/2009/11/16/herd_it/</guid>
<description><![CDATA[Searching for new music can be a daunting task. Services like Apple’s iTunes or the website last.fm ]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p>Searching for new music can be a daunting task. Services like <a href="http://www.apple.com/itunes/overview/" target="_blank">Apple’s iTunes</a> or the website <a href="http://www.last.fm/" target="_blank">last.fm</a> make it relatively easy to find music similar to songs or artists you already listen to. But type in “instrumentals for yoga class” and you probably won’t get very far.</p>
<p><a href="http://cosmal.ucsd.edu/~lbarring/" target="_blank"> Luke Barrington</a>, a PhD student at the University of California San Diego, plans to change that with the beta version of his music search engine, <a href="http://herdit.org/music/" target="_blank">Herd It</a>, which he launched last week. His goal is to find and recommend music based on natural-language searches, providing users with both familiar and new songs that share acoustic qualities.</p>
<p>When he recently pitted his recommendation software against Apple’s music recommendation system, Genius, he found that users were equally satisfied with playlists suggested by both methods.</p>
<p>Genius is what Barrington labels a “metadata-based” system. The software associates music with information not necessarily related to the audio content, such as the name of the song, artist and album, and then averages statistics about how many users purchase and play each track. In other words, Genius is just telling you, “the people that listen to this song also listen to that song,” and suggests you do the same.</p>
<p>Barrington’s approach is more similar to the <a href="http://www.pandora.com/#/about" target="_blank">Pandora Radio</a> model, or what he calls a content-based system, which builds playlists with songs that have similar sounds. Humans assign semantic descriptors to each of Pandora’s more than one million songs based on genre, emotion, instruments or vocals. For example, typing “<a href="http://www.kingsofleon.com/" target="_blank">Kings of Leon</a>” into the search box produces a playlist featuring “electric rock instrumentation, a subtle use of vocal harmony, mild rhythmic syncopation, major key tonality and electric rhythm guitars.”</p>
<p>Herd It builds upon both Apple’s <a href="http://en.wikipedia.org/wiki/Crowdsourcing" target="_blank">crowdsourcing</a> mentality and Pandora’s natural-language song characterization, but incorporates machine learning to go beyond the capabilities of either system. Barrington created an algorithm to identify acoustic patterns in a song that predict a semantic tag. It then finds similar patterns in other songs and applies the tag automatically, eliminating the need to tag every song separately.</p>
<p>To this end, Barrington created a <a href="http://apps.facebook.com/herd-it/" target="_blank">game on Facebook</a> that allows users to ascribe qualities to thirty-second long clips of popular songs in a variety of genres. There are nearly 150 tags related to instrumentation, vocals, style, emotion, and even where you would prefer to listen to a song (relaxing at the beach, dancing at a party, while driving, etc.). When enough people independently agree on the same tag for a song, the algorithm learns to assign that tag to songs with similar acoustic patterns.</p>
<p>&#8220;Your definition of why a song is cool or why it goes well with another song may be quite different from mine,&#8221; Barrington said. &#8220;We&#8217;re hoping that the demographic information that we get from <span id="lw_1258392733_6" class="yshortcuts">Facebook</span> will help us to use Herd It data to learn demographic-specific representations of tags, like &#8216;teenage girls from San Diego think that this song rocks&#8217; or &#8216;middle-aged housewives from Europe would find this tune romantic&#8217;.&#8221; As the algorithm evolves, Herd It could one day provide personalized recommendations.</p>
<p>The machine gets smarter as more people play the game and as more music is available. Herd It&#8217;s current database is relatively small (about 10,000 songs) but Barrington hopes to partner with a major license holder for access to a more comprehensive collection. Since the algorithm is trained to read only acoustic qualities, it can be compatible with any music service. So while existing music recommendation engines build playlists around a specific song or artist, Herd It, in theory, should be able to create a playlist based on the query, “jazz trumpet melodies for a romantic dinner.”</p>
<p>There is even an option on Herd It for artists to upload their own music, which automatically receives relevant tags and is just as likely to appear on a playlist as any popular song with the same tags. That feature is not available through Genius, which requires that any new song first be listened to by a large number of users before it can join the collection.</p>
<p>Barrington&#8217;s ultimate music recommendation engine would fold aspects of Genius&#8217; <a href="http://en.wikipedia.org/wiki/Collaborative_filtering" target="_blank"><span id="lw_1258395589_5" class="yshortcuts" style="border-bottom:1px dashed #0066cc;cursor:pointer;">collaborative filtering</span></a> into Herd It: the huge volume of user ratings, plus information that is hard to extract from just listening to the audio (e.g., popularity, release date, artist similarity, etc.). Any new song could then be added immediately and recommended in the same context as older or more popular tunes. &#8220;Our system doesn&#8217;t know anything that the average music fan is aware of,&#8221; said Barrington. &#8220;Once we add that information in, we think we can build something that is really smarter than Genius.&#8221;</p>
<p><em>Play the Herd It game on Facebook at <a href="http://apps.facebook.com/herd-it/" target="_blank">http://apps.facebook.com/herd-it/</a> or request to try out the Herd It music discovery engine <a href="http://herdit.org/music/#" target="_blank">here</a>.<br />
</em></p>
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[Not quite according to plan]]></title>
<link>http://magsol.wordpress.com/2009/11/14/not-quite-according-to-plan/</link>
<pubDate>Sun, 15 Nov 2009 03:53:18 +0000</pubDate>
<dc:creator>magsol</dc:creator>
<guid>http://magsol.wordpress.com/2009/11/14/not-quite-according-to-plan/</guid>
<description><![CDATA[I&#8217;ve spent the better part of the last few hours reinstalling my Ubuntu virtual machine from s]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p>I&#8217;ve spent the better part of the last few hours reinstalling my Ubuntu virtual machine from scratch after completely botching my previous install&#8217;s configuration. How? I was attempting to get Python 2.5 up and running with a few custom packages, and ended up accidentally removing everything Python-related, which included many rather important system packages. Synaptic then froze up, and most of the basic system operations stopped working.</p>
<p>Freaking awesome. Just how I wanted to spend my Saturday evening.</p>
<p>But now that I have it working again, I wanted to delve into my latest project: <a href="http://www.magsolweb.net/wiki/index.php5?title=SpamBot" target="_blank">a semi-intelligent Twitterbot</a>! There are three core components to this project:</p>
<ol>
<li>Read the public Twitter timeline to accumulate posts.</li>
<li>Build a Markov Model out of all accumulated posts.</li>
<li>Use a cronjob to modulate the frequencies of #1 and #2.</li>
</ol>
<p>I&#8217;ve posted about <a href="http://magsol.wordpress.com/2009/03/05/you-wanted-nerdulance-geekery/" target="_blank">Hidden Markov Models before</a>, and this is an example of theory put into practice. Granted, the utility of this application is questionable, but if for no other reason, it sure is entertaining. In fact, since activating my Twitterbot a little over a week ago, it&#8217;s already garnered a decent response. Here are some of my favorites thus far:</p>
<p><img class="aligncenter size-full wp-image-530" title="lulz" src="http://magsol.wordpress.com/files/2009/11/lulz.png" alt="lulz" width="480" height="885" /></p>
<p>It&#8217;s endlessly amusing to me that so many people seem to think my bot is actually a person. At least a few also seem to be amused by its antics. Still others respond as though nothing is amiss. It&#8217;s also managed to flag down multinational Twitter users. And it&#8217;s even attracted the attention of other bots!</p>
<p>How does it work, ya say? Welllllll&#8230;</p>
<p>The underlying assumption of HMMs is that there is a hidden state that influences the output we actually observe. Within this context, it means we&#8217;re assuming there&#8217;s an unobservable pattern to the sentences Twitter users post that results in the actual words we can see. Thus, if we observe enough of these posts, we should, in theory, be able to infer those hidden states.</p>
<p>Yeah yeah, that wasn&#8217;t very simply put. Nevertheless, let&#8217;s move on.</p>
<p>The assumption my bot makes is pretty straightforward: each word that is observed depends only on the word before it. Put another way, this means that, given a single word, there is only a certain number of words that can come after it. Of this finite number of words that can come after it, some are much more likely than others.</p>
<p>This makes intuitive sense. Take any one of the sentences in this post, for example: after you read one word, you&#8217;re already expecting a certain word or number of words that could follow it (it&#8217;s how we read, in fact; ever heard that humans only actively read about 70% of the words on a  page? all the others are inferred by this same method). It&#8217;s basically a primitive form of contextual analysis.</p>
<p>From a technical standpoint, this dependence on only a single previous word is called a &#8220;first-order&#8221; Markov Model. HMMs can go to as high an order as you&#8217;d like. There is another similar Twitterbot built by a friend of mine which uses a &#8220;second-order&#8221; Markov Model, in that each word depends on the <em>two</em> previous words, resulting in a sentence that probably makes more sense than mine will. But for those of you who are ahead of me, this also means much more of the original posts used to build the HMM will show up in the generated posts.</p>
<p>And honestly, I wanted my bot&#8217;s posts to be as random as possible while still <em>kind</em> of making sense <img src='http://s.wordpress.com/wp-includes/images/smilies/icon_razz.gif' alt=':P' class='wp-smiley' />  Hence I purposefully implemented a first-order model.</p>
<p>My bot accumulates 800 posts from the public feed over 20 minutes, then uses those posts to build a first-order HMM. From that model, it then constructs a post by sampling from the model, and posts it to the Twitter account.</p>
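<p>The steps above can be sketched in a few lines of Python. This is a minimal sketch and not the bot&#8217;s actual code: the function names, the <code>^</code> and <code>$</code> boundary markers and the three sample posts are all hypothetical, and fetching from or posting to Twitter is left out. Strictly speaking it is a plain first-order Markov chain over the observed words, with no hidden states.</p>

```python
import random
from collections import defaultdict

START, END = "^", "$"  # hypothetical sentence-boundary markers

def build_model(posts):
    """Map each word to the list of words observed right after it."""
    model = defaultdict(list)
    for post in posts:
        words = [START] + post.split() + [END]
        for current, following in zip(words, words[1:]):
            model[current].append(following)
    return model

def generate(model, rng, max_words=30):
    """Walk the chain from START until END or the word limit is reached."""
    word, output = START, []
    while len(output) != max_words:
        # Followers are stored with repeats, so frequent ones are drawn more often.
        word = rng.choice(model[word])
        if word == END:
            break
        output.append(word)
    return " ".join(output)

posts = ["the cat sat", "the dog sat down", "a cat ran"]  # stand-in for collected posts
model = build_model(posts)
print(generate(model, random.Random()))
```

<p>Keying the table on the previous <em>two</em> words instead of one would give the second-order variant mentioned earlier, at the cost of reproducing longer runs of the source posts.</p>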
<p>If you&#8217;re interested in following the bot, you can find it <a href="http://twitter.com/waffleskatez0rz" target="_blank">here</a>.</p>
<p>I&#8217;m in the process of refining the current model, perhaps a hybrid first-second order HMM. I may also include some topics that are weighted more heavily than others, so the generated posts more accurately reflect the trending topics. And of course, I&#8217;m open to suggestions!</p>
<p>Yes, this bot provides a wonderful source of amusement, particularly since I am way over my head in schoolwork. Applying for jobs, applying to PhD programs, conducting research with Dr Murphy, and actually keeping up with my coursework are all proving very difficult to juggle these last few weeks of the semester. So it&#8217;s nice to have a new joke to read every 20 minutes!</p>
<p>I also want to mention that, a few days ago, my total hits on this blog surpassed 20,000. Thank you again to all those who seem to find something on this blog interesting <img src='http://s.wordpress.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>
<p><img class="aligncenter size-full wp-image-532" title="sports-pictures-denver-broncos-men-tights" src="http://magsol.wordpress.com/files/2009/11/sports-pictures-denver-broncos-men-tights1.jpg" alt="sports-pictures-denver-broncos-men-tights" width="497" height="327" /></p>
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[Regression Approaches]]></title>
<link>http://tr8dr.wordpress.com/2009/11/13/probabilistic-regressors-the-mean/</link>
<pubDate>Fri, 13 Nov 2009 20:14:43 +0000</pubDate>
<dc:creator>tr8dr</dc:creator>
<guid>http://tr8dr.wordpress.com/2009/11/13/probabilistic-regressors-the-mean/</guid>
<description><![CDATA[One of the readers of this blog (skauf) suggested looking into KRLS (Kernel Recursive Least Squares)]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p>One of the readers of this blog (skauf) suggested looking into KRLS (Kernel Recursive Least Squares), which is an &#8220;online&#8221; algorithm for multivariate regression closely related to SVM and Gaussian Processes approaches.    What struck me as clever in the KRLS algorithm is that it incorporates an online sparsification approach.  Why this is important, I will attempt to explain briefly in the next section.</p>
<p>The sparsification approach is similar in effect to PCA in that it reduces dimension and determines the most influential aspects of your system.  I had been thinking about a combined PCA / basis decomposition approach for some time, and it struck me that KRLS might be the way to do it.</p>
<p><strong>Kernel-based Learning Algorithms</strong><br />
KRLS and SVM algorithms use a common approach to classification (or regression).   In the simplest case, let us assume that we want to find some function f(x) that will classify n-dimensional vectors as belonging to one of 2 sets (apples or oranges), which we will distinguish by + and &#8211; values from f(x):</p>
<p style="padding-left:30px;"><img class="alignnone size-full wp-image-197" title="Picture 1" src="http://tr8dr.wordpress.com/files/2009/11/picture-16.png" alt="Picture 1" width="186" height="46" /></p>
<p>Assuming the vectors X are linearly separable, we could try to find a hyperplane on the n-dimensional space that would separate the vectors such that there is a balance of distance between the two categories (and one that is maximal subject to some constraints).   The distance between the hyperplane (or in 2 dimensions a line) and the points is determined by a <strong>margin</strong> function.</p>
<p><img class="alignnone size-full wp-image-198" title="Picture 2" src="http://tr8dr.wordpress.com/files/2009/11/picture-27.png" alt="Picture 2" width="323" height="267" /></p>
<p>Optimization is accomplished by maximizing the <strong>margins</strong> subject to constraints.  The optimization solves for the weights α  such that the sum of the weighted dot product of  X with each Xi yields the predicted value.  The form of f(x) is the following:</p>
<p style="padding-left:30px;"><img class="alignnone size-full wp-image-199" title="Picture 3" src="http://tr8dr.wordpress.com/files/2009/11/picture-36.png" alt="Picture 3" width="139" height="33" /></p>
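<p>As a rough illustration of that form, the sketch below evaluates a weighted sum of dot products and uses its sign as the class. It is only a sketch under stated assumptions: the support vectors and α values are made up rather than fitted, each α is taken to already carry the sign of its label, and the optional <code>bias</code> term is an addition that may not appear in the formula pictured above.</p>

```python
def dot(u, v):
    """Plain inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def predict(x, support_vectors, alphas, bias=0.0):
    """f(x) = sum_i alpha_i * (x_i . x) + bias; the sign of f(x) is the class."""
    return sum(a * dot(xi, x) for a, xi in zip(alphas, support_vectors)) + bias

# Hypothetical, hand-picked values (not the result of any optimization).
support_vectors = [(2.0, 2.0), (-2.0, -2.0)]
alphas = [0.25, -0.25]

print(predict((3.0, 1.0), support_vectors, alphas))    # 4.0, the "+" side
print(predict((-1.0, -3.0), support_vectors, alphas))  # -4.0, the "-" side
```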
<p>Now the above graph shows a very well-behaved data set, with a clear boundary for classification.  Many data sets, however, will have significant noise in the data and/or outliers that refuse to be categorized correctly.   This data has an overall impact on the regression line.</p>
<p>Suppose we have N samples in our training set {{X1,Y1}, {X2,Y2}, &#8230; {XN, YN}}.   In simple terms, sparsification is the process of removing (or discounting) samples that are already represented in the existing set, reducing the bias of the regressor.   One approach to determining the degree of representation is to observe the &#8220;degree of orthogonality&#8221; represented by a given vector with respect to the existing set.   Sparsification also points to an approach to evolving the regressor over time relative to new observations.</p>
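<p>One way to make the &#8220;degree of orthogonality&#8221; test concrete is sketched below. It is a simplification and not the KRLS algorithm itself: it works directly on the input vectors (in effect a linear kernel), whereas KRLS applies the analogous test in feature space through the kernel. The function name and the threshold value are assumptions.</p>

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sparsify(samples, threshold=1e-6):
    """Keep only samples that are not (nearly) spanned by those already kept."""
    dictionary, basis = [], []   # kept samples, orthonormal basis of their span
    for x in samples:
        residual = list(x)
        for q in basis:          # subtract the part that is already represented
            coeff = dot(residual, q)
            residual = [r - coeff * qi for r, qi in zip(residual, q)]
        novelty = dot(residual, residual)   # squared distance from x to the span
        if novelty > threshold:
            norm = novelty ** 0.5
            basis.append([r / norm for r in residual])
            dictionary.append(x)
    return dictionary

samples = [(1, 0, 0), (2, 0, 0), (0, 1, 0), (1, 1, 0), (0, 0, 3)]
print(sparsify(samples))   # the 2nd and 4th samples add nothing new and are dropped
```

<p>Raising the threshold discards more samples and yields a smaller dictionary; as new observations arrive, the same test decides whether each one is worth keeping.</p>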
<p><strong>The Kernel<br />
<span style="font-weight:normal;">The section above gave a brief overview of how SVM uses the notion of margin to classify data.  A kernel is not necessary for data that is linearly separable.   The kernel is &#8220;simply&#8221; a function that maps data from the &#8220;attribute&#8221; space to &#8220;feature&#8221; space.   We design or choose the kernel function so that our data is largely linearly separable in the &#8220;feature&#8221; space and with the constraint that the covariance (Gram) matrix of any set of vectors mapped from attribute to feature space is positive semi-definite.</span></strong></p>
<p><strong><span style="font-weight:normal;">Since our linear SVM equations are expressed in terms of inner products, given a feature mapping function Φ(x) mapping X from non-linear space to &#8220;linearly separable space&#8221;, we can express the kernel as a function of the inner product of two vectors, later to plug in to our linear equations:</span></strong></p>
<p style="padding-left:30px;"><strong><span style="font-weight:normal;"><img class="alignnone size-full wp-image-201" title="Picture 5" src="http://tr8dr.wordpress.com/files/2009/11/picture-53.png" alt="Picture 5" width="250" height="24" /></span></strong></p>
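<p>A small numeric check may help here. Assuming the relation pictured above is K(x, y) = ⟨Φ(x), Φ(y)⟩, the sketch below uses a textbook example that is not taken from this post: the quadratic kernel K(x, y) = (x · y)² on 2-dimensional vectors, whose explicit feature map is Φ(x) = (x1², √2·x1·x2, x2²). Both routes give the same number, but the kernel never builds the feature vectors.</p>

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def phi(x):
    """Explicit feature map for the quadratic kernel on 2-d inputs."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)

def kernel(x, y):
    """Same inner product, computed without ever forming phi(x) or phi(y)."""
    return dot(x, y) ** 2

x, y = (1.0, 2.0), (3.0, -1.0)
print(kernel(x, y))          # (1*3 + 2*(-1))^2 = 1.0
print(dot(phi(x), phi(y)))   # agrees up to floating-point rounding
```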
<p><strong><span style="font-weight:normal;">The trick is to choose a kernel function that maximizes dispersion of the data into sets that are linearly separable.</span></strong></p>
<p><strong><span style="font-weight:normal;"><strong>How does this relate to an explicit probability based regression?</strong><br />
It turns out that the process of optimizing the margin function on a Gaussian kernel  is equivalent to finding the unnormalized maximum likelihood, whereas the Gaussian Process approach makes this explicit.</span></strong></p>
<p><strong><span style="font-weight:normal;"><strong>Which model does one choose?<br />
</strong>The Gaussian Process approach is strictly Bayesian.   It has the upside of providing explicit probabilities / confidence measures on predictions.  If one knows the likelihoods of the desired labels with respect to the attribute vectors, the model works very well and provides more information than the SVM family.</span></strong></p>
<p><strong><span style="font-weight:normal;">The SVM family of approaches uses a margin function to determine the similarity between vectors (for the purpose of classification).   This does not explicitly involve probabilities, but for the Gaussian kernel can be shown to be equivalent to the Gaussian Process approach.   The SVM family has the upside that one does not need to know the likelihoods.</span></strong></p>
<p><strong><span style="font-weight:normal;">Finally, both models allow the use of kernels to map data from a non-normal or non-linear space to a linear or Gaussian space.   Some have shown that there is a degree of equivalence between the likelihood function in the Gaussian Process algorithm and the kernel function used in SVM. </span></strong></p>
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[A Brief History of Machine Learning (2)]]></title>
<link>http://someoneatsomewhere.wordpress.com/2009/11/11/a-brief-history-of-machine-learning-2/</link>
<pubDate>Wed, 11 Nov 2009 17:18:30 +0000</pubDate>
<dc:creator>someoneatsomewhere</dc:creator>
<guid>http://someoneatsomewhere.wordpress.com/2009/11/11/a-brief-history-of-machine-learning-2/</guid>
<description><![CDATA[V. Vapnik has created a new field of statistical learning theory to guarantee the accuracy of Suppo]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p>V. Vapnik created a new field, <strong>statistical learning theory</strong>, to guarantee the accuracy of Support Vector Machines (SVMs) on future data (their generalization ability), i.e. he gave proofs of an error bound for SVMs. His proofs are in fact generalizations of the classical laws of large numbers. Note that the classical laws of large numbers can be employed to predict the generalization ability of a finite set of predictors from their statistics on training data, but not of a continuum set of predictors. The latter situation is common in the &#8220;<strong>learning the best predictor</strong>&#8221; tasks of machine learning and pattern recognition. The new laws of large numbers for some common continuum sets allow us to solve these prediction tasks and hence advance both machine learning and statistics to another level.</p>
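<p>To make the finite case concrete, here is the standard bound obtained from Hoeffding&#8217;s inequality and a union bound. The notation is an assumption introduced for this illustration: N predictors, a loss bounded in [0, 1], n independent training samples, true risk R(f) and training-set average R̂<sub>n</sub>(f).</p>

```latex
% Finite class \{f_1,\dots,f_N\}, loss bounded in [0,1], n i.i.d. training samples.
% R(f) is the true risk and \hat{R}_n(f) the average loss on the training data.
\[
  \Pr\Big( \max_{1 \le k \le N} \big| \hat{R}_n(f_k) - R(f_k) \big| > \varepsilon \Big)
  \;\le\; 2N \exp\!\big(-2 n \varepsilon^2\big)
\]
% Equivalently, with probability at least 1 - \delta, for every k:
\[
  R(f_k) \;\le\; \hat{R}_n(f_k) + \sqrt{\frac{\ln N + \ln(2/\delta)}{2n}}
\]
```

<p>The right-hand side grows with ln N, so the bound says nothing once the set of predictors is infinite; that is the gap the new laws of large numbers are meant to close.</p>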
<p>Classification and regression are prime examples of this learning problem. I want to note that regression, one of the most important and common problems in human history, is indeed advanced by this result: since the time of Euler, Laplace and Gauss, when least-squares algorithms were invented, we had had no guarantee on the generalization ability of any regression predictor at all.</p>
<p>Consequently, a strong theoretical link between machine learning and statistics has been created under the name of statistical learning theory. Recent publications on this topic can be found via the <em>Journal of Machine Learning Research</em> (JMLR) and the <em>Computational Learning Theory</em> (COLT) conference (see note 3).</p>
<p><strong>NOTE</strong></p>
<p>1. Readers interested in an error bound for SVMs can look at R. Schapire&#8217;s excellent class notes: <a href="http://www.cs.princeton.edu/courses/archive/spring08/cos511/schedule.html">http://www.cs.princeton.edu/courses/archive/spring08/cos511/schedule.html</a></p>
<p>2. There are other prominent researchers who have also made connections between machine learning and statistics. For example, J. Friedman and L. Breiman invented, among others, a regression tree algorithm named CART, and more recently ensemble methods such as <strong>bagging</strong> and <strong>boosting</strong>. T. M. Cover, P. E. Hart and L. Devroye established the generalization ability of classifiers such as <strong>nearest neighbors</strong>, and L. Valiant introduced the <strong>PAC learning framework</strong>.</p>
<p>3. Actually, in the current situation, computer scientists seem to be more active in learning theory than statisticians, and hence they have merged the PAC framework with statistical learning theory. We then have the COLT community as a result.</p>
<p>4. Useful Links</p>
<p>JMLR: http://jmlr.csail.mit.edu/<br />
COLT: http://www.learningtheory.org/</p>
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[A Brief History of Machine Learning (1)]]></title>
<link>http://someoneatsomewhere.wordpress.com/2009/11/11/what-is-machine-learning-history-1/</link>
<pubDate>Wed, 11 Nov 2009 07:18:00 +0000</pubDate>
<dc:creator>someoneatsomewhere</dc:creator>
<guid>http://someoneatsomewhere.wordpress.com/2009/11/11/what-is-machine-learning-history-1/</guid>
<description><![CDATA[In my opinion, a good way to describe today&#8217;s machine learning is to say that it is a &#8220;modern  s]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p>In my opinion, a good way to describe today&#8217;s machine learning is to say that it is &#8220;<strong>modern statistics</strong>&#8221;. Although this is not an accurate definition, it can give a good first impression to those outside the field. In the first few articles, I try to explain the historical connection between the two research fields.</p>
<p>According to the contents of Tom Mitchell&#8217;s <em>Machine Learning</em> book, machine learning (ML) originally belonged to computer science, and its emphasis on symbolic learning/processing makes ML different from statistics. Examples of popular symbolic learners are decision trees, version spaces, and inductive logic programming.</p>
<p><em><strong>Perceptron</strong></em>, the ancestor of the famous neural networks, was also invented by ML researchers. Although the perceptron was designed to attack one classical problem of statistics, namely (linear) classification, its original goal does not involve statistical ideas. That goal is: if the input data is <em>linearly separable</em>, the perceptron is guaranteed to find a linear separator; however, the perceptron is not concerned with <strong>&#8220;which separator is the best for unseen future data&#8221;</strong>. It was Vladimir Vapnik and colleagues who later, in the 1990s, presented a way to select the best separator, the legendary <strong>support vector machines (SVMs)</strong>, by using a rigorous statistical idea, and ML has been connected to statistics ever since.</p>
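<p>For readers who have not seen it, the perceptron rule fits in a few lines. The sketch below is a minimal textbook version, with a made-up data set and a hypothetical <code>max_epochs</code> cut-off. It stops at the first separator it finds, which is exactly the behaviour described above: it gives some separator, not the best one.</p>

```python
def train_perceptron(samples, labels, max_epochs=100):
    """Return (weights, bias) separating the data, or None if none was found."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in zip(samples, labels):          # each label y is +1 or -1
            activation = sum(w * xi for w, xi in zip(weights, x)) + bias
            if y * activation > 0:
                continue                           # already on the correct side
            # Misclassified (or on the boundary): nudge the separator toward x.
            weights = [w + y * xi for w, xi in zip(weights, x)]
            bias += y
            mistakes += 1
        if mistakes == 0:
            return weights, bias                   # a full pass with no errors
    return None

samples = [(2, 1), (1, 3), (-1, -2), (-3, -1)]     # hypothetical toy data
labels = [1, 1, -1, -1]
print(train_perceptron(samples, labels))           # ([2.0, 1.0], 1.0)
```

<p>On data that is not linearly separable (XOR is the classic case) the loop never completes an error-free pass, and this sketch gives up after <code>max_epochs</code> and returns <code>None</code>.</p>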
<p>SVM has dramatically impacted the whole field of machine learning. Various ideas from SVM have established their own areas, which are new and big branches of ML. Popular examples that were extremely active in the 2000s include &#8220;large-margin predictors&#8221;, &#8220;kernel machines&#8221;, &#8220;convex-programming learners&#8221; and &#8220;statistical learning theory&#8221;. Because of these newcomers, we may say that classical ML research on symbolic learning has gradually died out: in the best ML conferences and journals, such as JMLR, ICML and NIPS, we can hardly see any papers on symbolic learning at all.</p>
<p>In fact, Bayesian probability and statistics also have an important place in ML, originating from Bayesian causal networks and Bayesian neural networks in the 1990s. I will postpone this to the next article so that this article will not be too long.</p>
<p><strong>NOTES</strong></p>
<p>1. In this blog, I may use machine learning, pattern recognition, pattern analysis and data mining interchangeably, since nowadays it is difficult to draw a sharp boundary among them. Note that I am not an expert, so my explanation here can be inaccurate to some degree.</p>
<p>2. Note that, unfortunately, there is no exact definition of statistics, as statisticians are divided into many groups depending on what they believe about the philosophy of induction. For example, there are Bayesian statistics, frequentist statistics, decision-theoretic statistics and measure-theoretic statistics. This will be an interesting topic to discuss later, but as most people are not aware of these severe divisions within statistics anyway, I will just say &#8220;statistics&#8221; here.</p>
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[Another open source for CRF]]></title>
<link>http://mmannot.wordpress.com/2009/11/09/another-open-source-for-crf/</link>
<pubDate>Mon, 09 Nov 2009 10:04:23 +0000</pubDate>
<dc:creator>Ali Reza Ebadat</dc:creator>
<guid>http://mmannot.wordpress.com/2009/11/09/another-open-source-for-crf/</guid>
<description><![CDATA[Main web page : http://mallet.cs.umass.edu/ http://mallet.cs.umass.edu/sequences.php]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p>Main web page : <a href="http://mallet.cs.umass.edu/">http://mallet.cs.umass.edu/</a></p>
<p><a title=" A sample" href="http://mallet.cs.umass.edu/sequences.php" target="_self"> http://mallet.cs.umass.edu/sequences.php</a></p>
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[The Design (Purpose) of The Internet?]]></title>
<link>http://imonad.wordpress.com/2009/11/08/the-design-purpose-of-the-internet/</link>
<pubDate>Sun, 08 Nov 2009 23:59:03 +0000</pubDate>
<dc:creator>JohnBrian</dc:creator>
<guid>http://imonad.wordpress.com/2009/11/08/the-design-purpose-of-the-internet/</guid>
<description><![CDATA[I found this (&#8230;while looking for a semi-gift-resonant way to make some money) and had to add m]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p>I found this (&#8230;while looking for a semi-gift-resonant way to make some money) and had to add my $0.02.</p>
<p><a title="New window will open" href="http://www.linkedin.com/redirect?url=http%3A%2F%2Fbaris%2Etypepad%2Ecom%2Fventure_capitalist%2F2009%2F10%2Fthe-internet-was-designed-for-the-pc%2Ehtml&#38;urlhash=RBNW" target="_blank">http://baris.typepad.com/venture_capitalist/2009/10/the-internet-was-designed-for-the-pc.html</a></p>
<p>My Comment:</p>
<p>&#8220;I live’n’breathe in the &#8220;why” and the “design/purpose&#8221; and the technology space – I love it here.</p>
<p>My work and writing speak to the essential and magical pragmatism (yes, pragmatics can be magical, ask William James) of the “why” of the internet as a function of an innate and unstoppable, although seemingly detour prone, process of individual, innate and gift-based human evolution… and whole human-systems gift-based or gift-centric social evolution, and more specifically, the evolution of 1) an individual and naturally disposed and gifted human mind, and 2) all of these human minds doing that same thing.</p>
<p>… and a “why the internet exists” design/purpose founded on the innate NEED of all these evolving and inherently gift-driven-at-the-core human minds finding fuller and better and more synergy-rich “connecting” along the way.</p>
<p>My innately disposed preference is to think of the internet as simply a mind-mirror and interactive functional projection and manifestation thereof, the human mind. And over time, guess what, it actually looks more and more like a human cortex, this felt-work of connective tissue, of fibrils, like glial cells and dendritic networks, enmeshed and encapsulating directly, and indirectly and over time, all minds on the planet. And these minds “connected” &#8211; consciously, most not &#8211; are trying really, really hard &#8211; as it is what they are designed to do… is their ultimate and pre-ordained purpose &#8211; to be all they can be&#8230; to be innately and naturally gift-driven and along the way, trying very, very hard to more effectively resonate (socnet) and connect (socnet) to optimizing “partners” along the way, and over generations of our humanity.</p>
<p>It is this fundamental process and direction of human evolving that feeds the evolution of da’Net itSelf, and reciprocally, the further evolution of the natural and innately disposed and gifted human Self.</p>
<p>Nice that the internet helps in all this, and was mirror-designed in fact, for that express purpose, whether we see it as such or not, i.e. its purpose within and without and pertaining to the unknown.</p>
<p>Okay, done spewing.</p>
<p>Thanks<br />
Brian&#8221;</p>
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[Machine Learning summer school 2009 slides and  videos]]></title>
<link>http://mmannot.wordpress.com/2009/11/07/machine-learning-summer-school-2009-slides-and-videos/</link>
<pubDate>Fri, 06 Nov 2009 23:09:38 +0000</pubDate>
<dc:creator>Ali Reza Ebadat</dc:creator>
<guid>http://mmannot.wordpress.com/2009/11/07/machine-learning-summer-school-2009-slides-and-videos/</guid>
<description><![CDATA[http://videolectures.net/mlss09uk_cambridge/﻿]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p><a href="http://videolectures.net/mlss09uk_cambridge/" target="_blank">http://videolectures.net/mlss09uk_cambridge/﻿</a></p>
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[Approximation Algorithms for the Large-Scale Application of Physiognomy]]></title>
<link>http://jnormativept.wordpress.com/2009/11/02/approximation-algorithms-for-the-large-scale-application-of-physiognomy/</link>
<pubDate>Mon, 02 Nov 2009 04:30:29 +0000</pubDate>
<dc:creator>aif</dc:creator>
<guid>http://jnormativept.wordpress.com/2009/11/02/approximation-algorithms-for-the-large-scale-application-of-physiognomy/</guid>
<description><![CDATA[Approximation Algorithms for the Large-Scale Application of Physiognomy A plate of facial-character ]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p><strong>Approximation Algorithms for the Large-Scale Application of Physiognomy</strong></p>
<div id="attachment_31" class="wp-caption aligncenter" style="width: 316px"><img src="http://jnormativept.wordpress.com/files/2009/11/picture-1.png" alt="Physiognomy" title="Physiognomy" width="306" height="539" class="size-full wp-image-31" /><p class="wp-caption-text">A plate of facial-character templates from Johann Caspar Lavater's 1826 treatise 'Physiognomy.'</p></div>
<p><strong>Abstract</strong></p>
<blockquote><p>Over the past 15 years, advances in computer vision have enabled impressive progress in computer-based security technology and vastly improved human security. Computers can now identify weapons in images and video, track individuals through a series of scenes in order to predict their potentially nefarious aims, and collate individuals in live video against suspects in a massive database. Though there is no doubt that we are safer as a result of these advances, they fall short of providing true security. While identifying individuals in video is helpful, one still requires a database of suspects, which can be inaccurate or out of date. Wouldn&#8217;t it be simpler and more effective to dispense with the database entirely and inspect the character of the individual by the form of their face and skull? Thanks to recent advances in machine learning and computer vision technologies, this is now achievable. This paper presents GNOMON, a system for the application of the ancient science of physiognomy to live video feeds.</p></blockquote>
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[Decision Tree Learning]]></title>
<link>http://tr8dr.wordpress.com/2009/11/01/decision-tree-learning/</link>
<pubDate>Mon, 02 Nov 2009 01:33:45 +0000</pubDate>
<dc:creator>tr8dr</dc:creator>
<guid>http://tr8dr.wordpress.com/2009/11/01/decision-tree-learning/</guid>
<description><![CDATA[Some time ago began using bayesian networks (decision trees with conditional relationships) to boost]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p>Some time ago I began using Bayesian networks (decision trees with conditional relationships) to boost signal for strategies, the idea being that related observations can be combined conditionally to provide a posterior conclusion with a higher degree of confidence.</p>
<p>In the context of trading, our Bayesian network tells us:</p>
<ol>
<li>given the coincident set of events should we trade and if so, in what direction</li>
<li>with what degree of confidence</li>
</ol>
<p>My approach up until now had been to carefully construct the relationships by hand.  It requires painstaking research and a lot of trial and error.   Why not use an algorithm to find and assemble the relationships and classify the data?</p>
<p><strong>General Approaches</strong><br />
The more general decision tree algorithms allow me to present many factors (even with high dimension), determine which factors have the highest degree of information relative to our classification targets (buy, sell, don&#8217;t-trade), and formulate a classification or regression that models this.</p>
<p>Key to the assembly of the tree are measures taken from information theory.  We want to arrange the tree so that the &#8220;entropy&#8221; of the class labels after each split is minimized or, in other words, the information gained is maximized.   This is often calculated with a discrete form of the <a href="http://en.wikipedia.org/wiki/Kullback–Leibler_divergence">Kullback-Leibler divergence</a>.</p>
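<p>As a rough illustration of the entropy criterion, the sketch below computes Shannon entropy and the information gain of a candidate split. The function names and the toy buy/sell labels are hypothetical and only stand in for real factors; this is not the code used for the strategies.</p>

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(labels, groups):
    """Entropy of `labels` minus the size-weighted entropy of the split `groups`."""
    total = len(labels)
    remainder = sum(len(g) / total * entropy(g) for g in groups)
    return entropy(labels) - remainder

# Hypothetical toy targets, split by two candidate binary factors.
labels = ["buy", "buy", "sell", "sell"]
perfect_split = [["buy", "buy"], ["sell", "sell"]]  # factor separates the classes
useless_split = [["buy", "sell"], ["buy", "sell"]]  # factor tells us nothing

print(information_gain(labels, perfect_split))
print(information_gain(labels, useless_split))
```

<p>A tree builder would evaluate every candidate factor this way and split on the one with the largest gain.</p>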
<p><span style="font-weight:normal;font-size:13px;"><strong>Algorithm</strong><br />
In particular the &#8220;<a href="http://en.wikipedia.org/wiki/Random_forest">Random Forest</a>&#8221; approach is quite appealing.   Multiple decision trees are constructed against training subsets drawn randomly from a larger training set.   These ultimately produce many variations on the &#8220;true&#8221; model.   The approximation to the true model is made by taking the mode (the majority vote, which plays a role similar to a robust mean / expectation) across the random trees.</span></p>
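<p>The resample-and-vote idea can be sketched in a few lines. To keep it short, each &#8220;tree&#8221; here is only a one-threshold stump on a single hypothetical signal, and the random feature selection that a full Random Forest also performs is left out; the data and names are made up for illustration.</p>

```python
import random
from collections import Counter

def bootstrap(data, rng):
    """Draw len(data) examples from `data` with replacement."""
    return [rng.choice(data) for _ in data]

def train_stump(sample):
    """Stand-in for a tree: pick the threshold with the fewest errors on `sample`."""
    def errors(threshold):
        return sum((x >= threshold) != (label == "buy") for x, label in sample)
    return min((x for x, _ in sample), key=errors)

def forest_predict(thresholds, x):
    """Each stump votes; the forest returns the mode of the votes."""
    votes = ["buy" if x >= t else "sell" for t in thresholds]
    return Counter(votes).most_common(1)[0][0]

# Hypothetical one-factor training set: (signal value, label)
data = [(0.1, "sell"), (0.2, "sell"), (0.3, "sell"),
        (0.7, "buy"), (0.8, "buy"), (0.9, "buy")]

rng = random.Random(0)  # fixed seed so the run is repeatable
thresholds = [train_stump(bootstrap(data, rng)) for _ in range(25)]
print(forest_predict(thresholds, 0.95), forest_predict(thresholds, 0.05))
```

<p>Individual stumps disagree about where the boundary lies because each sees a different resample; the vote smooths those disagreements out.</p>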
<p>Taking advantage of the classification ability of the algorithm is going to allow me to try many new inputs in my strategies without the huge research overhead and without the near-intractable multivariate optimization required in other approaches.   I&#8217;ll post some results on this soon.</p>
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[Progress Report - October 2009]]></title>
<link>http://aiguy.wordpress.com/2009/11/01/progress-report-october-2009/</link>
<pubDate>Sun, 01 Nov 2009 16:31:54 +0000</pubDate>
<dc:creator>aiguy</dc:creator>
<guid>http://aiguy.wordpress.com/2009/11/01/progress-report-october-2009/</guid>
<description><![CDATA[In this month, I continued my studies in Relational Reinforcement Learning by reviewing the article ]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p>This month, I continued my studies in Relational Reinforcement Learning by reviewing the article Towards Informed Reinforcement Learning from the proceedings of the 2004 Machine Learning workshop on Relational Reinforcement Learning.   Basically, the article argues that an agent with limited information about its environment can still find an optimal policy and achieve a goal state.  The experiments reported seem to suggest this type of exploration is possible.  According to a Google Scholar search, there are 11 subsequent articles that reference this one.  In the RRL arena, my goal is to repeat the blocks world experiment as reported in the Relational Reinforcement Learning article by Dzeroski, De Raedt, and Blockeel.</p>
<p><!--more-->Also, I have been focused on Robert Winkler&#8217;s book <em>An Introduction to Bayesian Analysis and Inference</em>.  From the Winkler book, my goal is to gain a better understanding of Bayesian inference in order to better understand the decision-theoretic models of machine learning.  In the machine learning world, Bayesian approaches are yielding interesting solutions to various problems in Multiagent Reinforcement Learning.</p>
<p>There are some interesting articles in the Journal of Bayesian Analysis and in JAIR.</p>
<p>On another topic of interest, I am continuing my reintroduction to LISP by reading the Patrick Winston and Berthold Horn classic.</p>
<p>From a statistical point of view, this month achieved a new record with 806 hits, boosting my total for the current year to over 3200 hits.  The hottest pages are my About page, followed by Wumpus World and Wumpus World Revisited.</p>
<p>Thank you for your interest and support.</p>
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[SVM - A Brief Overview]]></title>
<link>http://hoanggiavu.wordpress.com/2009/10/30/svm-s%c6%a1-l%c6%b0%e1%bb%a3c/</link>
<pubDate>Fri, 30 Oct 2009 17:21:50 +0000</pubDate>
<dc:creator>hoanggiavu</dc:creator>
<guid>http://hoanggiavu.wordpress.com/2009/10/30/svm-s%c6%a1-l%c6%b0%e1%bb%a3c/</guid>
<description><![CDATA[Summary of SVM: 1. Large margin. The main idea of SVM is to find a maximum margin hyperplane. Suppose hy]]></description>
<content:encoded><![CDATA[Summary of SVM: 1. Large margin. The main idea of SVM is to find a maximum margin hyperplane. Suppose hy]]></content:encoded>
</item>
<item>
<title><![CDATA[Keynote by Tom Mitchell at ISWC 2009]]></title>
<link>http://streamtracker.wordpress.com/2009/10/28/keynote-by-tom-mitchell-at-iswc-2009/</link>
<pubDate>Wed, 28 Oct 2009 13:58:34 +0000</pubDate>
<dc:creator>smitashree</dc:creator>
<guid>http://streamtracker.wordpress.com/2009/10/28/keynote-by-tom-mitchell-at-iswc-2009/</guid>
<description><![CDATA[keynote by Tom Mitchell about populating semantic web using machine Learning Points made: manually e]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p>Keynote by Tom Mitchell about populating the <a class="zem_slink" title="Semantic Web" rel="wikipedia" href="http://en.wikipedia.org/wiki/Semantic_Web">semantic web</a> using <a class="zem_slink" title="Machine learning" rel="wikipedia" href="http://en.wikipedia.org/wiki/Machine_learning">machine learning</a></p>
<p>Points made:</p>
<p><strong>manual entry by users<br />
convert from Database<br />
computer reading of unstructured data:</strong></p>
<p><strong>         use an initial <a class="zem_slink" title="Ontology" rel="wikipedia" href="http://en.wikipedia.org/wiki/Ontology">ontology</a><br />
         leverage redundancy available on the web<br />
        try to target the ontology to populate<br />
        semi-supervised learning</strong> : couples semi-supervised learning (bootstrap learning) for category and relation    extraction.<br />
<strong>        use seed examples from Freebase or <a class="zem_slink" title="DBpedia" rel="homepage" href="http://dbpedia.org/">dbpedia</a>.</strong></p>
<p><strong>This brings a question to mind: can we populate the existing <a class="zem_slink" title="Large Scale Concept Ontology for Multimedia" rel="wikipedia" href="http://en.wikipedia.org/wiki/Large_Scale_Concept_Ontology_for_Multimedia">LSCOM</a> ontology from YouTube data?  This could be a good candidate for the next paper. </strong></p>
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[Hyperclimbing, Genetic Algorithms, and Machine Learning ]]></title>
<link>http://blog.hackingevolution.net/2009/10/27/hyperclimbing/</link>
<pubDate>Tue, 27 Oct 2009 12:59:34 +0000</pubDate>
<dc:creator>Keki</dc:creator>
<guid>http://blog.hackingevolution.net/2009/10/27/hyperclimbing/</guid>
<description><![CDATA[I&#8217;ve identified a promising stochastic search heuristic called hyperclimbing for optimizing ov]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p>I&#8217;ve identified a promising stochastic search heuristic called <em>hyperclimbing</em> for optimizing over discrete product spaces (e.g. the space of binary strings of some fixed length) with noisy objective functions. Hyperclimbing works by recursively sifting through large numbers of partitions of the search space and by identifying ones with variegated expected objective values. Because hyperclimbing is sensitive, not to the local features of a search space, but to certain more global statistics, it is not susceptible to the kinds of issues that waylay local search heuristics. The only visible barrier to the wide and enthusiastic use of hyperclimbing is that it seems to scale poorly with the size of the search space; when one heeds the seemingly high cost of applying hyperclimbing to large search spaces, this heuristic loses its shine. A key conclusion of my doctoral work is that this seemingly high cost is illusory. I have uncovered evidence that strongly suggests that genetic algorithms can implement hyperclimbing extraordinarily efficiently.</p>
<p>As readers of this blog probably know, genetic algorithms are search algorithms that mimic natural evolution. These algorithms have been used in a wide range of engineering and scientific fields to quickly procure useful solutions to poorly understood (i.e. black-box) optimization problems. Unfortunately, despite the routine use of genetic algorithms for over three decades, their adaptive capacity has not been adequately accounted for. Given the evidence that genetic algorithms can implement efficient hyperclimbing, I&#8217;ve proposed a new explanation for the adaptive capacity of these algorithms. This new account&#8212;<a href="http://cs.brandeis.edu/~kekib/dissertation.html">the generative fixation hypothesis</a>&#8212;promises to spark significant advances in the fields of genetic algorithmics and discrete optimization.</p>
<p>The discovery that hyperclimbing is efficiently implementable also promises to have a non-negligible impact on the ecology of machine learning research. Optimization and machine learning are, after all, intimately related. The practice of machine learning research can broadly be characterized as the effective reduction of difficult learning problems to optimization problems for which efficient algorithms exist. In other words, the machine learning problems that can effectively be tackled are, in large part, those that can effectively be reduced to optimization problems that can be tackled efficiently. Currently, this largely limits the class of tractable machine learning problems to the class of learning problems that can effectively be reduced to <em>convex</em> optimization problems [1].  The identification of general-purpose non-convex optimization heuristics with efficient implementations (e.g. hyperclimbing), thus, has the potential to greatly extend the reach of machine learning.</p>
<p>For a description of hyperclimbing, and evidence that genetic algorithms can implement this heuristic efficiently, please see my <a href="http://cs.brandeis.edu/~kekib/dissertation.html">dissertation</a>.</p>
<p>[1]  Kristin P. Bennett and Emilio Parrado-Hernandez. <a href="http://jmlr.csail.mit.edu/papers/volume7/MLOPT-intro06a/MLOPT-intro06a.pdf">The interplay of optimization and machine  learning research</a>. Journal of Machine Learning Research, 7:1265–1281, 2006.</p>
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[Union Bound, Hoeffding Inequality and Some Bounds in Learning Theory - Part I]]></title>
<link>http://onionesquereality.wordpress.com/2009/10/27/union-bound-hoeffding-inequality-and-some-bounds-in-learning-theory/</link>
<pubDate>Tue, 27 Oct 2009 10:15:07 +0000</pubDate>
<dc:creator>Shubhendu Trivedi</dc:creator>
<guid>http://onionesquereality.wordpress.com/2009/10/27/union-bound-hoeffding-inequality-and-some-bounds-in-learning-theory/</guid>
<description><![CDATA[Well again parts of my notes (modified suitably to be blog posts) for a discussion session! This pos]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p style="text-align:justify;">Well again parts of my notes (modified suitably to be blog posts) for a discussion session!</p>
<p style="text-align:justify;">This post would be the <strong>first in a series of four posts.</strong> The objective of each post would be as follows:</p>
<p style="text-align:justify;"><strong>1.</strong> This post would introduce <a href="http://en.wikipedia.org/wiki/Computational_learning_theory" target="_blank">Learning Theory</a>, the bias-variance trade-off and sum up the need of learning theory.</p>
<p style="text-align:justify;"><strong>2.</strong> This would discuss two simple lemmas : The Union Bound and the <a href="http://cnx.org/content/m16264/latest/" target="_blank">Hoeffding inequality</a> and then use them to get to some very deep results in learning theory. It would also introduce and discuss <a href="http://en.wikipedia.org/wiki/Supervised_learning#Empirical_risk_minimization" target="_blank">Empirical Risk Minimization</a>.</p>
<p style="text-align:justify;"><strong>3.</strong> Continuing from the previous discussion this post would derive results on uniform convergence, tie the discussions into a theorem. From this theorem we would have made formal the bias-variance trade-off discussed in the first post.</p>
<p style="text-align:justify;"><strong>4. </strong>Will talk about <a href="http://en.wikipedia.org/wiki/VC_dimension" target="_blank">VC Dimension</a> and the VC bound.</p>
<p style="text-align:justify;">Basically all the results are derived using two very simple lemmas, hence the name of these posts.</p>
<p style="text-align:center;">______</p>
<p style="text-align:justify;"><span style="text-decoration:underline;"><strong>Introduction:</strong></span></p>
<p style="text-align:justify;">Learning theory gives a researcher applying machine learning algorithms some rules of thumb for how best to apply the algorithms that he/she has learnt.</p>
<p style="text-align:justify;"><a href="http://ai.stanford.edu/~ang/" target="_blank">Dr Andrew Ng</a> likens knowing machine learning algorithms to a carpenter acquiring a set of tools. However the difference between a good carpenter and not so good one is the skill in using those tools. In choosing which one to use and how. In the same way Learning Theory gives a &#8220;machine-learnist&#8221; some crude intuitions about how a ML algorithm would work and helps in applying them better.</p>
<p style="text-align:justify;">A lot of people still think of learning theory as a method for getting papers published (I&#8217;d like to use that method, I need papers <img src='http://s.wordpress.com/wp-includes/images/smilies/icon_wink.gif' alt=';-)' class='wp-smiley' />), as it is considered abstruse by many and not of much practical value. A good refutation of this view can be <a href="http://hunch.net/?p=496" target="_blank">seen here</a> on <a href="http://hunch.net/~jl/" target="_blank">John Langford&#8217;s </a>fantastic web-log.</p>
<p style="text-align:center;">______</p>
<p style="text-align:justify;">As put in a popular tutorial by Olivier Bousquet, the process of inductive learning can be summarized as:</p>
<p style="text-align:justify;"><strong>1.</strong> Observe a phenomenon.</p>
<p style="text-align:justify;"><strong>2.</strong> Construct a model of that phenomenon.</p>
<p style="text-align:justify;"><strong>3.</strong> Make predictions using this model.</p>
<p style="text-align:justify;">Dr Bousquet puts it very tersely: the above process can actually be said to be the aim of ALL natural sciences. <strong>Machine learning</strong> <strong>aims to <em><span style="text-decoration:underline;">automate</span></em> the process</strong> and<strong> learning theory tries to <em><span style="text-decoration:underline;">formalize</span></em> it. </strong>I think the above gives a reasonable idea of what learning theory deals with.</p>
<p style="text-align:justify;">Learning theory formalizes terms like generalization, over-fitting and under-fitting. This series of posts (read notes) aims to introduce these terms and then jump to a recap of some important error bounds in learning theory.</p>
<p style="text-align:center;">______</p>
<p style="text-align:justify;"><span style="text-decoration:underline;"><strong>Training Error, Generalization Error and The Bias-Variance Tradeoff:</strong></span></p>
<p style="text-align:justify;">For simplicity let&#8217;s take something as simple as <strong>linear regression</strong>. And since I want this piece to be accessible, I assume no knowledge of linear regression either.</p>
<p style="text-align:justify;">Linear Regression essentially models the relationship between one variable <img src='http://l.wordpress.com/latex.php?latex=X&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='X' title='X' class='latex' /> and another variable <img src='http://l.wordpress.com/latex.php?latex=Y&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='Y' title='Y' class='latex' /> such that the model itself depends linearly on the unknown parameters to be estimated from the data. Let&#8217;s have a look at what this means:</p>
<p style="text-align:justify;">Suppose you have a habit of collecting weird datasets and you end up with a dataset that gives the circumference of the biceps of many men and the distance each of them throws a javelin. You want to predict, for an unknown individual, how far he can throw the javelin given the circumference of his biceps.</p>
<p style="text-align:justify;"><img class="alignright size-medium wp-image-1810" title="javelin" src="http://onionesquereality.wordpress.com/files/2009/10/javelin.jpg?w=230" alt="javelin" width="230" height="300" /><img class="alignleft size-medium wp-image-1811" title="250px-Biceps_887" src="http://onionesquereality.wordpress.com/files/2009/10/250px-biceps_887.jpg?w=224" alt="250px-Biceps_887" width="224" height="300" /></p>
<p style="text-align:justify;">Of course there would be a number of factors that affect the distance a javelin would go, such as skill (which is essentially non-quantitative?), height, the kind of footwear worn, run-up distance, state of health etc. These would be some of the many features that affect the end result (the distance a javelin is thrown). What I essentially mean is that the circumference of the biceps alone isn&#8217;t a realistic feature for predicting how far a javelin can be thrown. But let&#8217;s assume that there is only one feature and it can make <em>reasonable</em> predictions. This over-simplification is made only so that the process can be visualized in a graph.</p>
<p style="text-align:justify;">Suppose you collect about 80 such examples (which you call the training examples) and plot your data as such:</p>
<p style="text-align:center;"><img class="aligncenter size-full wp-image-1813" title="Training Data" src="http://onionesquereality.wordpress.com/files/2009/10/untitled.jpg" alt="untitled" width="470" height="377" /></p>
<p style="text-align:justify;">Now the problem given to you is: Given you have the bicep-circumference measurement of an <em>unknown </em>individual, predict how far he can throw the javelin.</p>
<p style="text-align:justify;">How would one do it?</p>
<p style="text-align:justify;">What we would do is fit some curve to the above training set (the above plot). When we have to make a prediction, we simply plug the new value into our curve and read off the corresponding value for the distance, as illustrated below.</p>
<p style="text-align:center;"><img class="aligncenter size-full wp-image-1816" title="Model Fit - Training Data" src="http://onionesquereality.wordpress.com/files/2009/10/untitled1.jpg" alt="untitled" width="500" height="368" /></p>
<p style="text-align:justify;">The curve can be represented in a number of ways. However, if the curve was to be represented linearly (that&#8217;s why it&#8217;s called linear regression) it could be written as :</p>
<p style="text-align:center;"><img src='http://l.wordpress.com/latex.php?latex=h%28x%29+%3D+%5Ctheta_0+%2B+%5Ctheta_1+x&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='h(x) = \theta_0 + \theta_1 x' title='h(x) = \theta_0 + \theta_1 x' class='latex' /></p>
<p style="text-align:justify;">Where <img src='http://l.wordpress.com/latex.php?latex=h%28x%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='h(x)' title='h(x)' class='latex' /> is the hypothesis, <img src='http://l.wordpress.com/latex.php?latex=%5Ctheta_0&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\theta_0' title='\theta_0' class='latex' /> and  <img src='http://l.wordpress.com/latex.php?latex=%5Ctheta_1&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\theta_1' title='\theta_1' class='latex' /> are unknown parameters which are to be learnt from the data and <img src='http://l.wordpress.com/latex.php?latex=x&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='x' title='x' class='latex' /> is the input feature. Note that this is just the slope-intercept form of a line.</p>
<p style="text-align:justify;">In the above, for simplicity I considered only one feature, there could be many more. In the more general case:</p>
<p style="text-align:center;"><img src='http://l.wordpress.com/latex.php?latex=h_%5Ctheta%28x%29+%3D+%5Ctheta_0+%2B+%5Ctheta_1+x_1+%2B+%5Cdotsb+%2B+%5Ctheta_i+x_i+%5Ccdots+%281%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='h_\theta(x) = \theta_0 + \theta_1 x_1 + \dotsb + \theta_i x_i \cdots (1)' title='h_\theta(x) = \theta_0 + \theta_1 x_1 + \dotsb + \theta_i x_i \cdots (1)' class='latex' /></p>
<p style="text-align:justify;">The <img src='http://l.wordpress.com/latex.php?latex=%5Ctheta&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\theta' title='\theta' class='latex' />&#8217;s are called the parameters (to be learnt from the data) that will decide the nature of the curve.</p>
<p style="text-align:justify;">We see that the equation involves the features of the training examples (<img src='http://l.wordpress.com/latex.php?latex=x&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='x' title='x' class='latex' />&#8217;s); the task of the learning algorithm is therefore to find the optimal values of <img src='http://l.wordpress.com/latex.php?latex=%5Ctheta_i&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\theta_i' title='\theta_i' class='latex' /> using the training set. This can be done with something like <a href="http://en.wikipedia.org/wiki/Gradient_descent" target="_blank"><strong>Gradient Descent</strong>.</a></p>
<p style="text-align:justify;">For any new example, we&#8217;d have the features <img src='http://l.wordpress.com/latex.php?latex=x&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='x' title='x' class='latex' /> and parameters would already be known by running gradient descent using the training set. We simply have to plug in the value of <img src='http://l.wordpress.com/latex.php?latex=x&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='x' title='x' class='latex' /> in equation <img src='http://l.wordpress.com/latex.php?latex=%281%29&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='(1)' title='(1)' class='latex' /> to get a prediction.</p>
<p style="text-align:justify;">To sum up: as mentioned, we use the training set to fit an optimal curve and then predict unseen inputs by simply plugging their values into the &#8220;equation of the curve&#8221;.</p>
<p style="text-align:justify;">Now, it goes without saying that we could fit a <strong>&#8220;simple&#8221; model</strong> to the training set or a more <strong>&#8220;complex&#8221; model</strong>. A simple model would be linear, say something like:</p>
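<p style="text-align:justify;">A minimal sketch of this fit-then-predict loop for the one-feature case follows. It assumes a mean squared error cost, which the post has not stated, and uses made-up noise-free data; the learning rate and step count are arbitrary choices that happen to converge here.</p>

```python
def fit_line(xs, ys, learning_rate=0.05, steps=5000):
    """Batch gradient descent on mean squared error for h(x) = theta0 + theta1 * x."""
    theta0, theta1 = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        residuals = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        grad0 = 2.0 * sum(residuals) / n                              # d(MSE)/d(theta0)
        grad1 = 2.0 * sum(r * x for r, x in zip(residuals, xs)) / n   # d(MSE)/d(theta1)
        theta0 -= learning_rate * grad0
        theta1 -= learning_rate * grad1
    return theta0, theta1

def predict(theta, x):
    """Plug a new input into the learnt line."""
    theta0, theta1 = theta
    return theta0 + theta1 * x

# Hypothetical data generated from y = 1 + 2x
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

theta = fit_line(xs, ys)
print(theta, predict(theta, 5.0))
```

<p style="text-align:justify;">Once the parameters are learnt, the training set is no longer needed: prediction is a single evaluation of the line.</p>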
<p style="text-align:center;"><img src='http://l.wordpress.com/latex.php?latex=y%3D%5Ctheta_0+%2B+%5Ctheta_1x&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='y=\theta_0 + \theta_1x' title='y=\theta_0 + \theta_1x' class='latex' /></p>
<p style="text-align:justify;">and a complex model could be something like this:</p>
<p style="text-align:center;"><img src='http://l.wordpress.com/latex.php?latex=y%3D%5Ctheta_0+%2B+%5Ctheta_1+x+%2B+%5Cdotsb+%2B+%5Ctheta_5+x%5E5&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='y=\theta_0 + \theta_1 x + \dotsb + \theta_5 x^5' title='y=\theta_0 + \theta_1 x + \dotsb + \theta_5 x^5' class='latex' />.</p>
<p style="text-align:justify;">It&#8217;s to be noted that in the above the same feature <img src='http://l.wordpress.com/latex.php?latex=x&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='x' title='x' class='latex' /> is used in different ways, the second model uses <img src='http://l.wordpress.com/latex.php?latex=x&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='x' title='x' class='latex' /> to create more features such as <img src='http://l.wordpress.com/latex.php?latex=x%5E2%2C+x%5E3&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='x^2, x^3' title='x^2, x^3' class='latex' />, and so on. Clearly the second representation is more complex than the first as it will exploit more patterns in the data (it has more parameters).</p>
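<p style="text-align:justify;">The point that the &#8220;complex&#8221; model is still linear in its parameters can be made concrete with a small sketch. The helper names below are hypothetical; the only idea shown is that powers of the same input act as extra features.</p>

```python
def polynomial_features(x, degree):
    """Expand one input into the features [1, x, x^2, ..., x^degree]."""
    return [x ** k for k in range(degree + 1)]

def hypothesis(theta, x):
    """Linear in theta: a dot product of parameters and (possibly nonlinear) features."""
    features = polynomial_features(x, len(theta) - 1)
    return sum(t * f for t, f in zip(theta, features))

simple_theta = [1.0, 2.0]                       # y = 1 + 2x
complex_theta = [1.0, 0.0, 0.0, 0.0, 0.0, 1.0]  # y = 1 + x^5

print(hypothesis(simple_theta, 3.0), hypothesis(complex_theta, 2.0))
```

<p style="text-align:justify;">Both models go through the same dot product; the second simply has more parameters to tune, which is where the extra flexibility comes from.</p>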
<p style="text-align:justify;">However this increase in complexity <em>can</em> lead to problems, in the same way if the model is too simple it <em>can</em> lead to problems. This is illustrated below:</p>
<p style="text-align:justify;"><a href="http://onionesquereality.wordpress.com/files/2009/10/untitled2.jpg?w=300"><img class="alignleft size-medium wp-image-1830" title="High Bias Case" src="http://onionesquereality.wordpress.com/files/2009/10/untitled2.jpg?w=300" alt="untitled" width="229" height="180" /></a><a href="http://onionesquereality.wordpress.com/files/2009/10/untitled3.jpg?w=300"><img class="alignright size-medium wp-image-1831" title="High Variance Case" src="http://onionesquereality.wordpress.com/files/2009/10/untitled3.jpg?w=300" alt="untitled3" width="229" height="180" /></a></p>
<p style="text-align:justify;">
<p style="text-align:justify;">
<p style="text-align:justify;">
<p style="text-align:justify;">
<p style="text-align:justify;">
<p style="text-align:center;">[<strong>Fig 1 </strong>(Left) and <strong>Fig 2</strong> (Right)]</p>
<p style="text-align:justify;">The figure on the left has a &#8220;simple model&#8221; fit to the training set. Clearly there are patterns in the data that the model would never take into account, no matter how big the training set grows. In more concrete terms, the relationship between <img src='http://l.wordpress.com/latex.php?latex=x&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='x' title='x' class='latex' /> and <img src='http://l.wordpress.com/latex.php?latex=y&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='y' title='y' class='latex' /> is clearly not linear, so if we try to fit a linear model to it, <strong>no matter how much we train it, there will always be some patterns in the data that the model fails to capture</strong>.</p>
<p style="text-align:justify;">This means that what is learnt from the training set will not <em><strong>generalize </strong></em>well to unknown examples: an unknown example may come from a part of the distribution that the model fails to account for, in which case the prediction for it would be very inaccurate.</p>
<p style="text-align:justify;">The figure on the right has a &#8220;complex&#8221; model fit to the same set, and the model clearly fits the data very well. But it is again not a good predictor, because it does not represent the general trend of the data and instead captures the idiosyncrasies of this particular training set. This model would make very good predictions on data from the training set itself, but it would not <em><strong>generalize </strong></em>well to unknown examples.</p>
<p style="text-align:justify;">A more appropriate fit would be something like this :</p>
<p style="text-align:justify;"><a href="http://onionesquereality.wordpress.com/files/2009/10/untitled21.jpg"><img class="size-medium wp-image-1841 aligncenter" title="untitled2" src="http://onionesquereality.wordpress.com/files/2009/10/untitled21.jpg?w=300" alt="untitled2" width="300" height="238" /></a>Now we can move to a definition of the <strong>generalization error.</strong> The generalization error of a hypothesis is its <strong><em>expected error on examples that are not from the training set</em></strong>. For an example that helps in understanding generalization, refer to the part labeled &#8220;<strong>Van-Gogh Chagall and Pigeons</strong>&#8221; in <a href="http://onionesquereality.wordpress.com/2009/02/13/face-recognition-in-bees/" target="_blank">this post</a>.</p>
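<p style="text-align:justify;">In symbols, the definition can be sketched as follows. The notation is an assumption introduced here, not something fixed by the text above: a distribution D from which examples are drawn, a hypothesis h, a loss function, and a training set of m examples. The first quantity is the generalization error, the second is the training error that we can actually measure.</p>

```latex
\varepsilon(h) = \mathbb{E}_{(x,y)\sim\mathcal{D}}\big[\ell(h(x), y)\big],
\qquad
\hat{\varepsilon}(h) = \frac{1}{m}\sum_{i=1}^{m} \ell\big(h(x^{(i)}), y^{(i)}\big)
```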
<p style="text-align:justify;">The models shown in figures 1 and 2 both have HIGH generalization errors. However, each suffers from an <strong>entirely different problem</strong>.</p>
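<p style="text-align:justify;">One way to see the two failure modes numerically is to compare the error on the training set with the error on fresh examples that stand in for &#8220;unknown examples&#8221;. The sketch below is only an illustration under assumed conditions: the sine-shaped ground truth, the noise level, the sample sizes and the polynomial degrees are all made up for this example.</p>

```python
import numpy as np

def mean_squared_error(coeffs, x, y):
    """Average squared difference between polynomial predictions and targets."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

def true_function(x):
    # Assumed ground truth for this sketch: a smooth, clearly non-linear curve.
    return np.sin(2.0 * np.pi * x)

rng = np.random.default_rng(0)
x_train = np.linspace(0.0, 1.0, 12)
y_train = true_function(x_train) + rng.normal(scale=0.2, size=x_train.shape)

# Fresh examples from the same source approximate the generalization error.
x_test = rng.uniform(0.0, 1.0, size=2000)
y_test = true_function(x_test) + rng.normal(scale=0.2, size=x_test.shape)

errors = {}
for degree in (1, 3, 10):
    coeffs = np.polyfit(x_train, y_train, degree)
    errors[degree] = (
        mean_squared_error(coeffs, x_train, y_train),
        mean_squared_error(coeffs, x_test, y_test),
    )
    print(f"degree {degree}: train MSE {errors[degree][0]:.4f}, "
          f"test MSE {errors[degree][1]:.4f}")
```

<p style="text-align:justify;">The straight line should do poorly on both sets, as in fig. 1. The degree-10 polynomial, which has almost as many parameters as there are training points, should have a very small training error but a noticeably larger error on the fresh examples, as in fig. 2.</p>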
<p style="text-align:center;">______</p>
<p style="text-align:justify;"><strong>Bias : </strong>As already mentioned, in the model shown in fig. 1, no matter how much the model is trained, there will always be some patterns in the data that it fails to capture. This is because the model has a high <strong><em>BIAS. </em></strong>The bias of a model is the <strong>expected generalization error even if we were to fit it to a very large training set</strong>.</p>
<p style="text-align:justify;">Thus the linear model shown in figure 1 suffers from high bias and <strong>will underfit the data</strong>.</p>
<p style="text-align:justify;"><strong>Variance : </strong>Apart from bias, a second component has a bearing on the generalization error: the variance of the model fit to the training set.</p>
<p style="text-align:justify;">This is shown in fig. 2. We see that even though the model fits the training set very well, there is a risk that we are fitting patterns that are idiosyncratic to the training examples and may not represent the general pattern between <img src='http://l.wordpress.com/latex.php?latex=x&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='x' title='x' class='latex' /> and <img src='http://l.wordpress.com/latex.php?latex=y&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='y' title='y' class='latex' />.</p>
<p style="text-align:justify;">Since we might be fitting spurious patterns and exaggerating minor fluctuations in the data, such a model would still give a high generalization error and <strong>will over-fit the data</strong>. In such a case we say that the model has a <strong>high variance</strong>.</p>
<p style="text-align:justify;"><strong>The Trade-off : </strong>When deciding on a model to fit to the training set, there is a trade-off between bias and variance. If either is high, the generalizing ability of the model will be low (the generalization error will be high). In other words, if the model is too simple, i.e. if it has too few parameters, it will have a high bias, and if the model is too complex it will have a high variance. When deciding on a model we have to strike a balance between the two.</p>
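<p style="text-align:justify;">For squared error, one common way to make this trade-off concrete is the standard decomposition of the expected error into squared bias, variance and irreducible noise. The sketch below estimates the first two by simulation. It rests on assumptions chosen for illustration: a sine-shaped ground truth, fixed training inputs with only the noise redrawn for each training set, and the hypothetical helper <code>bias_variance</code>.</p>

```python
import numpy as np

def true_function(x):
    # Assumed ground truth; any smooth non-linear curve would do.
    return np.sin(2.0 * np.pi * x)

def bias_variance(degree, n_datasets=200, n_train=25, noise=0.2, seed=0):
    """Estimate squared bias and variance of a degree-d polynomial fit.

    Many noisy training sets are drawn at the same inputs, one model is
    fit to each, and the fitted curves are compared on a fixed grid.
    """
    rng = np.random.default_rng(seed)
    x_train = np.linspace(0.0, 1.0, n_train)
    x_grid = np.linspace(0.0, 1.0, 101)
    predictions = np.empty((n_datasets, x_grid.size))
    for i in range(n_datasets):
        y_train = true_function(x_train) + rng.normal(scale=noise, size=n_train)
        coeffs = np.polyfit(x_train, y_train, degree)
        predictions[i] = np.polyval(coeffs, x_grid)
    # Squared bias: how far the average fitted curve is from the truth.
    mean_prediction = predictions.mean(axis=0)
    squared_bias = float(np.mean((mean_prediction - true_function(x_grid)) ** 2))
    # Variance: how much the fitted curves differ from one training set to the next.
    variance = float(np.mean(predictions.var(axis=0)))
    return squared_bias, variance

for degree in (1, 3, 9):
    squared_bias, variance = bias_variance(degree)
    print(f"degree {degree}: bias^2 = {squared_bias:.4f}, variance = {variance:.4f}")
```

<p style="text-align:justify;">As the degree goes up, the squared bias should fall while the variance rises, which is the trade-off described above.</p>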
<p style="text-align:justify;">A very famous example that illustrates this trade-off goes like this:</p>
<p style="text-align:justify;"><img class="aligncenter size-medium wp-image-1877" title="Fall Tree" src="http://onionesquereality.wordpress.com/files/2009/10/fall-tree.jpg?w=300" alt="Fall Tree" width="300" height="200" /></p>
<p style="text-align:justify;">[Suppose there is an exacting biologist who studies and classifies green trees in detail. He is an example of an over-trained or over-fit model: on seeing a tree with non-green leaves, like the one above, he would declare that it is not a tree at all]</p>
<p style="text-align:justify;"><img class="aligncenter size-medium wp-image-1878" title="Cucumber" src="http://onionesquereality.wordpress.com/files/2009/10/cucumber.jpg?w=300" alt="Cucumber" width="300" height="192" /></p>
<p style="text-align:justify;">[An under-trained or under-fit model would be like the above biologist's lazy brother, who, on seeing a cucumber, which is green, declares that it is a tree]</p>
<p style="text-align:justify;">Both of the above have poor generalization. We wish to select a model that has an appropriate trade-off between the two.</p>
<p style="text-align:center;">______</p>
<p style="text-align:justify;"><span style="text-decoration:underline;"><strong>So why do we need Learning Theory?</strong></span></p>
<p style="text-align:justify;">Learning theory is an interesting subject in its own right. It also hones our intuitions about how to apply learning algorithms properly, giving us a set of rules of thumb to guide us.</p>
<p style="text-align:justify;">Learning theory can answer quite a few questions :</p>
<p style="text-align:justify;"><strong>1.</strong> In the previous section there was a small discussion on bias and variance and the trade-off between the two. The discussion sounds logical, but it remains imprecise unless it is <em><strong>formalized.</strong></em> Learning theory can formalize the bias-variance trade-off. This helps because we can then choose a model with just the right bias and variance.</p>
<p style="text-align:justify;"><strong>2.</strong> Learning Theory leads to <strong>model selection methods</strong> by which we can choose automatically what model would be appropriate for a certain training set.</p>
<p style="text-align:justify;"><strong>3.</strong> In Machine Learning, models are fit on the training set. So what we essentially get is the training error. But what we really care about is the generalization ability of the model or the ability to give good predictions on unseen data.</p>
<p style="text-align:justify;">Learning Theory relates the training error to the generalization error, and tells us how doing well on the training set might help us achieve better generalization.</p>
<p style="text-align:justify;"><strong>4.</strong> Learning Theory proves conditions under which learning algorithms will work well. It proves bounds on the worst-case performance of models, giving us an idea of when an algorithm will work properly and when it won&#8217;t.</p>
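<p style="text-align:justify;">To give a flavour of what a model selection method from point 2 can look like in practice, here is a minimal sketch of k-fold cross-validation used to pick a polynomial degree. Cross-validation is only one common choice and is swapped in here for illustration; the data, the candidate degrees and the helper names <code>cross_validation_error</code> and <code>select_degree</code> are all assumptions.</p>

```python
import numpy as np

def cross_validation_error(x, y, degree, n_folds=5):
    """Mean squared validation error of a degree-d polynomial over k folds."""
    indices = np.arange(x.size)
    fold_errors = []
    for held_out in np.array_split(indices, n_folds):
        train = np.setdiff1d(indices, held_out)
        coeffs = np.polyfit(x[train], y[train], degree)
        residuals = np.polyval(coeffs, x[held_out]) - y[held_out]
        fold_errors.append(np.mean(residuals ** 2))
    return float(np.mean(fold_errors))

def select_degree(x, y, candidate_degrees):
    """Return the candidate degree with the lowest cross-validation error."""
    scores = {d: cross_validation_error(x, y, d) for d in candidate_degrees}
    return min(scores, key=scores.get), scores

# Hypothetical data: a quadratic trend with noise, already in random order.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=60)
y = 1.0 + 2.0 * x - 3.0 * x**2 + rng.normal(scale=0.3, size=x.size)

best_degree, scores = select_degree(x, y, candidate_degrees=range(1, 9))
print("selected degree:", best_degree)
```

<p style="text-align:justify;">Each candidate is scored only on examples it was not fit to, so the score acts as a stand-in for the generalization error rather than the training error.</p>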
<p style="text-align:justify;">The next post will answer some of the above questions.</p>
<p style="text-align:center;">______</p>
<p style="text-align:justify;"><a href="http://onionesquereality.wordpress.com/" target="_self"><strong><em>Onionesque Reality</em></strong> Home &#62;&#62;</a></p>
</div>]]></content:encoded>
</item>

</channel>
</rss>
