<?xml version="1.0" encoding="UTF-8"?><!-- generator="wordpress.com" -->
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	>

<channel>
	<title>lucene &#171; WordPress.com Tag Feed</title>
	<link>http://en.wordpress.com/tag/lucene/</link>
	<description>Feed of posts on WordPress.com tagged "lucene"</description>
	<pubDate>Thu, 03 Dec 2009 01:15:40 +0000</pubDate>

	<generator>http://en.wordpress.com/tags/</generator>
	<language>en</language>

<item>
<title><![CDATA[PHP Conference Brasil 2009, Solr, i-Educar and PHPinga]]></title>
<link>http://eriksencosta.wordpress.com/2009/11/29/php-conference-brasil-2009-solr-ieducar-phpinga/</link>
<pubDate>Mon, 30 Nov 2009 02:11:27 +0000</pubDate>
<dc:creator>Eriksen Costa</dc:creator>
<guid>http://eriksencosta.wordpress.com/2009/11/29/php-conference-brasil-2009-solr-ieducar-phpinga/</guid>
<description><![CDATA[Yesterday (28/11), another PHP Conference Brasil 2009 came to a close. And once again the event was incred]]></description>
<content:encoded><![CDATA[Yesterday (28/11), another PHP Conference Brasil 2009 came to a close. And once again the event was incred]]></content:encoded>
</item>
<item>
<title><![CDATA[Coding best practice : Lucene search query : resultSet.close() : part2]]></title>
<link>http://alfrescoshare.wordpress.com/2009/11/27/coding-best-practice-lucene-search-query-resultset-close-part2/</link>
<pubDate>Fri, 27 Nov 2009 10:50:19 +0000</pubDate>
<dc:creator>Enguerrand SPINDLER</dc:creator>
<guid>http://alfrescoshare.wordpress.com/2009/11/27/coding-best-practice-lucene-search-query-resultset-close-part2/</guid>
<description><![CDATA[To continue on the same topic (see Part1 of this post): I have opened a JIRA issue, because I think we]]></description>
<content:encoded><![CDATA[To continue on the same topic (see Part1 of this post): I have opened a JIRA issue, because I think we]]></content:encoded>
</item>
<item>
<title><![CDATA[Updating Document Fields in Lucene]]></title>
<link>http://hrycan.com/2009/11/26/updating-document-fields-in-lucene/</link>
<pubDate>Thu, 26 Nov 2009 15:30:51 +0000</pubDate>
<dc:creator>Nick Hrycan</dc:creator>
<guid>http://hrycan.com/2009/11/26/updating-document-fields-in-lucene/</guid>
<description><![CDATA[Lucene 2.4.1 provides a convenient method for you to update a Document in your Index, namely the upd]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p>Lucene 2.4.1 provides a convenient method for updating a Document in your index, namely the updateDocument method of IndexWriter (shown below).  But what do you do if you want to update only some of the Fields of an existing document?</p>
<pre class="brush: java;">
public void updateDocument(Term term, Document doc)
                    throws CorruptIndexException, IOException
</pre>
<p>Lucene&#8217;s updateDocument operation is basically a delete and an insert wrapped into a single function.  All documents matching the Term parameter are deleted from the Lucene index, and the supplied Document instance is then inserted into the index.  While Lucene allows multiple copies of the same document to exist in the index, the update operation does not insert a copy of the supplied document for every match.  In other words, if your Term matches 5 documents in the index, then 5 documents are deleted and a single document is inserted in their place.</p>
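<p>To make the delete-then-insert behavior concrete, here is a minimal sketch in plain Java.  It does not use Lucene at all: ToyIndex, its methods and the field names are hypothetical stand-ins that only model the semantics described above, assuming a document is a simple map of field names to values.</p>

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy in-memory stand-in for an index; it only models delete-then-add. */
final class ToyIndex {
    // Each "document" is a map of field name to value (hypothetical model).
    private final List<Map<String, String>> docs = new ArrayList<>();

    void addDocument(Map<String, String> doc) {
        docs.add(new LinkedHashMap<>(doc));
    }

    /** Deletes every document whose field equals value; returns how many were removed. */
    int deleteDocuments(String field, String value) {
        int before = docs.size();
        docs.removeIf(d -> value.equals(d.get(field)));
        return before - docs.size();
    }

    /** Models the update: delete all matches, then add exactly one document. */
    int updateDocument(String field, String value, Map<String, String> doc) {
        int deleted = deleteDocuments(field, value);
        addDocument(doc);
        return deleted;
    }

    int size() {
        return docs.size();
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        for (int i = 0; i < 5; i++) {
            Map<String, String> d = new LinkedHashMap<>();
            d.put("category", "book");
            d.put("title", "Title " + i);
            index.addDocument(d);
        }
        Map<String, String> replacement = new LinkedHashMap<>();
        replacement.put("category", "book");
        replacement.put("title", "Only survivor");

        // The term matches all five documents, so all five are removed.
        int deleted = index.updateDocument("category", "book", replacement);
        System.out.println(deleted + " deleted, " + index.size() + " remaining");
        // prints "5 deleted, 1 remaining"
    }
}
```

<p>Matching on a non-unique field such as category is what collapses five documents into one here; matching on a unique identifier field would delete at most one.</p>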
<blockquote><p>
As you can see, it is a very good idea for you to design your documents so they have a field that uniquely identifies them in the entire index.  In the database world, this is called a primary key field.  </p>
<p>At times, it is helpful to think of the Lucene index as a database having a single table and the Documents as rows.  It is a good analogy when you frame it in terms of searching.  Boolean Queries seem to fit this concept nicely.  </p>
<p>However, there are many differences between a Lucene index and a database.  </p>
<ul>
<li>Lucene does not provide a way to enforce field uniqueness.  It is up to you to enforce it in your own code.</li>
<li>Lucene does not require a predefined schema for the documents in the index.  This means the documents in the index need not all have the same number of fields or use the same field names.  As an example, some documents can have the fields (id, url, contents) and other documents can have the fields (productid, manufacturer, summary, review).  </li>
<li>Fields can be repeated in a document.  For example, a document can have 3 product review fields (productid, manufacturer, summary, review, review, review).  We will revisit this later in the code example.</li>
</ul>
</blockquote>
<p>Lucene&#8217;s updateDocument method overwrites the document(s) matching the given Term with the given Document.  This is a problem if you only want to update a few fields and keep the remainder.  </p>
<p>In the scenario pictured below, you can uniquely identify a document in the index whose author field you would like to update.  So you then call updateDocument and pass in the Term and a Document instance populated with the new author field value.  The result is an updated author field and the loss of the 3 other fields previously stored &#8211;  the title, publisher, and contents fields.</p>
<p><img src="http://hrycan.wordpress.com/files/2009/11/lucene-document-update.gif" alt="Visual of Lucene&#39;s update document method" title="lucene-document-update" width="355" height="155" class="aligncenter size-full wp-image-62" /></p>
<p>What do you do when you need to update a subset of the fields in a document but cannot re-create the remaining fields?  There can be many reasons for this dilemma.  Perhaps you are unable to re-create the fields because the original text is not available or perhaps the operations to re-create the fields are very costly.  </p>
<p>One approach to resolve this dilemma is to search for the current document in the index, change the desired fields, and use the modified document as the input to the updateDocument call.  This idea is illustrated below.  <a target="new" href="http://code.google.com/p/hrycan-blog/source/browse/trunk/lucene-highlight/src/com/hrycan/search/UpdateUtil.java">UpdateUtil.java</a> contains the full source.</p>
<pre class="brush: java;">
int docId = hits.scoreDocs[0].doc;

//retrieve the old document
Document doc = searcher.doc(docId);

List&#60;Field&#62; replacementFields = updateDoc.getFields();
for (Field field : replacementFields) {
	String name = field.name();
	String currentValue = doc.get(name);
	if (currentValue != null) {
		//replacement field value

		//remove all occurrences of the old field
		doc.removeFields(name);

		//insert the replacement
		doc.add(field);
	} else {
		//new field
		doc.add(field);
	}
}

//write the old document to the index with the modifications
writer.updateDocument(term, doc);
</pre>
<p>Here we pass in a Document that can hold both replacement fields and additional fields for the document identified by a search that uses the term parameter as the basis for a TermQuery.  First we obtain the list of Fields from the document parameter.  If the matched document already has a field by that name, it is considered a replacement; otherwise it is a new field to be added to the document.</p>
<p>Notice the method first removes all fields in the Document having the same name as the replacement prior to inserting the replacement field.  As mentioned earlier, a Lucene document can have multiple fields with the same name.  </p>
<p><img src="http://hrycan.wordpress.com/files/2009/11/lucene-index.gif" alt="visual of documents stored in a lucene index" title="lucene-index" width="266" height="249" class="aligncenter size-full wp-image-67" /></p>
<p>Without the remove call, you would be adding another value for the field instead of replacing the existing value.  </p>
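<p>One caveat follows from repeated fields.  As the loop above is written, it appears that if the update document itself carried two fields with the same name (say, two review fields), the second pass would remove the first replacement and only the last value would survive.  The sketch below is a plain-Java model of one way around that, not Lucene code: FieldMerge, field and merge are hypothetical names, and a document is modeled as an ordered list of (name, value) pairs.</p>

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Toy model of the merge step; a document is an ordered list of (name, value) fields. */
final class FieldMerge {

    static Map.Entry<String, String> field(String name, String value) {
        return new SimpleEntry<>(name, value);
    }

    /**
     * Returns a copy of oldDoc in which every field named in updateDoc is replaced
     * by all of updateDoc's values for that name; other fields are kept as they were.
     */
    static List<Map.Entry<String, String>> merge(
            List<Map.Entry<String, String>> oldDoc,
            List<Map.Entry<String, String>> updateDoc) {
        // Collect the replaced names first so old values are dropped only once per name.
        Set<String> replaced = new LinkedHashSet<>();
        for (Map.Entry<String, String> f : updateDoc) {
            replaced.add(f.getKey());
        }
        List<Map.Entry<String, String>> merged = new ArrayList<>();
        for (Map.Entry<String, String> f : oldDoc) {
            if (!replaced.contains(f.getKey())) {
                merged.add(f);
            }
        }
        // Append every replacement, including repeated names such as "review".
        merged.addAll(updateDoc);
        return merged;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, String>> oldDoc = new ArrayList<>();
        oldDoc.add(field("productid", "42"));
        oldDoc.add(field("review", "old review"));

        List<Map.Entry<String, String>> updateDoc = new ArrayList<>();
        updateDoc.add(field("review", "first new review"));
        updateDoc.add(field("review", "second new review"));

        // Both new reviews survive; the old review is gone; productid is untouched.
        System.out.println(merge(oldDoc, updateDoc));
    }
}
```

<p>The same two-pass idea (remove each affected name once, then add all replacements) should carry over to the real Document API, but treat that as an assumption to verify against your Lucene version.</p>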
<p>A great tool to view what is actually in your Lucene index is <a target="new" href="http://www.getopt.org/luke/">Luke</a>, the Lucene Index Toolbox.  It is a very helpful tool for answering “what if” questions when you read the Lucene API.</p>
<p>Out of the box, <a target="new" href="http://lucene.apache.org/java/docs/">Lucene</a> does not provide a way to update the individual fields of a document in the index.  However, it is relatively easy to achieve this functionality by grouping together the available API calls.</p>
<p>You can <a target="new" href="http://code.google.com/p/hrycan-blog/source/browse/trunk/lucene-highlight/#lucene-highlight/src/com/hrycan/search">browse the full source</a> at Google Code and download a copy of the entire project via svn:<br />
svn checkout http://hrycan-blog.googlecode.com/svn/trunk/lucene-highlight/</p>
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[Optimizing the Lucene index with Hibernate Search]]></title>
<link>http://jadimeo.wordpress.com/2009/11/21/optimizing-the-lucene-index-with-hibernate-search/</link>
<pubDate>Sat, 21 Nov 2009 00:21:53 +0000</pubDate>
<dc:creator>jadimeo</dc:creator>
<guid>http://jadimeo.wordpress.com/2009/11/21/optimizing-the-lucene-index-with-hibernate-search/</guid>
<description><![CDATA[Here&#8217;s a small tip I found out while troubleshooting why my JVM was exiting in the middle of o]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p>Here&#8217;s a small tip I found out while troubleshooting why my JVM was exiting in the middle of optimizing my index.  </p>
<p>I had written a utility class that creates and closes FullTextSessions for me and, for convenience, also begins and commits a transaction each time.  But this meant that the <code>optimize()</code> method returned while Lucene was still working, leaving my index locked even after the JVM exited:</p>
<pre>FullTextSession s = Search.getFullTextSession(factory.openSession());
s.beginTransaction();
s.getSearchFactory().optimize();
s.getTransaction().commit();
s.close();</pre>
<p>To get the method to block until the optimization work is completed, do not create a transaction:</p>
<pre>FullTextSession s = Search.getFullTextSession(factory.openSession());
s.getSearchFactory().optimize();
s.close();</pre>
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[Coding best practice : Lucene search query : resultSet.close()]]></title>
<link>http://alfrescoshare.wordpress.com/2009/11/19/coding-best-practice-lucene-search-query-resultset-close/</link>
<pubDate>Thu, 19 Nov 2009 10:15:43 +0000</pubDate>
<dc:creator>Enguerrand SPINDLER</dc:creator>
<guid>http://alfrescoshare.wordpress.com/2009/11/19/coding-best-practice-lucene-search-query-resultset-close/</guid>
<description><![CDATA[Just want to share a very important coding best practice if you are using Lucene search query in Al]]></description>
<content:encoded><![CDATA[Just want to share a very important coding best practice if you are using Lucene search query in Al]]></content:encoded>
</item>
<item>
<title><![CDATA[Apache Mahout 0.2 Released - Now classify, cluster and generate recommendations!]]></title>
<link>http://techdigger.wordpress.com/2009/11/18/apache-mahout-0-2-released-now-classify-cluster-and-generate-recommendations/</link>
<pubDate>Wed, 18 Nov 2009 13:48:32 +0000</pubDate>
<dc:creator>TechDigger</dc:creator>
<guid>http://techdigger.wordpress.com/2009/11/18/apache-mahout-0-2-released-now-classify-cluster-and-generate-recommendations/</guid>
<description><![CDATA[Apache Mahout For the past two years, I have been working with this amazing bunch of people whilst, ]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><div class="wp-caption alignright" style="width: 92px"><a href="http://lucene.apache.org/mahout"><img src="http://lucene.apache.org/mahout/images/Mahout-logo-82x100.png" alt="Apache Mahout" width="82" height="100" /></a><p class="wp-caption-text">Apache Mahout</p></div>
<p align="justify">
For the past two years, I have been working with an amazing bunch of people, while being paid by Google through its Summer of Code program, on a project called <a href="http://lucene.apache.org/mahout">Mahout</a>. As the name says, it is trying to tame the young beast known as <a href="http://hadoop.apache.org">Hadoop</a>. I have received a lot from the community. Being part of the project has given me real exposure to Java, data mining and machine learning, and hands-on experience with distributed systems like <a href="http://hadoop.apache.org">Hadoop</a>, <a href="http://hadoop.apache.org/hbase">HBase</a> and <a href="http://hadoop.apache.org/pig">Pig</a>.  The project is still in its infancy, but its ambitions are sky-high. I am happy to announce the second release of the project, and proud to be a part of it. I hope people will adopt it in their projects and that it becomes the de facto standard machine learning library, the way Lucene and Hadoop have in their respective areas.
</p>
<p>If you are already excited and want to take it for a ride, read Grant&#8217;s article on IBM developerWorks <a href="https://www.ibm.com/developerworks/java/library/j-mahout/index.html">here</a>.<br />
The release announcement follows.</p>
<div align="justify" style="font-size:90%;border:1px dashed #337733;padding:10px;">
<p>Apache Mahout 0.2 has been released and is now available for public download at <a href="http://www.apache.org/dyn/closer.cgi/lucene/mahout">http://www.apache.org/dyn/closer.cgi/lucene/mahout</a></p>
<p>Up to date maven artifacts can be found in the Apache repository at<br />
<a href="https://repository.apache.org/content/repositories/releases/org/apache/mahout/">https://repository.apache.org/content/repositories/releases/org/apache/mahout/</a></p>
<p>Apache Mahout is a subproject of Apache Lucene with the goal of delivering scalable machine learning algorithm implementations under the Apache license. http://www.apache.org/licenses/LICENSE-2.0</p>
<p>Mahout is a machine learning library meant to scale: Scale in terms of community to support anyone interested in using machine learning. Scale in terms of business by providing the library under a commercially friendly, free software license. Scale in terms of computation to the size of data we manage today.</p>
<p>Built on top of the powerful map/reduce paradigm of the Apache Hadoop project, Mahout lets you solve popular machine learning problems such as clustering, collaborative filtering and classification over terabytes of data across thousands of computers.</p>
<p>Implemented with scalability in mind, the latest release brings many performance optimizations so that the library performs well even in a single-node setup.</p>
<p>The complete changelist can be found here:</p>
<p><a href="http://issues.apache.org/jira/browse/MAHOUT/fixforversion/12313278">http://issues.apache.org/jira/browse/MAHOUT/fixforversion/12313278</a></p>
<p>New Mahout 0.2 features include</p>
<ul>
<li>Major performance enhancements in Collaborative Filtering, Classification and Clustering</li>
<li>New: Latent Dirichlet Allocation (LDA) implementation for topic modelling</li>
<li>New: Frequent Itemset Mining for mining top-k patterns from a list of transactions</li>
<li>New: Decision Forests implementation for Decision Tree classification (In Memory &#38; Partial Data)</li>
<li>New: HBase storage support for Naive Bayes model building and classification</li>
<li>New: Generation of vectors from Text documents for use with Mahout Algorithms</li>
<li>Performance improvements in various Vector implementations</li>
<li>Tons of bug fixes and code cleanup</li>
</ul>
<p>Getting started: New to Mahout?</p>
<ul>
<li> Download Mahout at <a href="http://www.apache.org/dyn/closer.cgi/lucene/mahout">http://www.apache.org/dyn/closer.cgi/lucene/mahout</a></li>
<li> Check out the Quick start: <a href="http://cwiki.apache.org/MAHOUT/quickstart.html">http://cwiki.apache.org/MAHOUT/quickstart.html</a></li>
<li> Read the Mahout Wiki: <a href="http://cwiki.apache.org/MAHOUT">http://cwiki.apache.org/MAHOUT</a></li>
<li> Join the community by subscribing to mahout-user@lucene.apache.org</li>
<li> Give back: <a href="http://www.apache.org/foundation/getinvolved.html">http://www.apache.org/foundation/getinvolved.html</a></li>
<li> Consider adding yourself to the Powered By wiki page: <a href="http://cwiki.apache.org/MAHOUT/poweredby.html">http://cwiki.apache.org/MAHOUT/poweredby.html</a></li>
</ul>
<p>For more information on Apache Mahout, see <a href="http://lucene.apache.org/mahout">http://lucene.apache.org/mahout</a></p>
</div>
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[Webclient for SVN, Hudson and Artifactory]]></title>
<link>http://kaosktrl.wordpress.com/2009/11/17/webclient-for-svn-hudson-e-artifactory/</link>
<pubDate>Tue, 17 Nov 2009 22:47:03 +0000</pubDate>
<dc:creator>kaosktrl</dc:creator>
<guid>http://kaosktrl.wordpress.com/2009/11/17/webclient-for-svn-hudson-e-artifactory/</guid>
<description><![CDATA[Hello everyone. Over the past few days I have ventured further into the world of continuous integration becau]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p>Hello everyone. Over the past few days I have ventured further into the world of continuous integration, because I am working on an outsourced Java project where I am helping a group of developers adopt tools that make their work easier, such as a continuous integration system.</p>
<p>I started from a project with Subversion installed on one machine, and I know the project uses JBoss with JDK 1.5 (I still don&#8217;t know why).</p>
<p>On another, remote machine I installed JDK 1.5.0.22, Ant 1.7.1, Maven 2.2.1 (so far, two months into the project, they have used only Ant and would like to move to Maven), JBoss 5.1.0, <a href="http://community.polarion.com/index.php?page=overview&#38;project=svnwebclient" target="_blank"><strong>Webclient for SVN 3.1.0</strong></a> and <a href="http://www.jfrog.org/" target="_blank"><strong>Artifactory 2.1.2</strong></a>.</p>
<p>A small note: I wanted to use <a href="http://www.sventon.org/" target="_blank"><strong>Sventon</strong></a>, but the latest version had problems with its libraries and the earlier versions gave me trouble as well.</p>
<p>You don&#8217;t know what Artifactory is? Then you may still be new to Maven (not that I am not <img src='http://s.wordpress.com/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' />  )</p>
<p>Artifactory is a web application for managing Maven repositories, developed in Java and released under the LGPL 3.0 license. The application is built by JFrog Ltd, a privately held Israeli company.</p>
<p>Structurally it relies on <a href="http://jackrabbit.apache.org/" target="_blank">Apache Jackrabbit</a> for its implementation of the <strong>JSR 170</strong> specification, the so-called Java Content Repository (other content managers, such as <a href="http://www.nuxeo.org/xwiki/bin/view/FAQ/Nuxeo52JCR" target="_blank">Nuxeo, also rely on Jackrabbit</a>).</p>
<p>A small note: JSR 170 (released in 2005) is followed by <strong>JSR 283</strong>, released recently (25 September 2009); both are led by David Nuescheler, CTO of the Swiss company <a href="http://www.day.com/day/en.html" target="_blank">Day Software</a>.</p>
<p>Artifactory also relies on <a href="http://lucene.apache.org/java/docs/" target="_blank">Apache Lucene</a> to index files and on <a href="http://wicket.apache.org/" target="_blank">Apache Wicket</a> for the user interface.</p>
<p>The application is now released as a standalone package, but there is also a WAR version that can be deployed on an application server, so I put it on JBoss together with Hudson and Webclient for SVN (I then tied these two together with the <a href="http://wiki.hudson-ci.org/display/HUDSON/Polarion+Plugin" target="_blank">Hudson plugin for Polarion</a>, which links the list of changed files to the files shown in the web client and also lets you view the differences from previous versions).</p>
<p>The architecture also relies on a database, <a href="http://db.apache.org/derby/" target="_blank">Apache Derby</a> by default, though this can be changed (see also <a href="http://blog.vinodsingh.com/2009/07/managing-maven-repository-with.html" target="_blank">here</a>).</p>
<p>Apart from the fact that Artifactory supports <a href="http://wiki.jfrog.org/confluence/display/RTF/Authenticating+with+LDAP" target="_blank">LDAP authentication</a> out of the box, what leaves you speechless is how simple it is to use (it ought to be compared with <a href="http://nexus.sonatype.org/" target="_blank"><strong>Nexus</strong></a>, but I don&#8217;t have the time; the base version certainly lacks LDAP, although I recently discovered that a third-party plugin exists, and its developers have certainly produced a fine <a href="http://www.sonatype.com/products/maven/documentation/book-defguide" target="_blank">guide to Maven</a> and a <a href="http://m2eclipse.sonatype.org/" target="_blank">plugin for Eclipse</a>). Once I had gone to</p>
<p>http://localhost:8080/artifactory</p>
<p>and logged in, I went to the administration panel and, in the repository section, created a local repository for deploying your own JAR files and a virtual repository to which I added the local one plus all the remote ones, such as those of Maven, JBoss and so on. Done! You can now also generate the section of the settings.xml file to add to your Maven project.</p>
<p>For an older guide, see <a href="http://www.theserverside.com/tt/articles/article.tss?l=SettingUpMavenRepository" target="_blank">here</a>. For a comparison with Nexus and Archiva, see this <a href="http://docs.codehaus.org/display/MAVENUSER/Maven+Repository+Manager+Feature+Matrix" target="_blank">link</a> (note that it is written by the same author as the Artifactory guide, but it seems sound).</p>
<p>Happy Maven, everyone!</p>
<p>Note: Webclient for SVN is based on Subversion 1.4 (with later versions I had problems because it uses an old SVNKit that I cannot upgrade, since it depends on a particular class that is no longer present in recent versions). You can find Subversion 1.4 here:</p>
<p>http://downloads-guests.open.collab.net/servlets/ProjectDocumentList?folderID=6</p>
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[Open Source Search Social]]></title>
<link>http://richmarr.wordpress.com/2009/11/05/open-source-search-social-2/</link>
<pubDate>Thu, 05 Nov 2009 21:36:06 +0000</pubDate>
<dc:creator>Richard Marr</dc:creator>
<guid>http://richmarr.wordpress.com/2009/11/05/open-source-search-social-2/</guid>
<description><![CDATA[It&#8217;s been a little while since the last Open Source Search Social, so we&#8217;re getting real]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p>It&#8217;s been a little while since <a href="http://richmarr.wordpress.com/2009/05/28/open-source-search-social/">the last Open Source Search Social</a>, so we&#8217;re getting really imaginative and holding another one, this time on Wednesday the 18th of November. As usual the event is in the Pelican pub just off London&#8217;s face-bleedingly trendy Portobello Road.</p>
<p>The format is staying roughly the same. No agenda, no attitude, just some geeks talking about search and related topics in the presence of intoxicating substances.</p>
<p>Please come along if you can, just get in touch or <a href="http://upcoming.yahoo.com/event/4839558/us/London/Open-source-search-social/">sign up on the Upcoming page</a>.</p>
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[Documentum Search – Lucene, FAST, Verity, Google and upcoming DSS]]></title>
<link>http://blog.tsgrp.com/2009/10/27/documentum-search-%e2%80%93-lucene-fast-verity-google-and-upcoming-dss/</link>
<pubDate>Tue, 27 Oct 2009 20:56:33 +0000</pubDate>
<dc:creator>TSG Dave</dc:creator>
<guid>http://blog.tsgrp.com/2009/10/27/documentum-search-%e2%80%93-lucene-fast-verity-google-and-upcoming-dss/</guid>
<description><![CDATA[Since the new Documentum Search Services beta program just started last week, we thought we would sh]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p>Since the new Documentum Search Services beta program just started last week, we thought we would share some of TSG’s thoughts on full-text search and our plans to add Lucene capabilities to our open source offerings.</p>
<p>Documentum Search Services (DSS) was tentatively called Enterprise Search Services (ESS) early in its product development.  DSS promises to be “the next generation of search in EMC and will be built upon xDB with Apache Lucene as the underlying indices”.  Specific highlights from EMC World included:</p>
<ul>
<li>Relevance Sorting</li>
<li>Advanced Query Processing
<ul>
<li>Parallel, Native Facet computation, XQuery for structured and unstructured search</li>
<li>Lower Hardware and Storage Costs</li>
</ul>
</li>
<li>Native VMWare, NAS, SAN support and Advanced Data Placement</li>
</ul>
<p>At the present time, DSS is targeted for heavy testing through the end of 2009 with a release in 2010.</p>
<p><strong>TSG Thoughts on DSS</strong></p>
<p>At the present time, we are very encouraged by the progress and the direction of DSS.  We have been using Lucene for a couple of clients and can safely say that the tool will address many of the shortcomings of FAST, including index rebuild, overall performance and server requirements.  That being said, the scope of DSS needs to encompass all of the Documentum API level functionality that FAST or Verity have addressed in the past.  As the beta progresses, the devil is truly in the details of how DSS evolves, so we will withhold our final thoughts until the beta is complete.</p>
<p><strong>Other Tools (Autonomy, Google Appliance, SearchBlox, Vivisimo….)</strong></p>
<p>As an integrator, we do get asked to integrate different search tools.  We began working with Autonomy for EMC on an internal Documentum project (pre-Documentum purchase) back in the late 90s.  Overall, most search tools meet full-text needs but are typically built as “crawlers” focused on the web.  As a crawler, the tool needs to scan a directory/website for changes and then update the full-text index.  We have found this approach difficult when Documentum clients want to do true “Documentum Searches” that combine attributes, security and full text.  For example, one client wanted to search for secure documents for a certain plant (attribute), with a certain create date (attribute) and containing a given part number (full-text).</p>
<p>Also, a couple of clients have had concerns about the latency between when a document is stored in Documentum and when it is indexed (after the crawler runs) in the full-text search engine.  One client complained that with FAST, the latency was sometimes 2 minutes and other times 2 hours.</p>
<p>Our last concern with the crawler approach is how to get the attribute and security data into the index.  Without it, we have to run one query against Documentum (plant, create date, security) and another against full-text (part-number), and then display only the results that appear on both lists.</p>
<p><strong>Native Lucene with Documentum?</strong></p>
<p>One scenario we are building out for clients is a Documentum 5.3 or 6.5 application that indexes documents into Lucene from either Documentum or a cached copy (<a href="http://www.tsgrp.com/multimedia/CIS_Case_Study.pdf" target="_blank">whitepaper here</a>).   To differentiate from DSS, our approach won’t provide support for inline DQL but rather a pure web services approach.</p>
<p>In the diagram below, both OpenMigrate and HPI use OpenContent web services to communicate with Lucene.  OpenMigrate is used to keep the Lucene index up to date, and HPI is used to query the index for full text searches and optionally metadata searches as well:</p>
<p><img class="alignnone size-full wp-image-461" title="full_text_arch" src="http://tsgrp.wordpress.com/files/2009/10/full_text_arch1.jpg" alt="full_text_arch" width="594" height="227" /></p>
<p>A couple of key factors:</p>
<ul>
<li>5.3 Support – we are focused on supporting 5.3, 6.0, 6.5 and future releases.  Many of our clients have chosen to delay their upgrades for a variety of reasons.  By implementing Lucene now, clients can remove FAST from their current environment and from an eventual D6.5 upgrade.</li>
<li>Attributes – we are focused on storing the content, attributes and security in Lucene to avoid having to search both the Documentum attributes and the Lucene full-text index.</li>
<li>Indexing – we are leveraging OpenMigrate to index/delete content and metadata in Lucene on a real-time, multi-threaded push basis to avoid a crawler approach.   We think the push approach can better control updates to the index, reduce server load on the full-text index and improve audit control to ensure everything is indexed.</li>
<li>Security – One issue we addressed was how to balance security against high-performance search.  Verifying through a Documentum lookup that the user has access to browse each document returned by the search is expensive and would hurt performance, as identified in the crawler discussion above.  One approach was to cache ACL information with each document in Lucene and update it as ACLs are updated.  Since Documentum ACLs don’t change often, we would use a single lookup to retrieve the user’s ACL access and add that information to the Lucene query.</li>
</ul>
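<p>As a rough illustration of the security point above, the sketch below builds a query string in Lucene&#8217;s classic query-parser syntax that requires both a full-text match and membership in one of the user&#8217;s ACLs.  It is not our actual implementation: the class AclQuery, the field names contents and acl, and the sample ACL names are all hypothetical, and it assumes ACL names are indexed as single terms that need no escaping.</p>

```java
import java.util.Arrays;
import java.util.List;

/** Sketch: restrict a full-text query to documents whose cached ACL the user can read. */
final class AclQuery {

    /**
     * Builds a query in Lucene's classic query-parser syntax. The field names
     * "contents" and "acl" are hypothetical; terms are assumed to need no escaping.
     */
    static String build(String fullTextTerm, List<String> userAcls) {
        if (userAcls.isEmpty()) {
            throw new IllegalArgumentException("user has no readable ACLs");
        }
        // '+' marks a required clause; with the parser's default OR operator,
        // acl:(a b) matches a document carrying any one of the listed names.
        return "+contents:" + fullTextTerm + " +acl:(" + String.join(" ", userAcls) + ")";
    }

    public static void main(String[] args) {
        // In practice this list would come from the single Documentum lookup per user.
        List<String> acls = Arrays.asList("plant_a_readers", "engineering");
        System.out.println(build("pn12345", acls));
        // prints "+contents:pn12345 +acl:(plant_a_readers engineering)"
    }
}
```

<p>Because the ACL restriction travels with the query, the index returns only documents the user may see, and no per-result Documentum check is needed.</p>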
<p>So far our results have been favorable.  Please <a href="http://www.tsgrp.com/forms/contact-us.jsp">contact us</a> if you are interested in this type of solution as we are looking for additional case studies.</p>
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[Article: Solr 1.4 Offers Richer Document Indexing, Faster Search]]></title>
<link>http://dee-annleblanc.com/2009/10/27/article-solr-1-4-offers-richer-document-indexing-faster-search/</link>
<pubDate>Tue, 27 Oct 2009 17:04:23 +0000</pubDate>
<dc:creator>deeleb</dc:creator>
<guid>http://dee-annleblanc.com/2009/10/27/article-solr-1-4-offers-richer-document-indexing-faster-search/</guid>
<description><![CDATA[In this article I cover the performance and feature improvements coming to Solr 1.4, along with chat]]></description>
<content:encoded><![CDATA[In this article I cover the performance and feature improvements coming to Solr 1.4, along with chat]]></content:encoded>
</item>
<item>
<title><![CDATA[Lucene Highlighter HowTo]]></title>
<link>http://hrycan.com/2009/10/25/lucene-highlighter-howto/</link>
<pubDate>Sun, 25 Oct 2009 22:14:45 +0000</pubDate>
<dc:creator>Nick Hrycan</dc:creator>
<guid>http://hrycan.com/2009/10/25/lucene-highlighter-howto/</guid>
<description><![CDATA[Background When you perform a search at Google or Bing, you enter your search terms, click a search ]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p><strong>Background</strong><br />
When you perform a search at Google or Bing, you enter your search terms, click a search button, and your search results appear.  Each search result displays the title, the URL, and a text fragment containing your search terms in bold.</p>
<p>Consider what happens when you search for &#8216;Apache&#8217; at Google.  Your results would include the Apache server, the Apache Software Foundation, the Apache Helicopter, and Apache County.   The contextual fragments displayed with each search result help you judge whether a search result is an appropriate match and whether you need to add search terms to narrow the result space.  Search would not be as user friendly as it is today without these fragments.</p>
<p>This post covers version 2.4.1 of <a href="http://lucene.apache.org/java/docs/index.html" target="new">Apache Lucene</a>, the popular open source search engine library written in Java.  It may not be widely known, but Lucene provides a way to generate these contextual fragments so your system can display them with each search result.  The functionality is not found in lucene-core-2.4.1.jar but in the contrib library lucene-highlighter-2.4.1.jar.  The contrib libraries are included with the Lucene download and are located in the contrib folder once the download is unzipped.</p>
<p>If you are not familiar with Lucene, you can think of it as a library which provides</p>
<ul>
<li> a way to create a search index from multiple text items</li>
<li> a way to quickly search the index and return the best matches.</li>
</ul>
<p>A more thorough explanation of Lucene can be found at the <a href="http://wiki.apache.org/lucene-java/LuceneFAQ" target="new">Apache Lucene FAQ</a>.</p>
<p>As an example of what the Lucene Highlighter can do, here is what appears when I search for &#8216;queue&#8217; in an index of PDF documents.</p>
<blockquote><p>e14510.pdf<br />
Oracle Coherence Getting Started Guide<br />
of the ways that Coherence can eliminate bottlenecks is to <strong>queue</strong> up  transactions that have occurred&#8230;<br />
duration of an item within the <strong>queue</strong> is configurable, and is referred to as the  Write-Behind Delay. When data changes, it is added to the write-behind <strong>queue</strong> (if it is  not already in the <strong>queue</strong>), and the <strong>queue</strong> entry is set to ripen after the configured  Write-Behind Delay has passed&#8230;</p></blockquote>
<p><strong>The Steps</strong><br />
First, before you can display highlighted fragments with each search result, the text to highlight must be available.  Shown below is a snippet of indexing code.  We are storing the text that will be used to generate the fragment in the contents field.</p>
<pre class="brush: java;">
Document doc = new Document();
doc.add(new Field(&#34;contents&#34;, contents, Field.Store.COMPRESS,
    Field.Index.ANALYZED, Field.TermVector.WITH_POSITIONS_OFFSETS));
doc.add(new Field(&#34;title&#34;, bookTitle, Field.Store.YES,
    Field.Index.NOT_ANALYZED));
doc.add(new Field(&#34;filepath&#34;, f.getCanonicalPath(), Field.Store.YES,
    Field.Index.NOT_ANALYZED));
doc.add(new Field(&#34;filename&#34;, f.getName(), Field.Store.YES,
    Field.Index.NOT_ANALYZED));
writer.addDocument(doc);
</pre>
<p>The values Field.Store.COMPRESS and Field.Store.YES tell Lucene to store the field in the index for later retrieval with a doc.get() invocation.</p>
<p>Field.Store.COMPRESS causes Lucene to store the contents field in a compressed form in the index.  Lucene automatically uncompresses it when it is retrieved.</p>
<p>Field.Index.ANALYZED indicates the field is searchable and an Analyzer will be applied to its contents.  An example of an Analyzer is StandardAnalyzer.  One of the things done by StandardAnalyzer is to remove stopwords (a, as, it, the, to, &#8230;) from the text being indexed.</p>
<blockquote><p>
Note: You should use the same analyzer type (like StandardAnalyzer) for your indexing and searching operations; otherwise you will not get the results you are seeking.
</p></blockquote>
<p>The last part of the indexing side is the TermVectors.  From the <a href="http://lucene.apache.org/java/2_4_1/api/org/apache/lucene/document/Field.TermVector.html#YES" target="new">Lucene Javadocs</a>:<br />
&#8220;A term vector is a list of the document&#8217;s terms and their number of occurrences in that document.&#8221;</p>
<p>For the Highlighter, TermVectors need to be available and you have a choice of either computing and storing them with the index at index time or computing them as you need them when the search is performed.  Above,  Field.TermVector.WITH_POSITIONS_OFFSETS indicates we are computing and storing them in the index at index time.</p>
<p>With the index ready for presenting contextual fragments, let&#8217;s move on to generating them while processing a search request.  Below is a typical “Hello World” type search block.</p>
<pre class="brush: java;">
QueryParser qp = new QueryParser(&#34;contents&#34;, analyzer);
Query query = qp.parse(searchInput);
TopDocs hits = searcher.search(query, 10);

for (int i = 0; i &#60; hits.scoreDocs.length; i++) {
	int docId = hits.scoreDocs[i].doc;
	Document doc = searcher.doc(docId);
	String filename = doc.get(&#34;filename&#34;);
	String contents = doc.get(&#34;contents&#34;);

	// Build up to 5 highlighted fragments of about 100 characters for this hit
	String[] fragments = hlu.getFragmentsWithHighlightedTerms(analyzer,
                   query, &#34;contents&#34;, contents, 5, 100);
}
</pre>
<p>Starting at the top, we create a query based on the user supplied search string, searchInput, using the QueryParser.  Lucene supports a sophisticated <a href="http://lucene.apache.org/java/2_4_1/queryparsersyntax.html" target="new">query language</a> and QueryParser simplifies transforming the supplied string to a query object.  Next, we get the top 10 results matching the query.  This is pretty standard so far, but now in the loop we come to the getFragmentsWithHighlightedTerms call.</p>
<p>Here is the code to generate the fragments:</p>
<pre class="brush: java;">
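TokenStream stream = TokenSources.getTokenStream(fieldName, fieldContents,
                      analyzer);
// Cache the tokens so the scorer and the highlighter can both read them
CachingTokenFilter cachedStream = new CachingTokenFilter(stream);
SpanScorer scorer = new SpanScorer(query, fieldName, cachedStream);
Fragmenter fragmenter = new SimpleSpanFragmenter(scorer, 100);

Highlighter highlighter = new Highlighter(scorer);
highlighter.setTextFragmenter(fragmenter);
// Rewind the cached tokens in case the scorer has already consumed them
cachedStream.reset();
String[] fragments = highlighter.getBestFragments(cachedStream, fieldContents, 5);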
</pre>
<p>First we obtain the TokenStream.  The call shown above re-analyzes the stored text with the analyzer, so it works even when term vectors were not stored in the index at index time.</p>
<p>Next is the SpanScorer and SimpleSpanFragmenter.  These work to break the contents into 100-character fragments and rank them by relevancy.  You can use SpanScorer and SimpleSpanFragmenter or QueryScorer and SimpleFragmenter.  The full details can be found in the Javadocs.</p>
<blockquote><p>
Note: when indexing large files, like the full contents of <a href="http://www.redhat.com/docs/en-US/JBoss_Enterprise_Application_Platform/" target="new">PDF manuals</a>, you might need to tell the Highlighter object to look at the full text by calling the setMaxDocCharsToAnalyze method with Integer.MAX_VALUE or a more appropriate value.   In my case, the default value specified by Lucene was too small, thus Highlighter did not look at the full text to generate the fragments.  This was not good because the match I was seeking was near  the end of the contents.
</p></blockquote>
<p>Finally, we tell the Highlighter to return the best 5 fragments.  </p>
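<p>To make the fragment step concrete, here is a toy sketch in plain Java that does not use Lucene at all.  It cuts the text into fixed-size windows, keeps the window containing the most occurrences of a single term, and wraps each match in strong tags.  The class and method names (ToyHighlighterSketch, bestFragment, countOccurrences) are made up for illustration, and the scoring is an assumption that is far cruder than what SpanScorer and SimpleSpanFragmenter actually do.</p>
<pre class="brush: java;">
import java.util.regex.Pattern;

// Toy illustration only: this is not how the Lucene Highlighter is implemented.
public class ToyHighlighterSketch {

    // Returns the fixed-size window of text with the most occurrences of term,
    // with every occurrence wrapped in &#60;strong&#62; tags.
    static String bestFragment(String text, String term, int fragmentSize) {
        if (term.isEmpty() || fragmentSize &#60;= 0) {
            throw new IllegalArgumentException("term must be non-empty and fragmentSize positive");
        }
        String lowerTerm = term.toLowerCase();
        String best = "";
        int bestHits = -1;
        for (int start = 0; start &#60; text.length(); start += fragmentSize) {
            int end = Math.min(text.length(), start + fragmentSize);
            String fragment = text.substring(start, end);
            int hits = countOccurrences(fragment.toLowerCase(), lowerTerm);
            if (hits &#62; bestHits) {  // the earliest window wins ties
                best = fragment;
                bestHits = hits;
            }
        }
        // (?i) makes the match case-insensitive; $0 re-inserts the matched text.
        return best.replaceAll("(?i)" + Pattern.quote(term), "&#60;strong&#62;$0&#60;/strong&#62;");
    }

    // Counts non-overlapping occurrences of needle in haystack.
    static int countOccurrences(String haystack, String needle) {
        int count = 0;
        int from = haystack.indexOf(needle);
        while (from &#62;= 0) {
            count++;
            from = haystack.indexOf(needle, from + needle.length());
        }
        return count;
    }
}
</pre>
<p>Unlike the real Highlighter, this sketch can cut a word, or even a match, in half at a window boundary, and it only understands a single literal term rather than a parsed Lucene query.</p>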
<p>The full code for this example can be downloaded from <a href="http://code.google.com/p/hrycan-blog/" target="new">my Google Code project</a>.  The source file that makes use of the Highlighter is <a href="http://code.google.com/p/hrycan-blog/source/browse/trunk/lucene-highlight/src/com/hrycan/search/HighlighterUtil.java" target="new">HighlighterUtil.java</a></p>
<p>You can also find examples of using Highlighter in the Lucene SVN Repository, specifically <a href="http://svn.apache.org/viewvc/lucene/java/branches/lucene_2_4/contrib/highlighter/src/test/org/apache/lucene/search/highlight/HighlighterTest.java?view=markup" target="new">HighlighterTest.java</a></p>
<p>As you can see, returning search results with contextual fragments containing your search terms is very easy with the Lucene Highlighter contrib library once you know the steps to follow.  </p>
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[Lucene search in PHP does not have stemming]]></title>
<link>http://stevelogic.wordpress.com/2009/10/24/lucene-search-in-php-does-not-have-stemming/</link>
<pubDate>Sat, 24 Oct 2009 22:43:48 +0000</pubDate>
<dc:creator>stevelogic</dc:creator>
<guid>http://stevelogic.wordpress.com/2009/10/24/lucene-search-in-php-does-not-have-stemming/</guid>
<description><![CDATA[I&#8217;m working on a web project which uses Zend&#8217;s implementation of Lucene in PHP. Fyi, Luc]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p>I&#8217;m working on a web project which uses <a href="http://framework.zend.com/manual/en/zend.search.lucene.html">Zend&#8217;s implementation of Lucene in PHP</a>.  FYI, <a href="http://en.wikipedia.org/wiki/Lucene">Lucene</a> is an open source text retrieval library that you can use to provide full-text search.  The project I&#8217;m working on uses Lucene to provide full-text search on the site, and the problem they were having was that searches for, say, the word &#8220;orange&#8221; would not bring up articles that had the word &#8220;oranges&#8221;.  The problem was that no stemming was being done.  <a href="http://en.wikipedia.org/wiki/Stemming">Stemming</a> makes it so that if you search for a word like &#8220;happy&#8221;, it also matches on similar terms like &#8220;happily&#8221;, &#8220;happiness&#8221;, etc.    </p>
<p>So I thought it would be easy.  Just turn on stemming.  But I was a bit surprised that Zend&#8217;s Lucene implementation does not include a stemmer.  After a quick search, I found <a href="http://devzone.zend.com/article/3593">someone who had implemented a stemming analyzer for Zend&#8217;s Lucene</a>.  Tossed that in and voila, search is instantly more usable.  Anyway, Zend should really package this with their Lucene Search.</p>
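<p>To see why stemming fixes the orange/oranges mismatch, here is a deliberately naive suffix-stripping sketch, written in Java rather than PHP.  It is not the Porter algorithm and it is not the analyzer from the linked article; the class name and the handful of rules are made up for illustration.  The idea is that the same reduction is applied to indexed text and to query terms, so different forms of a word end up as the same token.</p>
<pre class="brush: java;">
// Toy stemmer: a few hand-picked English suffix rules, nothing more.
public class NaiveStemmerSketch {

    static String stem(String word) {
        String w = word.toLowerCase();
        // Strip a plural "s", but leave words ending in "ss" alone.
        if (w.endsWith("s") &#38;&#38; !w.endsWith("ss")) {
            w = w.substring(0, w.length() - 1);
        }
        // Strip one common derivational suffix.
        if (w.endsWith("ness")) {
            w = w.substring(0, w.length() - 4);
        } else if (w.endsWith("ly")) {
            w = w.substring(0, w.length() - 2);
        }
        // Map a final "y" to "i" so that "happy" lines up with "happi".
        if (w.endsWith("y")) {
            w = w.substring(0, w.length() - 1) + "i";
        }
        return w;
    }
}
</pre>
<p>With these rules, &#8220;oranges&#8221; and &#8220;orange&#8221; both reduce to &#8220;orange&#8221;, and &#8220;happy&#8221;, &#8220;happily&#8221; and &#8220;happiness&#8221; all reduce to &#8220;happi&#8221;.  A real stemmer handles far more cases, which is why dropping in an existing stemming analyzer is the better route.</p>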
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[Some Alfresco optimization/tuning solutions]]></title>
<link>http://alfrescoshare.wordpress.com/2009/10/23/some-alfresco-optimizationtuning-solutions/</link>
<pubDate>Fri, 23 Oct 2009 16:40:50 +0000</pubDate>
<dc:creator>Enguerrand SPINDLER</dc:creator>
<guid>http://alfrescoshare.wordpress.com/2009/10/23/some-alfresco-optimizationtuning-solutions/</guid>
<description><![CDATA[We had a call with Mike Farman (Alfresco Product Manager) this morning, and here is the summary of o]]></description>
<content:encoded><![CDATA[We had a call with Mike Farman (Alfresco Product Manager) this morning, and here is the summary of o]]></content:encoded>
</item>
<item>
<title><![CDATA[Article: Apache Lucene 2.9.0 Offers Performance Optimization]]></title>
<link>http://dee-annleblanc.com/2009/10/17/article-apache-lucene-2-9-0-offers-performance-optimization/</link>
<pubDate>Sat, 17 Oct 2009 13:52:45 +0000</pubDate>
<dc:creator>deeleb</dc:creator>
<guid>http://dee-annleblanc.com/2009/10/17/article-apache-lucene-2-9-0-offers-performance-optimization/</guid>
<description><![CDATA[In this article, I look at performance optimization improvements coming to Apache Lucene 2.9.0: http]]></description>
<content:encoded><![CDATA[In this article, I look at performance optimization improvements coming to Apache Lucene 2.9.0: http]]></content:encoded>
</item>
<item>
<title><![CDATA[Semantic Embed: Part 2]]></title>
<link>http://web2point5.wordpress.com/2009/10/16/semantic-embed-part-2/</link>
<pubDate>Fri, 16 Oct 2009 16:32:37 +0000</pubDate>
<dc:creator>Kate Ray</dc:creator>
<guid>http://web2point5.wordpress.com/2009/10/16/semantic-embed-part-2/</guid>
<description><![CDATA[This is my second posting on an event by the New York Semantic Web Meetup, which covers all aspects ]]></description>
<content:encoded><![CDATA[This is my second posting on an event by the New York Semantic Web Meetup, which covers all aspects ]]></content:encoded>
</item>
<item>
<title><![CDATA[Senior Systems Administrator and Operations Engineer needed]]></title>
<link>http://mindsourceinc.wordpress.com/2009/10/13/senior-systems-administrator-and-operations-engineer-needed/</link>
<pubDate>Wed, 14 Oct 2009 00:53:35 +0000</pubDate>
<dc:creator>Michelle</dc:creator>
<guid>http://mindsourceinc.wordpress.com/2009/10/13/senior-systems-administrator-and-operations-engineer-needed/</guid>
<description><![CDATA[Our client in Redwood City, CA is in need of a seasoned system administrator and operations engineer]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p>Our client in <strong>Redwood City, CA</strong> is in need of a seasoned system administrator and operations engineer to manage, maintain and support multiple environments spanning several different architectures. The successful candidate will have strong exposure to large web site deployments with multiple tiers, including web front-ends, middleware components, backend services, data storage systems and 2nd/3rd tier components like caching and search layers.</p>
<p><strong>Requirements:</strong></p>
<ul>
<li>7 years Linux admin experience</li>
<li>5 years RHEL admin experience</li>
<li>5 years RPM management and admin experience</li>
<li>3 years working with MySQL</li>
<li>Sound understanding of web services and related technologies (HTTP, REST, XML, JSON, etc)</li>
<li>Experience with Linux web servers (Apache2, nginx, lighttpd, etc)</li>
<li>Experience with monitoring systems and components (nagios, cacti, etc)</li>
<li>Exposure to scripting languages such as Perl, Python, etc.</li>
</ul>
<p><strong>Really nice-to-haves but not required:</strong></p>
<ul>
<li>Experience with Erlang</li>
<li>Experience with Ruby on Rails</li>
<li>Exposure to Jabber/XMPP</li>
<li>Exposure to distributed / grid computing</li>
<li>ejabberd administration and configuration</li>
<li>Experience with second and third tier services and components like memcached (or like systems), search applications (lucene, xapian, sphinx, etc), etc.</li>
</ul>
<p><strong>Realistically, these are also nice to haves but not required:</strong></p>
<ul>
<li>Exposure to functional programming languages.</li>
<li>Exposure to the Facebook Platform, Open Social and/or other web services and systems</li>
<li>Comfortable working in an Agile work environment.</li>
<li>Plays games and occasionally keeps tabs on the gaming world/industry.</li>
</ul>
<p>If interested, please send us your resume along with the rate per hour, contact number, and availability for a phone interview to <a href="mailto:raj@mindsource.com?subject=I am interested in the Systems Administrator and Operations Engineer position">raj@mindsource.com</a>.</p>
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[Building PyLucene on Windows]]></title>
<link>http://atsuoishimoto.wordpress.com/2009/10/03/pyluceneonwindows/</link>
<pubDate>Sun, 04 Oct 2009 04:58:13 +0000</pubDate>
<dc:creator>atsuoishimoto</dc:creator>
<guid>http://atsuoishimoto.wordpress.com/2009/10/03/pyluceneonwindows/</guid>
<description><![CDATA[Requirements PyLucene You can't get started without this. Download the source archive. Visual Studio 2008 Visual Studio 2008 Express ]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><h2>Requirements</h2>
<ul>
<li><a href="http://lucene.apache.org/pylucene/">PyLucene<br />
</a>You can't get started without this. Download the source archive.</li>
</ul>
<ul>
<li>Visual Studio 2008<br />
I have not tried this with Visual Studio 2008 Express Edition.</li>
</ul>
<ul>
<li><a href="http://ant.apache.org/">Apache Ant<br />
</a>Used to build PyLucene. Download the binary distribution and install it wherever you like. I installed it in c:\Program Files\apache-ant.</li>
</ul>
<ul>
<li><a href="http://java.sun.com/javase/downloads/index.jsp">Java SE Development Kit</a></li>
</ul>
<ul>
<li><a href="http://www.cygwin.com/">Cygwin</a><br />
When installing, be sure to select Subversion and make in the Devel category.</li>
</ul>
<ul>
<li><a href="http://pypi.python.org/pypi/setuptools">setuptools</a></li>
</ul>
<h2>Environment setup</h2>
<p>First, add the directory containing the Java runtime's jvm.dll, along with the JDK and Cygwin bin directories, to PATH. In my case, I added:</p>
<p>c:\Program Files\Java\jre6\bin\client;C:\Program Files\Java\jdk1.6.0_16\bin;c:\cygwin\bin;</p>
<p>Next, set the JAVA_HOME environment variable. In my environment, I set:</p>
<p>JAVA_HOME = C:\Program Files\Java\jre6</p>
<h2>Building JCC</h2>
<p>First, build the JCC module, which generates the C++ interface to the Java library.</p>
<h3>1. Edit setup.py</h3>
<p>Edit setup.py in &#60;pylucene&#62;\jcc to match your environment. In my environment, I changed it as follows.</p>
<pre>
Index: setup.py
===================================================================
--- setup.py    (revision 820678)
+++ setup.py    (working copy)
@@ -39,7 +39,7 @@
'ipod': '/usr/include/gcc',
'linux2': '/usr/lib/jvm/java-6-openjdk',
'sunos5': '/usr/jdk/instances/jdk1.6.0',
-    'win32': 'o:/Java/jdk1.6.0_02',
+    'win32': 'C:/Program Files/Java/jdk1.6.0_16',
}
if 'JCC_JDK' in os.environ:
JDK[platform] = os.environ['JCC_JDK']
@@ -61,7 +61,7 @@
'linux2': ['-fno-strict-aliasing', '-Wno-write-strings'],
'sunos5': ['-features=iddollar',
'-erroff=badargtypel2w,wbadinitl,wvarhidemem'],
-    'win32': [],
+    'win32': ["/O2", "/FD", "/EHsc", "/MD", "/GR", "/c", "/Zi", "/TP"],
}

# added to CFLAGS when JCC is invoked with --debug
</pre>
<h3>2. Build &#38; install</h3>
<p>After that, build and install it like any ordinary Python module:</p>
<pre>
cd &#60;pylucene&#62;\jcc
python setup.py install
</pre>
<p>That completes the JCC build.</p>
<h2>Building PyLucene</h2>
<h3>1. Edit the Makefile</h3>
<p>Edit the Makefile to match your environment. In my environment, I changed it as follows.</p>
<pre>
Index: Makefile
===================================================================
--- Makefile    (revision 820678)
+++ Makefile    (working copy)
@@ -105,12 +105,14 @@
#NUM_FILES=2

# Windows   (Win32, Python 2.5.1, Java 1.6, ant 1.7.0)
-#PREFIX_PYTHON=/cygdrive/o/Python-2.5.2/PCbuild
-#ANT=JAVA_HOME=o:\Java\jdk1.6.0_02 /cygdrive/o/java/apache-ant-1.7.0/bin/ant
-#PYTHON=$(PREFIX_PYTHON)/python.exe
-#JCC=$(PYTHON) -m jcc --shared
-#NUM_FILES=2
+PREFIX_PYTHON=/cygdrive/c/Python26
+JAVA_HOME=C:\Progra~1\Java\jdk1.6.0_16\
+ANT=/c/Progra~1/apache-ant/bin/ant.bat
+PYTHON=$(PREFIX_PYTHON)/python.exe
+JCC=$(PYTHON) -m jcc.__main__ --shared
+NUM_FILES=2

+
#
# No edits required below
#
</pre>
<h3>2. Build &#38; install</h3>
<pre>
cd &#60;pylucene&#62;
make
make install
</pre>
<p>That completes the build. Once it is installed, run</p>
<pre>
make test
</pre>
<p>to check that everything works properly.</p>
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[Apache Lucene 2.9 is here!]]></title>
<link>http://mycontainer.wordpress.com/2009/09/28/apaclucene-2-9/</link>
<pubDate>Mon, 28 Sep 2009 07:56:46 +0000</pubDate>
<dc:creator>wirbelschleppe</dc:creator>
<guid>http://mycontainer.wordpress.com/2009/09/28/apaclucene-2-9/</guid>
<description><![CDATA[The popular Java search framework Apache Lucene has been released in version 2.9. Alongside a rewor]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p>The popular Java search framework Apache Lucene has been released in version 2.9. Alongside <a href="http://lucene.apache.org/java/2_9_0/changes/Changes.html" target="_blank">reworked Unicode support</a>, there is a new query parser framework. Anyone upgrading from version 2.4 should take care, however, because parts of the API have changed. Incidentally, JAXenter has set up a <a href="http://it-republik.de/jaxenter/serien/?s=16" target="_blank">series of articles on Lucene</a>.</p>
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[Introduction to Lucene]]></title>
<link>http://meghsoft.wordpress.com/2009/09/05/introduction-to-lucene/</link>
<pubDate>Sat, 05 Sep 2009 06:48:55 +0000</pubDate>
<dc:creator>meghsoft</dc:creator>
<guid>http://meghsoft.wordpress.com/2009/09/05/introduction-to-lucene/</guid>
<description><![CDATA[Lucene is an extremely rich and powerful full-text search library written in Java. You can use Lucen]]></description>
<content:encoded><![CDATA[Lucene is an extremely rich and powerful full-text search library written in Java. You can use Lucen]]></content:encoded>
</item>
<item>
<title><![CDATA[Apache Lucene - First Tutorial]]></title>
<link>http://webappeng.wordpress.com/2009/09/04/apache-lucene-first-tutorial/</link>
<pubDate>Fri, 04 Sep 2009 23:34:30 +0000</pubDate>
<dc:creator>betala</dc:creator>
<guid>http://webappeng.wordpress.com/2009/09/04/apache-lucene-first-tutorial/</guid>
<description><![CDATA[Download Lucene 2.4.1 , Extract to  c:\downloads Copy the luceneweb.war to the &lt;Tomcat Installati]]></description>
<content:encoded><![CDATA[Download Lucene 2.4.1 , Extract to  c:\downloads Copy the luceneweb.war to the &lt;Tomcat Installati]]></content:encoded>
</item>
<item>
<title><![CDATA[Free and Lucid Gaze At Optimizing Apache Lucene-Based Apps]]></title>
<link>http://irinaguseva.wordpress.com/2009/08/31/free-and-lucid-gazing-at-apache-lucene-based-apps/</link>
<pubDate>Mon, 31 Aug 2009 17:33:55 +0000</pubDate>
<dc:creator>Irina  Guseva</dc:creator>
<guid>http://irinaguseva.wordpress.com/2009/08/31/free-and-lucid-gazing-at-apache-lucene-based-apps/</guid>
<description><![CDATA[Lucid Imagination continues to show its (mainly dollar-amount-driven) dedication to open source Apac]]></description>
<content:encoded><![CDATA[Lucid Imagination continues to show its (mainly dollar-amount-driven) dedication to open source Apac]]></content:encoded>
</item>
<item>
<title><![CDATA[PHPLucene: what's different]]></title>
<link>http://mfathur.wordpress.com/2009/08/30/phplucene-apa-yang-berbeda/</link>
<pubDate>Sun, 30 Aug 2009 00:43:29 +0000</pubDate>
<dc:creator>mfathur</dc:creator>
<guid>http://mfathur.wordpress.com/2009/08/30/phplucene-apa-yang-berbeda/</guid>
<description><![CDATA[You may ask: different from what? From Lucene.Net, of course, hehehehe, because I have never]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p>You may ask: different from what? From Lucene.Net, of course, hehehehe, because I have never used the Java one.<br />
In principle, the .NET version and the PHP version are the same. In fact, or so I am told, since I have not tried it myself, an index built by one can be read by any of the others.<br />
So what is the difference? Here are a few notes, and more may be added later:<br />
1. In .NET the work is split into two classes, one for reading and one for writing, while in PHP a single class is enough.<br />
2. When searching, PHP by default searches all indexed fields, so there is no need to fuss with terms and queries.</p>
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[Trieschnigg, Pezik, Lee, de Jong, Kraaij, and Rebholz-Schuhmann (2009) MeSH Up: Effective MeSH Text Classification for Improved Document Retrieval]]></title>
<link>http://lingpipe-blog.com/2009/08/28/trieschnigg-2009-mesh-up-effective-mesh-textclassification-for-improved-document-retrieval/</link>
<pubDate>Fri, 28 Aug 2009 17:57:36 +0000</pubDate>
<dc:creator>lingpipe</dc:creator>
<guid>http://lingpipe-blog.com/2009/08/28/trieschnigg-2009-mesh-up-effective-mesh-textclassification-for-improved-document-retrieval/</guid>
<description><![CDATA[I&#8217;m about to implement the technique for assigning Medical Subject Heading (MeSH) terms to tex]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p>I&#8217;m about to implement the technique for assigning <a href="http://www.nlm.nih.gov/mesh/">Medical Subject Heading (MeSH)</a> terms to text described in this paper from the <a href="http://en.wikipedia.org/wiki/Three-letter_acronym">TLA</a>s EBI, HMI and TNO ICT:</p>
<ul>
<li>Trieschnigg, Pezik, Lee, de Jong, Kraaij, and Rebholz-Schuhmann (2009) <a href="http://www.ncbi.nlm.nih.gov/pubmed/19376821">MeSH Up: Effective MeSH Text Classification for Improved Document Retrieval</a>.  <i>Bioinformatics</i>. [<a href="http://bioinformatics.oxfordjournals.org/cgi/pmidlookup?view=long&#38;pmid=19376821">open access home</a>, including link to the <a href="http://bioinformatics.oxfordjournals.org/cgi/data/btp249/DC1/1">Methodology Supplement</a>]
</ul>
<p>The same technique could be applied to assign Gene Ontology (GO) terms to texts,  tags to tweets or blog posts or consumer products, or keywords to scientific articles.</p>
<h3>k-NN via Search</h3>
<p>To assign MeSH terms to a text chunk, they:</p>
<ol>
<li> convert the text chunk to a search query,
<li> run the query against a relevance-based MEDLINE index, then
<li> rank MeSH terms by frequency in the top k (k=10) hits.
</ol>
<p>In other words, k-nearest-neighbors (k-NN) where &#8220;distance&#8221; is implemented by a relevance-based search.  </p>
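<p>The last of the three steps can be sketched in a few lines of plain Java.  This is only an illustration under stated assumptions: the search against the MEDLINE index is assumed to have already returned the top k hits, each hit is represented simply as the list of MeSH terms attached to that citation, and the class and method names (MeshKnnSketch, rankTerms) are made up here rather than taken from the paper.</p>
<pre class="brush: java;">
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the final ranking step only; running the search is not shown.
public class MeshKnnSketch {

    // Each inner list holds the MeSH terms attached to one of the top-k hits.
    // Returns the distinct terms, most frequent first.
    static List&#60;String&#62; rankTerms(List&#60;List&#60;String&#62;&#62; topHits) {
        Map&#60;String, Integer&#62; counts = new HashMap&#60;&#62;();
        for (List&#60;String&#62; hitTerms : topHits) {
            for (String term : hitTerms) {
                counts.merge(term, 1, Integer::sum);
            }
        }
        List&#60;String&#62; ranked = new ArrayList&#60;&#62;(counts.keySet());
        // Higher count first; ties broken alphabetically so the order is stable.
        ranked.sort((a, b) -&#62; {
            int byCount = Integer.compare(counts.get(b), counts.get(a));
            return byCount != 0 ? byCount : a.compareTo(b);
        });
        return ranked;
    }
}
</pre>
<p>How many of the ranked terms to keep is a separate decision that this sketch leaves to the caller.</p>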
<h3>Just the Text, Ma&#8217;am</h3>
<p>Trieschnigg et al. concatenated the title and abstract of MEDLINE citations into a single field for both document indexing and query creation.  </p>
<p>k-NN implemented as search could be faceted to include journals, years, authors, etc.  For products, this could include all the facets seen on sites like Amazon or New Egg.</p>
<h3>Language-Model-Based Search</h3>
<p>They use language-model-based search, though I doubt that&#8217;s critical for success.  Specifically, they estimate the maximum-likelihood unigram language model for the query and interpolated (with a model trained on all documents) model for each document, and then rank documents versus that query by cross-entropy of the query model given the document model (given the MLE estimate for the query, this is just the log probability of the query in the document&#8217;s LM.) Other LM-based search systems measure similarity by <a href="http://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence">KL-divergence</a>.</p>
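<p>As a rough sketch of that scoring, the following plain Java computes the log probability of a query under a document model that is linearly interpolated with a collection model.  The interpolation weight lambda, the whitespace-free token arrays, and the names (QueryLikelihoodSketch, logProb, count) are all assumptions made for illustration; the paper and supplement do not give these details.</p>
<pre class="brush: java;">
import java.util.HashMap;
import java.util.Map;

// Illustrative query-likelihood scoring with linear interpolation.
public class QueryLikelihoodSketch {

    static Map&#60;String, Integer&#62; count(String[] tokens) {
        Map&#60;String, Integer&#62; counts = new HashMap&#60;&#62;();
        for (String token : tokens) {
            counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }

    // Sums log(lambda * P(w | doc) + (1 - lambda) * P(w | collection)) over the
    // query tokens, using maximum-likelihood estimates for both models.
    // Assumes doc and collection are non-empty; a query word missing from both
    // yields negative infinity.
    static double logProb(String[] query, String[] doc, String[] collection, double lambda) {
        Map&#60;String, Integer&#62; docCounts = count(doc);
        Map&#60;String, Integer&#62; collectionCounts = count(collection);
        double total = 0.0;
        for (String word : query) {
            double pDoc = docCounts.getOrDefault(word, 0) / (double) doc.length;
            double pCollection = collectionCounts.getOrDefault(word, 0) / (double) collection.length;
            total += Math.log(lambda * pDoc + (1.0 - lambda) * pCollection);
        }
        return total;
    }
}
</pre>
<p>Documents are then ranked by this score, highest first.  The collection term keeps a document from scoring negative infinity just because it lacks one query word.</p>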
<p>There weren&#8217;t any details on stemming, stoplisting, case normalization or tokenization in the paper or supplement; just a pointer to author Wessel Kraaij&#8217;s Ph.D. thesis on LM-based IR. </p>
<h3>Application to MEDLINE</h3>
<p>The text being assigned MeSH terms was another MEDLINE title-plus-abstract.   This may seem redundant given that MEDLINE citations are already MeSH annotated, but it&#8217;s useful if you&#8217;re the one at NLM who has to assign the MeSH terms, or if you want a deeper set of terms (NLM only assigns a few per document).   </p>
<p>It&#8217;s easy to apply the authors&#8217; approach to arbitrary texts, such as paragraphs out of textbooks or full text articles or long queries of the form favored by TREC.</p>
<h3>Efficient k-NN!</h3>
<p>I did a double-take when I saw k-nearest-neighbors and efficiency together.  As we all know, k-NN scales linearly with training set size and MEDLINE is huge.  <i>But</i>, in this case, the search toolkit can do the heavy lifting.  The advantage of doing k-NN here is that it reproduces the same kind of sparse assignment of MeSH terms as are found in MEDLINE itself. </p>
<h3>Error Analysis</h3>
<p>The authors did a nice job in the little space they devoted to error analysis, with more info in the supplements (PR curves and some more parameter evaluations and the top few hits for one example).  They reported that k-NN was better than some other systems (e.g. thesaurus/dictionary-based tagging and direct search with MeSH descriptors as pseudo-documents) at assigning the sparse set of MeSH terms found in actual MEDLINE citations.  </p>
<p>Errors tended to be more general MeSH terms that just happened to show up in related documents.   I&#8217;d also like to see how sensitive performance is to the parameter setting of k=10, as it was chosen to optimize F measure against the sparse terms in MEDLINE.  (All results here are for optimal operating points (aka oracle results), which means the results are almost certainly over-optimistic.)</p>
<h3>What You&#8217;ll Need for an Implementation</h3>
<p>It should be particularly easy to reproduce for anyone with:</p>
<ul>
<li> half a <a href="http://lingpipe-blog.com/2009/02/18/lucene-24-in-60-seconds/">clue about Lucene</a> (search),
<li> a <a href="http://alias-i.com/lingpipe/demos/tutorial/medline/read-me.html">MEDLINE parser</a>, and
<li> optionally, a <a href="http://lingpipe-blog.com/2009/07/02/medical-subject-headings-mesh-parser/">MeSH parser</a>.
</ul>
<p>Look for a demo soon.</p>
<h3>Discriminative Semi-Supervised Classifiers?</h3>
<p>It&#8217;d be nice to see an evaluation with text generated from MeSH and articles referencing those terms that used any of the semi-supervised or positive-only training algorithms (even just sampling negative training instances randomly) with some kind of discriminative classifier like logistic regression or SVMs.</p>
<h3>Improving IR with MeSH (?)</h3>
<p>I didn&#8217;t quite follow this part of the paper as I wasn&#8217;t sure what exactly they indexed and what exactly they used as queries.  I think they&#8217;re assigning MeSH terms to the query and then adding them to the query.  Presumably they also index the MeSH terms for this.  </p>
<p>I did get that only the best MeSH-term assigner improved search performance (quantized to a single number using TREC&#8217;s mean average precision metric).  </p>
<p>Alternatives like generating a 24,000-category multinomial classifier are possible, but won&#8217;t run very fast (though if it&#8217;s our tight naive Bayes, it might be as fast as the authors</p>
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[Social User Interface (SUIT): is the Enterprise 2.0 emperor naked?]]></title>
<link>http://saidimu.wordpress.com/2009/08/26/social-user-interface-suit-the-enterprise-2-0-emperor-is-naked/</link>
<pubDate>Tue, 25 Aug 2009 21:31:38 +0000</pubDate>
<dc:creator>saidimu apale</dc:creator>
<guid>http://saidimu.wordpress.com/2009/08/26/social-user-interface-suit-the-enterprise-2-0-emperor-is-naked/</guid>
<description><![CDATA[A lot has been written about why Enterprise 2.0 projects fail (the latest I&#8217;ve read: 14 Reason]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p>A lot has been written about why Enterprise 2.0 projects fail (the latest I&#8217;ve read: <a href="http://blogs.zdnet.com/Hinchcliffe/?p=718" target="_blank">14 Reasons Why Enterprise 2.0 Projects Fail</a>).</p>
<p>Most of the articles I&#8217;ve read seem to ignore one point that I think is critical. In this post I will attempt to highlight that point.</p>
<p>In a previous incarnation with an international non-profit research organization, I piloted and implemented a &#8220;rogue&#8221; internal social networking application. It had to be rogue for the usual reasons (initial apathy from colleagues, hostility from IT types, zero resource-allocation, among others).</p>
<h4>How hard is it to be social?</h4>
<p>The biggest predictor of success, from my experience, was how easy the application made it to <em>be social</em>. You&#8217;d be surprised that the vast majority of enterprise 2.0 applications make it quite difficult to <em>be social</em> in the way users already are. No wonder a lot of (re)training is required. No wonder user uptake is slow, if it takes off at all.</p>
<p>Instead of leveraging what should be the greatest asset out there, namely that humans already are incredibly social and want to further their social circles, most &#8220;social applications&#8221; toss out ways that work in the analog &#8220;real world&#8221; and try to digitally recreate unfamiliar ways of being.</p>
<p>Consider the following all-too-common mind-bending scenarios:</p>
<ul>
<li><strong>File-centric social applications</strong>:  I don&#8217;t know anyone who voluntarily shares files with colleagues just for the sake of sharing the file. People share information, seek comments, show off their knowledge etc. Sharing the file is a means, not an end in itself. It still reeks of <em>file-ism</em> even after you add a few <em>social</em> widgets and re-brand.</li>
</ul>
<ul>
<li><strong>The requirement for voodoo markup</strong> in some wiki applications: who wants to do that when there are other, albeit non-social, ways that have the added benefit of WYSIWYG editors?</li>
</ul>
<ul>
<li><strong>The expectation that browsing to a website</strong> will be the primary interface for interaction with the application: what ever happened to email/IM integration? I purposefully only include email and IM as they already are accepted as legitimate in most enterprises (email more so than IM).</li>
</ul>
<ul>
<li><strong>Context-aware search</strong>: it isn&#8217;t enough that search is content-aware. If users are to meaningfully engage with the torrent of generated information, context-aware search engines should display relevant information <em>without the user necessarily searching for it</em>. This is important as one cannot possibly search for information they are not aware exists!</li>
</ul>
<h4>Social like I already am</h4>
<p><span style="font-weight:normal;">Despite a deep desire both philosophically and practically to go the open-source route, I finally had to make do with a quasi-open platform (<a href="http://www.jivesoftware.com/beyond/clearspace" target="_blank">Clearspace</a> from Jive Software, now confusingly re-branded <a href="http://www.jivesoftware.com/products/solutions" target="_blank">Jive Social Business Software</a>) principally because Clearspace made it easy to <em>be social</em> in a way that users already were.</span></p>
<p>This was pleasantly borne out when users, some with little training but most with no training, took to the platform with surprising gusto. In fact, some users developed such a personal affinity for the platform that they considered it <em>&#8220;our space and HR shouldn&#8217;t mess with it&#8221;</em>.</p>
<p>The sum total of tools that a user is expected to use in order to engage <em>socially</em> is what I&#8217;d call a <strong>Social User Interface (SUIT)</strong>. A well-designed SUIT is the net effect of a conscious design decision to engage users in a <em>socially intuitive</em> manner, at both the visible and the invisible level. On this metric, a good number of enterprise 2.0 applications need some dressing-up.</p>
<h4>How one application suits up</h4>
<p>Here&#8217;s how Clearspace won me over:</p>
<ul>
<li>People, and the resulting conversations, were at the center of Clearspace&#8217;s implementation. Users &#8220;got it&#8221; after a short period of using it. More &#8220;complex options&#8221; like tag-clouds often only required quick explanations.</li>
</ul>
<ul>
<li>The wiki app wasn&#8217;t even called a wiki, and most definitely did not look like one, even though it was one. Expanding or restricting the list of allowed editors and commenters was a breeze.</li>
</ul>
<ul>
<li>A really neat feature is <a href="http://www.jivesoftware.com/jivespace/docs/DOC-1985" target="_blank">tight email integration</a> where one can start/continue conversations, and create new content, from email. This allowed people who loathed visiting <em>another</em> website to be part of the community, right from within their email application. With time, some of these same people saw the added value of interacting directly and regularly with the web application, because <em>it was worth their time</em>. IM integration was achieved via <a href="http://www.igniterealtime.org/projects/openfire/index.jsp" target="_blank">Openfire</a>, which Jive open-sourced and which integrates very tightly with Clearspace. This provided another avenue for people to be part of the community, with the exact same permissions, from their IM clients.</li>
</ul>
<ul>
<li>Clearspace has a great &#8220;more like this&#8221; widget that displays people/content deemed similar enough to what you are currently interacting with. The search engine behind this is the venerable open-source <a href="http://lucene.apache.org/" target="_blank">Apache Lucene</a>.
<ul>
<li>This is one area that is painfully lacking in <a href="http://elgg.org/" target="_blank">Elgg</a>, a promising and open-source &#8220;social engine&#8221; (I&#8217;m currently working on a Lucene plugin for Elgg).</li>
<li>A &#8220;semantic search engine&#8221;, for instance <a href="http://opencalais.com/" target="_blank">OpenCalais</a>, might even work better (I&#8217;m also working on an experimental plugin for Clearspace and Elgg).</li>
</ul>
</li>
</ul>
<h4>Better filters, and the &#8220;Trust Spectrum&#8221;</h4>
<p>There is definitely room for improvement in all social platforms/engines/applications that I have come across:</p>
<ul>
<li>There is no clear way to quickly determine where the content you are creating will end up in the <a href="http://www.giatalks.com/2008/12/the-trust-spectrum/" target="_blank">&#8220;trust spectrum&#8221;</a> (coined by <a href="http://www.giatalks.com/about/" target="_blank">Gia Lyons</a>: the digital corollary to the observation that we selectively share information in the real world). No application, not even Facebook, has an intuitive and quick indicator of where in the trust spectrum your post will be placed. Remember Twitter&#8217;s <a href="http://mashable.com/2008/12/26/dm-fail/" target="_blank">DM Fail</a>?</li>
</ul>
<ul>
<li>Most social applications use a points system to highlight top/most-helpful/trusted contributors. While a noble attempt at capturing the social nuance that &#8220;not all opinions are created equal&#8221;, it is not something that comes naturally to most people. What is needed is a better filter (which is what this feature really is) that takes into account how we actually decide whom to pay attention to in everyday life. I have some draft ideas on how this may work, perhaps for another post.</li>
</ul>
<p>What do you think? What has your experience been with regard to Social User Interfaces (SUITs)?</p>
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[SOLR NOT condition trick]]></title>
<link>http://mvmn.wordpress.com/2009/08/21/solr-not-condition-trick/</link>
<pubDate>Fri, 21 Aug 2009 20:30:49 +0000</pubDate>
<dc:creator>mvmn</dc:creator>
<guid>http://mvmn.wordpress.com/2009/08/21/solr-not-condition-trick/</guid>
<description><![CDATA[Apache SOLR, an &#8220;open source enterprise search server&#8221;, demonstrates indecently inconsis]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p><a href="http://lucene.apache.org/solr/">Apache SOLR</a>, an &#8220;open source enterprise search server&#8221;, demonstrates indecently inconsistent and illogical handling of <strong>NOT</strong> conditions in its queries (for example, &#8220;<code>x:1 AND NOT x:2</code>&#8221; produces 0 results, yet &#8220;<code>x:1 AND (NOT x:2)</code>&#8221; produces correct results). </p>
<p>We couldn&#8217;t play with parentheses because our conditions are autogenerated, so we needed a generic solution for this. </p>
<p>Fortunately, a workaround turned up with some simple Googling. Instead of querying for &#8220;&#8230; <code>AND/OR (NOT [condition])</code>&#8221; just use &#8220;&#8230; <code>AND/OR (*:* NOT [condition])</code>&#8221; or &#8220;&#8230; <code>AND/OR (*:* AND NOT [condition])</code>&#8221;.</p>
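<p>Since the conditions are autogenerated, the rewrite can be applied mechanically. Here is a minimal, hypothetical sketch in Python: the helper names <code>solr_not</code> and <code>join_conditions</code> are made up for illustration, and the code only builds the query string; it does not talk to a SOLR server.</p>
<pre>
```python
def solr_not(condition):
    """Hypothetical helper: negate a condition using the match-all workaround.

    Prefixing *:* gives the NOT clause a positive set to subtract from.
    """
    return "(*:* NOT " + condition + ")"


def join_conditions(operator, conditions):
    """Hypothetical helper: join already-built clauses with AND or OR."""
    if operator not in ("AND", "OR"):
        raise ValueError("operator must be AND or OR")
    return (" " + operator + " ").join(conditions)


# Build the example from the post: x:1 and not x:2.
query = join_conditions("AND", ["x:1", solr_not("x:2")])
print(query)
```
</pre>
<p>This prints <code>x:1 AND (*:* NOT x:2)</code>. As far as I understand it, the trick works because <code>*:*</code> matches every document, so the negated clause has something to exclude from.</p>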
</div>]]></content:encoded>
</item>

</channel>
</rss>
