<?xml version="1.0" encoding="UTF-8"?><!-- generator="wordpress.com" -->
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	>

<channel>
	<title>character-encoding &#171; WordPress.com Tag Feed</title>
	<link>http://en.wordpress.com/tag/character-encoding/</link>
	<description>Feed of posts on WordPress.com tagged "character-encoding"</description>
	<pubDate>Sun, 27 Dec 2009 19:52:36 +0000</pubDate>

	<generator>http://en.wordpress.com/tags/</generator>
	<language>en</language>

<item>
<title><![CDATA[Why it is important to specify a character encoding]]></title>
<link>http://kahrn.wordpress.com/2009/10/29/why-it-is-important-to-specify-a-character-encoding/</link>
<pubDate>Thu, 29 Oct 2009 23:26:20 +0000</pubDate>
<dc:creator>kahrn</dc:creator>
<guid>http://kahrn.wordpress.com/2009/10/29/why-it-is-important-to-specify-a-character-encoding/</guid>
<description><![CDATA[Many website designers design really scrappy websites that do not follow standards at all. I myself ]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p>Many web designers build <strong>really</strong> scrappy websites that do not follow standards at all. I tend to write all my XHTML to be XHTML 1.1 compliant. As a reader of this blog, you presumably try to follow standards too.</p>
<p>Usually I make sure everything passes XHTML Transitional validation. One thing I usually ignore, however, is the character encoding.</p>
<p id="firstHeading">Put simply, the character encoding tells the browser how to turn the bytes of a document into characters, so that the page is displayed as originally intended. For instance, a site written in a Japanese encoding (e.g. Shift_JIS, which encodes the JIS X 0208 character set) will not display correctly unless the browser knows which encoding was used and a font covering those characters is installed on your computer.</p>
<p>If no character encoding is specified, the browser falls back to a default or tries to guess one. So specifying a character encoding is a must when developing sites that use characters outside that default. But there is a more important reason, one that applies even if you only develop English websites in UTF-8 or ISO 8859-1: leaving the encoding unspecified is a potential <a href="http://code.google.com/p/doctype/wiki/ArticleUtf7" target="_blank">security vulnerability</a>.</p>
<p>Essentially, when no character encoding is specified, the page may be open to an XSS-style attack. The attacker encodes the JavaScript code using UTF-7, so that it contains no literal angle brackets and slips past filters that only look for those. When a client&#8217;s web browser then attempts to autodetect the encoding, it may detect UTF-7, decode the payload into real markup, and execute the JavaScript code.</p>
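<p>To make the UTF-7 trick concrete, here is a minimal sketch in Python. It is not from the original post: the payload and the names <code>looks_harmless</code> and <code>decode_as_utf7</code> are made up for illustration, and the "filter" is deliberately naive. It shows that the same bytes contain no angle brackets when read as ASCII, yet turn into a script element when read as UTF-7.</p>

```python
# Hypothetical attacker-supplied bytes. UTF-7 spells the opening angle
# bracket as +ADw- and the closing one as +AD4-.
PAYLOAD = b"+ADw-script+AD4-alert(1)+ADw-/script+AD4-"

LT = chr(0x3C)  # opening angle bracket
GT = chr(0x3E)  # closing angle bracket


def looks_harmless(raw: bytes) -> bool:
    """Naive filter (an assumption for this sketch): reject only literal angle brackets."""
    text = raw.decode("ascii")
    return LT not in text and GT not in text


def decode_as_utf7(raw: bytes) -> str:
    """What a browser that guessed UTF-7 would effectively see."""
    return raw.decode("utf-7")


if __name__ == "__main__":
    print("passes naive filter:", looks_harmless(PAYLOAD))
    print("decoded as UTF-7:   ", decode_as_utf7(PAYLOAD))
```

<p>Declaring the encoding explicitly, for example in the <code>Content-Type</code> header, removes the guesswork this attack relies on.</p>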
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[I am sorry!]]></title>
<link>http://madharasan.wordpress.com/2009/09/28/i-am-sorry/</link>
<pubDate>Mon, 28 Sep 2009 15:20:47 +0000</pubDate>
<dc:creator>Jayarathina Madharasan</dc:creator>
<guid>http://madharasan.wordpress.com/2009/09/28/i-am-sorry/</guid>
<description><![CDATA[Yeap! I know it has been a very long time since I posted anything in this corner of the web. I am]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p style="text-align:justify;">Yeap! I know it has been a very long time since I posted anything in this corner of the web. I am extremely sorry; I have been quite busy with my college work. In the near future, too, there is little hope that I will post much.&#160; I am extremely sorry for that&#8230;</p>
<p style="text-align:justify;">But I assure you that I&#8217;ll post at least once a week&#8230;</p>
<p style="text-align:justify;">I am also partly busy because I am currently trying to make my Nokia N72 mobile, which doesn&#8217;t have native Unicode support, support Unicode (Tamil). It is making me rack my brain. If you have any ideas or suggestions, please contact me.</p>
<p style="text-align:justify;">That&#8217;s all for now&#8230; Have a nice day&#8230;</p>
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[Unicode in SAP]]></title>
<link>http://naveenvishal.wordpress.com/2009/08/10/unicode-in-sap/</link>
<pubDate>Mon, 10 Aug 2009 17:18:40 +0000</pubDate>
<dc:creator>naveenvishal</dc:creator>
<guid>http://naveenvishal.wordpress.com/2009/08/10/unicode-in-sap/</guid>
<description><![CDATA[To have a better idea about what is Unicode, let us first look at what is a CodePage. CodePage A Cod]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p>To get a better idea of what Unicode is, let us first look at what a <strong>code page</strong> is.</p>
<p><span style="color:#339966;"><strong>CodePage</strong></span></p>
<p>A code page is another name for a <strong>character encoding</strong>. (A character encoding system consists of a code that pairs each character from a given repertoire with something else, such as a sequence of natural numbers, octets or electrical pulses, in order to facilitate the transmission of data (generally numbers and/or text) through telecommunication networks or the storage of text in computers.)</p>
<div id="attachment_683" class="wp-caption aligncenter" style="width: 460px"><img class="size-full wp-image-683" title="Code Page" src="http://naveenvishal.wordpress.com/files/2009/08/unicode.gif" alt="Code Page" width="450" height="490" /><p class="wp-caption-text">Code Page</p></div>
<p><strong><span style="color:#339966;">Some relevant Terms:</span></strong></p>
<p><strong>Character:</strong> a, b, c, A, B,&#8230;</p>
<p><strong>Coded character:</strong> A=65, B=66,&#8230;</p>
<p><strong>Character Set:</strong> A set of characters, to be used together (e.g. Latin alphabet)</p>
<p><strong>Code page:</strong> A set of coded characters (e.g. ISO-8859-1, Shift-JIS)</p>
<p><strong>Locale:</strong> Code page + properties and rules (e.g. isdigit, collation, &#8230;)</p>
<p><strong><span style="color:#339966;">Why the Need for Unicode</span></strong></p>
<p>Every standard code page supports only a certain group of languages (e.g. Western European, Eastern European, Japanese).</p>
<p>Within one computer system, only one code page can be supported cleanly. A universal code page that supports all letters, punctuation marks, technical symbols etc. of all languages is therefore required.</p>
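<p>A small sketch in Python of why one code page per system is limiting. It is not part of the original article, and the helper name <code>decode_with</code> is made up. The same byte means a different character depending on which code page is assumed:</p>

```python
# One and the same byte, interpreted under three different code pages.
RAW = bytes([0xE4])

CODE_PAGES = ("iso-8859-1", "cp1251", "iso-8859-7")  # Western European, Cyrillic, Greek


def decode_with(code_page: str) -> str:
    """Decode the single byte RAW using the named code page."""
    return RAW.decode(code_page)


if __name__ == "__main__":
    for code_page in CODE_PAGES:
        char = decode_with(code_page)
        print(f"{code_page:11} byte 0xE4 is U+{ord(char):04X} {char}")
    # A "coded character" in the sense of the term list above: A = 65.
    print("A =", ord("A"))
```

<p>Unicode gives each of these three characters its own number, so they can appear side by side in one text.</p>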
<p><strong><span style="color:#339966;">Unicode Features</span></strong></p>
<p>Unicode is a superset of all existing character sets. Unicode encodes plain text (no rendering information). It defines characters, not glyphs (semantics, not visual representation). Unicode unifies characters used in different scripts (CJK* Unification; CJK= Chinese, Japanese, Korean).</p>
<p>Unicode has room for more than 1,000,000 characters (1,114,112 code points in total). In UTF-16, roughly 64,000 of them are coded by a single 16-bit code unit. All further characters are coded by two 16-bit code units (a surrogate pair).</p>
<div id="attachment_684" class="wp-caption aligncenter" style="width: 460px"><img class="size-full wp-image-684" title="Unicode - Detailed view" src="http://naveenvishal.wordpress.com/files/2009/08/unicode_details.jpg" alt="Unicode - Detailed view" width="450" height="283" /><p class="wp-caption-text">Unicode - Detailed view</p></div>
<p><strong><span style="color:#339966;">Unicode Encoding Forms</span></strong></p>
<p><strong>UTF-8</strong></p>
<p>byte-based encoding scheme; one character is coded with 1-4 bytes; compatible with 7-bit ASCII.</p>
<p><strong>UTF-16</strong></p>
<p>16-bit units; frequently used characters occupy one 16-bit unit; all other characters are coded with two 16-bit units</p>
<p><strong>UTF-32</strong></p>
<p>32-bit units; fixed size for all characters</p>
<p>Note: All encoding forms support the same number of characters.</p>
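<p>The three encoding forms can be compared with a short Python sketch. It is an illustration only, not SAP code; the sample characters and the helper <code>unit_counts</code> are my own choice. For each character it counts how many 8-bit, 16-bit and 32-bit units the three forms need:</p>

```python
# Sample characters from different ranges, labelled by code point.
SAMPLES = {
    "U+0041": "\u0041",       # ASCII letter
    "U+00E9": "\u00e9",       # Latin letter with accent
    "U+20AC": "\u20ac",       # euro sign
    "U+10177": "\U00010177",  # supplementary character, needs surrogates in UTF-16
}


def unit_counts(char: str) -> tuple:
    """Return (UTF-8 bytes, UTF-16 16-bit units, UTF-32 32-bit units) for one character."""
    utf8 = len(char.encode("utf-8"))
    utf16 = len(char.encode("utf-16-le")) // 2
    utf32 = len(char.encode("utf-32-le")) // 4
    return utf8, utf16, utf32


if __name__ == "__main__":
    for label, char in SAMPLES.items():
        print(label, unit_counts(char))
```

<p>UTF-8 grows from one to four bytes, UTF-16 needs one or two units, and UTF-32 always needs exactly one.</p>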
<p><strong><span style="color:#339966;">Unicode in SAP</span></strong></p>
<p><strong>UTF-8</strong></p>
<p>for external communication (e.g. file, network); no endian problems; minimum average data size; limited backward compatibility with non-Unicode systems.</p>
<p><strong>UTF-16</strong></p>
<p>internal (in memory); best compromise between memory usage and algorithmic complexity; fits the Java and Microsoft environments; best way to migrate existing ABAP and C programs.</p>
<div id="_mcePaste" style="position:absolute;left:-10000px;top:1892px;width:1px;height:1px;">© SAP 2008 / Page 15</div>
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[No Web for you and me]]></title>
<link>http://sahwar.wordpress.com/2009/02/13/no-web-for-you-and-me/</link>
<pubDate>Fri, 13 Feb 2009 18:52:03 +0000</pubDate>
<dc:creator>Sah War</dc:creator>
<guid>http://sahwar.wordpress.com/2009/02/13/no-web-for-you-and-me/</guid>
<description><![CDATA[What follows is yet another apology message, brought about by the lack of an internet connection and of a computer, as well as ]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p><em>What follows is yet another apology message, brought about by the lack of an internet connection and of a computer, as well as by the contradictions in the author&#8217;s daily existence and thinking amid the searchings and impulses of the other hominids.</em><br />
This Thursday our family got a new ISP service in place of the old one. The service is monthly internet access with unlimited traffic, Евроком НЕТ is the name, and the plan is the cheapest on the list. The fact is that it does a good job; access is through a cable branched off from the cable-TV line and ending in a modem, which in turn continues with yet another cable to the computer. <img src='http://s.wordpress.com/wp-includes/images/smilies/icon_biggrin.gif' alt=':D' class='wp-smiley' />  Fortunately, our main machine arrived home safely after a one-day service repair by Multirama Plovdiv, from whom we bought this PC configuration in June 2008. At first we thought the problem was the fans, because of a horrible grinding noise, but it apparently turned out that the video card had also been dangerously dislodged, or something like that. It sounded serious, but the technicians fixed it in less than a day and now my helper is back in action.<br />
Meanwhile I noticed that someone</p>
<blockquote><p>Thanks to dino kokalis for Bulgarian translation.</p></blockquote>
<p>has submitted their <a href="http://www.photoscape.org/ps/main/help_translate.php">Bulgarian translation file</a> for <a href="http://www.photoscape.org/ps/main/index.php">PhotoScape</a>, and this translation is <a href="http://www.photoscape.org/ps/main/download.php?update=on">included</a> in the official <a href="http://www.photoscape.org/ps/main/history.php">version 3.3</a> of the program. Evidently someone beat me to it, discovered yet another good free/open-source program and hurried to Bulgarianize it&#8230; After a quick look through the <a href="http://download.photoscape.org/translate/bg/lang_3_3.txt">translation</a><br />
(it may require Character Encoding = Cyrillic (Windows-1251) for the text to render correctly) and a comparison with the <a href="http://download.photoscape.org/translate/en/lang_3_3.txt">original English strings</a>, it became clear to me that the author of the Bulgarian strings made an effort, but there seems to be room for corrections here and there &#8211; in places there are spelling mistakes, and in others I can think of more suitable ways to translate some of the messages that pop up when hovering with the cursor (the so-called ToolTips messages), as well as some other, more important places. Maybe I will fix these small omissions and/or send my own version of the translation (of the text interface of the program&#8217;s menus; in short &#8211; localization/internationalization) soon.<br />
Until then, shouldn&#8217;t I take up translating the English strings from the English language file of the free Greenfish Icon Editor Pro?   <img src='http://s.wordpress.com/wp-includes/images/smilies/icon_question.gif' alt=':?:' class='wp-smiley' />  I will probably be slow again, but I will do it. After all, there is no perfectly localized language file; we all learn and improve gradually. <img src='http://s.wordpress.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' />  The HUTPIB project is stalled, but there will always be patriotic folk willing to localize, for free, a free program that they like and use with pleasure (or out of need for some specific functionality). <img src='http://s.wordpress.com/wp-includes/images/smilies/icon_biggrin.gif' alt=':D' class='wp-smiley' /><br />
<span style="color:#ff0000;">EDIT</span><span style="color:#ff0000;">(16 April 2009):</span> I have started fixing the errors in the existing translation file. I will mention it here when I am done and will send the file to the PhotoScape team, who, if they are willing, will make it available for download from their translations section. Of course, once I have finished with that, I will return to the translation from time to time to refine it and remove errors. <img src='http://s.wordpress.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /><br />
Progress &#8211; 133/1046 lines completed. <img src='http://s.wordpress.com/wp-includes/images/smilies/icon_rolleyes.gif' alt=':roll:' class='wp-smiley' /> </p>
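<p>A short Python sketch of why the translation file needs the Cyrillic (Windows-1251) setting. The sample word and the helper names <code>save_as_cp1251</code> and <code>open_with</code> are made up for illustration and have nothing to do with PhotoScape itself. A file saved as Windows-1251 reads correctly only when it is opened with that same encoding:</p>

```python
WORD = "превод"  # Bulgarian for "translation"


def save_as_cp1251(text: str) -> bytes:
    """Bytes as they would sit in a language file saved as Windows-1251."""
    return text.encode("cp1251")


def open_with(raw: bytes, encoding: str) -> str:
    """Text as a viewer would show it when assuming the given encoding."""
    return raw.decode(encoding)


if __name__ == "__main__":
    raw = save_as_cp1251(WORD)
    print("bytes on disk:     ", raw.hex())
    print("read as cp1251:    ", open_with(raw, "cp1251"))
    print("read as iso-8859-1:", open_with(raw, "iso-8859-1"))
```

<p>Read as a Western European encoding, the same six bytes come out as accented Latin letters instead of Cyrillic.</p>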
<p>♂♥♦→☻¢♠&#60;œ•••☻</p>
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[Change Windows Command Prompt Code Page]]></title>
<link>http://mingaz.wordpress.com/2008/12/15/change-windows-command-prompt-code-page/</link>
<pubDate>Mon, 15 Dec 2008 11:24:33 +0000</pubDate>
<dc:creator>mingaz</dc:creator>
<guid>http://mingaz.wordpress.com/2008/12/15/change-windows-command-prompt-code-page/</guid>
<description><![CDATA[I sometimes have to code small programs to extract information from data sources like databases or L]]></description>
<content:encoded><![CDATA[I sometimes have to code small programs to extract information from data sources like databases or L]]></content:encoded>
</item>
<item>
<title><![CDATA[Black Hat Japan 2008 Presentations]]></title>
<link>http://infosecphils.wordpress.com/2008/11/25/black-hat-japan-2008-presentations/</link>
<pubDate>Tue, 25 Nov 2008 07:47:12 +0000</pubDate>
<dc:creator>Jaime Raphael Licauco, CISSP, GSEC</dc:creator>
<guid>http://infosecphils.wordpress.com/2008/11/25/black-hat-japan-2008-presentations/</guid>
<description><![CDATA[Keynote &#8211; Black Ops of DNS 2008 : Its The End Of The Cache As We Know It by Dan Kaminsky API s]]></description>
<content:encoded><![CDATA[Keynote &#8211; Black Ops of DNS 2008 : Its The End Of The Cache As We Know It by Dan Kaminsky API s]]></content:encoded>
</item>
<item>
<title><![CDATA[Change Notes Font On iPhone/iPod touch]]></title>
<link>http://theappleblog.com/2008/10/28/change-notes-font-on-iphone-ipod-touch/</link>
<pubDate>Tue, 28 Oct 2008 19:00:01 +0000</pubDate>
<dc:creator>Clayton Lai</dc:creator>
<guid>http://theappleblog.com/2008/10/28/change-notes-font-on-iphone-ipod-touch/</guid>
<description><![CDATA[I really, really detest the Marker Felt typeface used in the Notes application in the iPod touch and]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p class="excerpt">I really, really detest the Marker Felt typeface used in the Notes application in the iPod touch and iPhone. It is hard to read, hard to edit points into, and looks plain childish.</p>
<p>Fortunately, I stumbled upon a way to easily have your notes displayed with <span style="text-decoration: line-through;">Arial</span> Helvetica. And, no, you do not need <a href="http://www.iphoneskinning.com/2008/03/fontswap-change-iphoneipod-touch-system-lock-and-notes-font.html" target="_blank">FontSwap</a>, jailbreaking or <a href="http://daringfireball.net/2007/09/marker_felt_iphone" target="_blank">any complicated deep system maneuvering</a>. All you need is right there in the iPhone OS.</p>
<p><img class="alignnone size-full wp-image-8655 styled" title="8647-img-0011.png" src="http://gigapple.files.wordpress.com/2008/10/8647-img-0011.png?w=320&#038;h=480" alt="" width="320" height="480" /></p>
<p>For this walkthrough I am using a second-generation iPod touch with firmware version 2.1.1 loaded. This also works with the iPhone.<br />
<!--more--></p>
<ol>
<li>In the Home Screen, tap on <strong>Settings</strong>.<br />
<img class="alignnone size-full wp-image-8779 styled" title="font_settings" src="http://gigapple.files.wordpress.com/2008/10/font_settings.png?w=320&#038;h=480" alt="" width="320" height="480" /></li>
<li>Tap on <strong>General</strong>, followed by <strong>Keyboard</strong> (it is near the bottom of the page).<br />
<img class="alignnone size-full wp-image-8775 styled" title="font_general" src="http://gigapple.files.wordpress.com/2008/10/font_general.png?w=320&#038;h=480" alt="" width="320" height="480" /><br />
<img class="alignnone size-full wp-image-8778 styled" title="font_keyboard" src="http://gigapple.files.wordpress.com/2008/10/font_keyboard.png?w=320&#038;h=480" alt="" width="320" height="480" /></li>
<li>Tap on <strong>International Keyboards</strong>.<br />
<img class="alignnone size-full wp-image-8777 styled" title="font_international" src="http://gigapple.files.wordpress.com/2008/10/font_international.png?w=320&#038;h=480" alt="" width="320" height="480" /></li>
<li>Scroll down to the end of the page and tap on <strong>Chinese (Simplified)</strong>.<br />
<img class="alignnone size-full wp-image-8774 styled" title="font_chinese" src="http://gigapple.files.wordpress.com/2008/10/font_chinese.png?w=320&#038;h=480" alt="" width="320" height="480" /></li>
<li>In the page that comes up, you will see two settings, <strong>Handwriting</strong> and <strong>Pinyin</strong>. Turn both settings on.<br />
<img class="alignnone size-full wp-image-8776 styled" title="font_handwritingpinyin" src="http://gigapple.files.wordpress.com/2008/10/font_handwritingpinyin.png?w=320&#038;h=480" alt="" width="320" height="480" /></li>
<li>Go back to the Notes application. Create a new note or tap on an existing note to open it. If you are creating a new note, write a word or two so that you can easily verify that the typeface changes.</li>
<li>Tap in the note area to bring up the on-screen keyboard. Tap on the button to the left of the space bar, the one with <strong>an icon of a globe</strong>. The keyboard layout will change to one meant for Chinese handwriting recognition input. Don&#8217;t worry; just pay attention to the next step.<br />
<img class="styled" title="8647_IMG_0008.PNG" src="http://gigapple.files.wordpress.com/2008/10/8647-img-0008.png?w=320&#038;h=480" alt="8647_IMG_0008.PNG" width="320" height="480" /></li>
<li>On the left of the input area, you will see four rows of buttons. The first button at the top is the Backspace. The <strong>second button</strong> below it is what we are interested in. For those of you who do not understand Chinese, the button says &#8216;Space&#8217;.<br />
<img class="styled" title="8647_IMG_0009.PNG" src="http://gigapple.files.wordpress.com/2008/10/8647-img-0009.png?w=320&#038;h=480" alt="8647_IMG_0009.PNG" width="320" height="480" /></li>
<li>Tap on this button. The typeface of your note will change from Marker Felt to <span style="text-decoration: line-through;">Arial</span> Helvetica! But do read on! The next step is very important.<br />
<img class="styled" title="8647_IMG_0010.PNG" src="http://gigapple.files.wordpress.com/2008/10/8647-img-0010.png?w=320&#038;h=480" alt="8647_IMG_0010.PNG" width="320" height="480" /></li>
<li><strong>Tap the globe button twice</strong>, or if you have many input languages set up, tap it until the space bar flashes the words &#8216;English (US)&#8217; (or the name of your native language). Your input language is now back to the default, and you can continue to edit your note. Even if you close your note, the typeface will remain Helvetica. That&#8217;s it!</li>
</ol>
<p>Unfortunately, you will have to repeat the last few steps (7-10) for each new note you create, but this method does keep you from having to modify core iPhone files.</p>
<h3>Care to know what just happened?</h3>
<p>There are two ways to input Chinese characters on a computer, either by handwriting recognition or by a method known as Pinyin, in which a user spells out the phonetic pronunciation of Chinese words with the Latin alphabet on a standard keyboard. Each resulting Chinese character is then encoded—popularly with an encoding method called <a href="http://en.wikipedia.org/wiki/Unicode">Unicode</a>—into a document. This goes for the input of any other non-Western language as well.</p>
<p>Encoding non-Latin characters into a document requires a <a href="http://en.wikipedia.org/wiki/Unicode_typefaces#List_of_Unicode_fonts">compatible font</a>. Luckily for us, Marker Felt is not a Unicode font. So, by picking a non-Latin input method, we are forcing the input engine to switch to a Unicode font such as <span style="text-decoration: line-through;">Arial</span> Helvetica so that it can display both Latin and non-Latin characters correctly. If you would like to know more about the magic that goes on in the background regarding character encoding, Wikipedia has <a href="http://en.wikipedia.org/wiki/Character_encodings">an entry</a> on the topic.</p>
<p><em>Note: This article has been corrected since publication.</em></p>
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[HTML Codes, character encoding...]]></title>
<link>http://pimuri.wordpress.com/2008/09/19/html-codes-character-encoding/</link>
<pubDate>Fri, 19 Sep 2008 21:09:09 +0000</pubDate>
<dc:creator>pimuri</dc:creator>
<guid>http://pimuri.wordpress.com/2008/09/19/html-codes-character-encoding/</guid>
<description><![CDATA[Have you ever had problems finding HTML codes for special characters like these (♠ ♣ ♥ ♦ ) fellas? O]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p>Have you ever had problems finding HTML codes for special characters like these (♠ ♣ <span style="color:#ff0000;">♥ ♦</span> ) fellas? Or have you searched for simple ASCII codes or Unicode encodings? I found this page called <a title="lookuptables.com" href="http://www.lookuptables.com/">lookuptables.com</a> yesterday and it has quite a few different tables you can easily print out or visit when you need to look something up. It even has stuff like the encoding for <a title="Braille - wikipedia" href="http://en.wikipedia.org/wiki/Braille">braille</a>. It&#8217;s quite handy because all those different code tables are on one website, easily accessible. Check it out.</p>
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[Chameleon ...................................]]></title>
<link>http://abhyudaya.wordpress.com/2008/02/06/%e0%a4%97%e0%a4%bf%e0%a4%b0%e0%a4%97%e0%a4%bf%e0%a4%9f/</link>
<pubDate>Wed, 06 Feb 2008 20:43:00 +0000</pubDate>
<dc:creator>kuldeepsingh</dc:creator>
<guid>http://abhyudaya.wordpress.com/2008/02/06/%e0%a4%97%e0%a4%bf%e0%a4%b0%e0%a4%97%e0%a4%bf%e0%a4%9f/</guid>
<description><![CDATA[The chameleon: you all surely know this strange and wonderful creature &#8230;&#8230;&#8230;&#8230;&#8230;&#8230;]]></description>
<content:encoded><![CDATA[The chameleon: you all surely know this strange and wonderful creature &#8230;&#8230;&#8230;&#8230;&#8230;&#8230;]]></content:encoded>
</item>
<item>
<title><![CDATA[PHP and UTF-8]]></title>
<link>http://webzeug.wordpress.com/2007/12/14/php-und-utf-8/</link>
<pubDate>Fri, 14 Dec 2007 20:06:41 +0000</pubDate>
<dc:creator>webzeug</dc:creator>
<guid>http://webzeug.wordpress.com/2007/12/14/php-und-utf-8/</guid>
<description><![CDATA[Up to version 5, the PHP programming language handles only 1-byte characters. Strings that contain U]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p>Up to version 5, the PHP programming language handles only 1-byte characters. Strings that contain UTF-8 encoded characters, for example, are often processed incorrectly by the <a href="http://www.php.net/manual/de/ref.strings.php">standard string functions</a>.</p>
<p>The function <a href="http://www.php.net/manual/de/function.utf8-decode.php">utf8_decode()</a> offers a rudimentary way to convert UTF-8 strings into 1-byte (ISO-8859-1) strings. Information can be lost in the process, however, because characters that do not exist in ISO-8859-1 cannot be represented.<br />
With the <a href="http://www.php.net/manual/de/ref.mbstring.php">multibyte string functions</a>, PHP does offer a powerful collection of functions for processing UTF-8 strings and the like. Together with regular expressions, they cover most string manipulations. To use the multibyte string functions, PHP must be compiled with the mbstring extension.</p>
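<p>The difference between bytes and characters can be shown with a short sketch. It is written in Python rather than PHP, and the helper names are made up; <code>to_single_byte</code> only approximates what utf8_decode() does and is not its actual implementation.</p>

```python
TEXT = "Grüße"  # 5 characters, two of them outside ASCII


def byte_length(text: str) -> int:
    """Length that a byte-oriented function such as PHP's strlen() reports for UTF-8 input."""
    return len(text.encode("utf-8"))


def char_length(text: str) -> int:
    """Length that a multibyte-aware function reports."""
    return len(text)


def to_single_byte(text: str) -> str:
    """Rough stand-in for utf8_decode(): keep what ISO-8859-1 can hold, replace the rest with '?'."""
    return text.encode("iso-8859-1", errors="replace").decode("iso-8859-1")


if __name__ == "__main__":
    print(byte_length(TEXT), "bytes,", char_length(TEXT), "characters")
    print(to_single_byte("Grüße, 5 €"))
```

<p>The umlaut and the sharp s survive the conversion because ISO-8859-1 contains them; the euro sign does not, which is the information loss mentioned above.</p>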
<p>An alternative is the <a href="http://dev.splitbrain.org/view/darcs/dokuwiki/inc/utf8.php">script collection by the splitbrain developers</a>, whose functions <a href="http://www.sitepoint.com/blogs/2006/08/10/hot-php-utf-8-tips/">simulate</a> the classic string functions and can process UTF-8 strings without depending on the mbstring extension.</p>
<p>Since the collection has no replacement for str_word_count(), I wrote one.<br />
utf8_str_word_count():<br />
<code><br />
/**<br />
* Works like str_word_count() for UTF-8 strings<br />
* does not support locale information<br />
* does not exclude "'" or "-" at the beginning of a word<br />
* letters outside A-Z and a-z count only if listed in $charlist<br />
* returns null if $format is not 0, 1 or 2<br />
*/<br />
function utf8_str_word_count($string,$format=0,$charlist='') {<br />
  // split on runs of non-word characters; preg_quote() keeps $charlist from breaking the character class<br />
  $array = preg_split("/[^'\-A-Za-z".preg_quote($charlist,'/')."]+/u",$string,-1,PREG_SPLIT_NO_EMPTY);<br />
  switch ($format) {<br />
    case 0: // number of words<br />
      return(count($array));<br />
    case 1: // list of words<br />
      return($array);<br />
    case 2: // words keyed by their character offset in $string<br />
      $posarray = array();<br />
      $pos = 0;<br />
      foreach ($array as $value) {<br />
        $pos = utf8_strpos($string,$value,$pos);<br />
        $posarray[$pos] = $value;<br />
        $pos += utf8_strlen($value);<br />
      }<br />
      return($posarray);<br />
  }<br />
  return null;<br />
}<br />
</code></p>
<p>Note that the function uses further UTF-8 functions (utf8_strpos and utf8_strlen) from the <a href="http://dev.splitbrain.org/view/darcs/dokuwiki/inc/utf8.php">splitbrain script collection</a>.</p>
<p>How this function came about is documented in <a href="http://www.selfphp.de/forum/showthread.php?s=aa2a2811f70e94a042c9049c6f5cddbe&#38;p=109780#post109780">this forum post</a>.</p>
<p>nicknettleton also has <a href="http://www.nicknettleton.com/zine/php/php-utf-8-cheatsheet">important tips</a> for switching the entire system, including the database, to UTF-8.</p>
<p>A string is a sequence of characters. A programming language can store it in variables and process it further with functions.<br />
There are, however, different ways of storing characters, known as <a href="http://www.sitepoint.com/blogs/2006/03/15/do-you-know-your-character-encodings/">character encodings</a>. If a string is stored as ASCII, for instance, it can only contain characters from the ASCII table. Each character then takes up one byte of storage.</p>
<p>Because the world&#8217;s languages use a <a href="http://www.joelonsoftware.com/articles/Unicode.html">wealth of very different characters</a>, first the ISO standards and later Unicode were developed to standardize how they are stored.<br />
UTF-8 is the current standard for storing characters. It covers a large number of characters from many languages. Each character is stored in one to four bytes.</p>
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[A tutorial on character code issues]]></title>
<link>http://objsam.wordpress.com/2007/11/16/a-tutorial-on-character-code-issues/</link>
<pubDate>Fri, 16 Nov 2007 14:18:26 +0000</pubDate>
<dc:creator>Syed Aslam</dc:creator>
<guid>http://objsam.wordpress.com/2007/11/16/a-tutorial-on-character-code-issues/</guid>
<description><![CDATA[If you are looking for some quick help in using a large character repertoire in HTML authoring, see ]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p class="MsoNormal"><span style="font-size:10pt;line-height:115%;font-family:'Courier New';">If you are looking for some quick help in using a large character repertoire in HTML authoring, see the document <a href="http://www.cs.tut.fi/~jkorpela/html/chars.html">Using national and special characters in HTML</a>.</span></p>
<p class="MsoNormal"><span style="font-size:10pt;line-height:115%;font-family:'Courier New';">Several technical terms related to character sets (e.g. glyph, encoding) can be difficult to understand, due to various confusions and due to having different names in different languages and contexts. The <a href="http://iate.europa.eu/iatediff/SearchByQueryLoad.do;jsessionid=9ea7991c30d8ce28a31f34f4472fb0d1f75f05235297.e3iLbNeKc3mSe3aNbxuQa3eTay0?method=load">EuroDicAutom</a> online database can be useful: it contains translations and definitions for several technical terms used here.</span></p>
<p class="MsoNormal"><span style="font-size:10pt;line-height:115%;font-family:'Courier New';">-&#62; <a href="http://www.cs.tut.fi/~jkorpela/chars.html">tutorial</a> &#60;-</span></p>
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[Confirmed my understanding]]></title>
<link>http://krsethur.wordpress.com/2006/03/09/confirmed-my-understanding/</link>
<pubDate>Thu, 09 Mar 2006 18:27:19 +0000</pubDate>
<dc:creator>Krishnamoorthy Sethuraman</dc:creator>
<guid>http://krsethur.wordpress.com/2006/03/09/confirmed-my-understanding/</guid>
<description><![CDATA[In the earlier blog, the character not being displayed properly by the browser is the browser issue ]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p>In the earlier post, the character not being displayed properly turned out to be a browser issue; the code that deals with the supplementary character is correct. The link http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=10177 shows the character that should be displayed in the browser (the one present in the XML file when it is opened in a browser). So, to work with supplementary characters, create a new String either by passing a char[] containing the high and low surrogates, or by passing a byte[] together with its encoding. The byte[] for any code point can be obtained using this link:<br />
<a href="http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=10177">Byte for codepoint</a></p>
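<p>Both routes can be sketched in Python instead of Java. The function names are made up for illustration, and the surrogate and byte values used in the example are the ones that correspond to code point 0x10177:</p>

```python
CODE_POINT = 0x10177  # the supplementary code point from the link above


def from_surrogates(high: int, low: int) -> str:
    """Build the character from its two UTF-16 code units, high surrogate first."""
    raw = high.to_bytes(2, "big") + low.to_bytes(2, "big")
    return raw.decode("utf-16-be")


def from_utf8_bytes(raw: bytes) -> str:
    """Build the character from its UTF-8 byte sequence."""
    return raw.decode("utf-8")


if __name__ == "__main__":
    via_units = from_surrogates(0xD800, 0xDD77)
    via_bytes = from_utf8_bytes(bytes.fromhex("f09085b7"))
    print(via_units == via_bytes == chr(CODE_POINT))
```

<p>Either way the result is the same single character, whichever representation it was built from.</p>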
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[Some code to illustrate the unicode support by Character class]]></title>
<link>http://krsethur.wordpress.com/2006/03/09/some-code-to-illustrate-the-unicode-support-by-character-class/</link>
<pubDate>Thu, 09 Mar 2006 17:35:34 +0000</pubDate>
<dc:creator>Krishnamoorthy Sethuraman</dc:creator>
<guid>http://krsethur.wordpress.com/2006/03/09/some-code-to-illustrate-the-unicode-support-by-character-class/</guid>
<description><![CDATA[public class UnicodeTest{ public static void main(String... args) throws IOException { System.ou]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p>import java.io.BufferedWriter;<br />
import java.io.FileOutputStream;<br />
import java.io.IOException;<br />
import java.io.OutputStreamWriter;</p>
<p>public class UnicodeTest{<br />
public static void main(String... args) throws IOException<br />
{<br />
// 0x10FFFF is the largest valid code point; 0x20FFFF is out of range<br />
System.out.println("is valid codepoint : " + Character.isValidCodePoint(0x10FFFF));<br />
System.out.println("is valid codepoint : " + Character.isValidCodePoint(0x20FFFF));<br />
int cp = 0x10177;<br />
System.out.println("is valid codepoint : " + Character.isValidCodePoint(cp));<br />
// toChars returns the high surrogate first, then the low surrogate<br />
char[] ch = Character.toChars(cp);<br />
int high = ch[0];<br />
int low = ch[1];<br />
System.out.println("High Surrogate : " + high + " Hexadecimal : " + Integer.toHexString(high) + " Binary String : " + Integer.toBinaryString(high));<br />
System.out.println("Low Surrogate : " + low + " Hexadecimal : " + Integer.toHexString(low) + " Binary String : " + Integer.toBinaryString(low));<br />
String st = new String(ch);<br />
// write the supplementary character to a file encoded as UTF-8<br />
BufferedWriter out = new BufferedWriter(new OutputStreamWriter(new FileOutputStream("krishna.xml"), "UTF-8"));<br />
out.write("");<br />
out.write("");<br />
out.write(st);<br />
out.write("");<br />
out.close();<br />
}<br />
}</p>
<p>In the above case, the supplementary character written to the XML file did not appear as intended. I don&#8217;t know whether this is a limitation of the browser in displaying supplementary characters or a mistake in the way the Java code handles them.</p>
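<p>One way to separate the two possibilities is to check the written bytes without involving a browser at all. The following is only a sketch: it writes to memory rather than to krishna.xml, assumes Java 8 or later, and the class and method names are made up for illustration.</p>

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

class RoundTripCheck {
    // Encode text through an OutputStreamWriter, as the program above does,
    // but into memory so the resulting bytes can be inspected.
    static byte[] writeUtf8(String text) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (Writer out = new OutputStreamWriter(sink, StandardCharsets.UTF_8)) {
            out.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        String original = new String(Character.toChars(0x10177));
        byte[] bytes = writeUtf8(original);
        // Decode the bytes again and compare code points, not glyphs.
        String decoded = new String(bytes, StandardCharsets.UTF_8);
        System.out.println("bytes written   : " + bytes.length);
        System.out.println("same code point : " + (decoded.codePointAt(0) == 0x10177));
    }
}
```

<p>If the code point survives the round trip, the Java side is doing its job and any wrong glyph comes from how the file is displayed.</p>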
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[Some more understandings on java unicode support]]></title>
<link>http://krsethur.wordpress.com/2006/03/09/some-more-understandings-on-java-unicode-support/</link>
<pubDate>Thu, 09 Mar 2006 16:07:50 +0000</pubDate>
<dc:creator>Krishnamoorthy Sethuraman</dc:creator>
<guid>http://krsethur.wordpress.com/2006/03/09/some-more-understandings-on-java-unicode-support/</guid>
<description><![CDATA[From the start, java had used UTF-16 encoding for encoding the characters. Thus, in the earlier stag]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p>From the start, Java has used UTF-16 to encode characters. In the early days, when the Unicode character set was limited to 16 bits, the Java char type, which holds one 16-bit UTF-16 code unit, supported it fully. Once Unicode was extended to cover the range up to U+10FFFF, a single 16-bit char could no longer represent characters above U+FFFF. Hence, in J2SE 5, support for these characters was added through the Character class. The primitive char still supports only characters up to code point U+FFFF. The Java 2 platform uses the UTF-16 representation in char arrays and in the String and StringBuffer classes. In this representation, supplementary characters are represented as a pair of char values, the first from the high-surrogates range (\uD800-\uDBFF), the second from the low-surrogates range (\uDC00-\uDFFF).</p>
<p>A char value, therefore, represents Basic Multilingual Plane (BMP) code points, including the surrogate code points, or code units of the UTF-16 encoding. An int value represents all Unicode code points, including supplementary code points. The lower (least significant) 21 bits of int are used to represent Unicode code points and the upper (most significant) 11 bits must be zero. Unless otherwise specified, the behavior with respect to supplementary characters and surrogate char values is as follows:</p>
<p>* The methods that only accept a char value cannot support supplementary characters. They treat char values from the surrogate ranges as undefined characters. For example, Character.isLetter('\uD840') returns false, even though this specific value, if followed by any low-surrogate value in a string, would represent a letter.<br />
* The methods that accept an int value support all Unicode characters, including supplementary characters. For example, Character.isLetter(0x2F81A) returns true because the code point value represents a letter (a CJK ideograph).</p>
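<p>The difference between the char and int overloads can be seen in a short sketch. The class name and the countLetters helper are hypothetical; the Character and String methods are the standard ones added in J2SE 5.</p>

```java
class CharVersusCodePoint {
    // Count letters by walking code points rather than individual chars.
    static int countLetters(String s) {
        int letters = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (Character.isLetter(cp)) {
                letters++;
            }
            i += Character.charCount(cp); // advance by 1 or 2 char values
        }
        return letters;
    }

    public static void main(String[] args) {
        // U+2F81A is a supplementary CJK ideograph, stored as two char values.
        String s = new String(Character.toChars(0x2F81A));
        System.out.println("char values  : " + s.length());
        System.out.println("code points  : " + s.codePointCount(0, s.length()));
        // The char overload only sees a lone high surrogate.
        System.out.println("isLetter(char): " + Character.isLetter(s.charAt(0)));
        // The int overload sees the whole code point.
        System.out.println("isLetter(int) : " + Character.isLetter(s.codePointAt(0)));
        System.out.println("letters in a+ideograph+1 : " + countLetters("a" + s + "1"));
    }
}
```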
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[Back to Unicode support in java]]></title>
<link>http://krsethur.wordpress.com/2006/03/09/back-to-unicode-support-in-java/</link>
<pubDate>Thu, 09 Mar 2006 15:18:46 +0000</pubDate>
<dc:creator>Krishnamoorthy Sethuraman</dc:creator>
<guid>http://krsethur.wordpress.com/2006/03/09/back-to-unicode-support-in-java/</guid>
<description><![CDATA[Again got confused in unicode. Some of the terms used are: 1. Coded Character Set A character Set(co]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p>I got confused about Unicode again, so here are some of the terms used:<br />
<b><u>1. Coded Character Set</u></b><br />
A character set (a collection of characters) in which each character has been assigned a unique number. E.g., the Unicode character set, where every character is assigned a number that is conventionally written in hexadecimal.<br />
<b><u>2. Code Points</u></b><br />
The numbers that can be used in a coded character set. Valid code points for the Unicode character set are U+0000 to U+10FFFF (Unicode 4.0 standard).<br />
<b><u>3. Supplementary Characters</u></b><br />
Characters that could not be represented in the original 16-bit design of Unicode. U+0000 to U+FFFF are referred to as the Basic Multilingual Plane (BMP); characters with code points above U+FFFF are supplementary characters.<br />
<b><u>4. Character Encoding Scheme</u></b><br />
A mapping from the numbers of one or more coded character sets to sequences of one or more fixed-width code units, e.g., UTF-32, UTF-16, and UTF-8.<br />
<b><u>5. Character Encoding</u></b><br />
A mapping from a set of characters to sequences of code units, e.g., UTF-8, ISO-8859-1, GB18030, Shift_JIS.</p>
<p><u><b>UTF-16</b></u><br />
UTF-16 uses sequences of one or two unsigned 16-bit code units to encode Unicode code points. Values U+0000 to U+FFFF are encoded in one 16-bit unit with the same value. Supplementary characters are encoded in two code units, the first from the high-surrogates range (U+D800 to U+DBFF), the second from the low-surrogates range (U+DC00 to U+DFFF). This may seem similar in concept to multi-byte encodings, but there is an important difference: The values U+D800 to U+DFFF are reserved for use in UTF-16; no characters are assigned to them as code points. This means, software can tell for each individual code unit in a string whether it represents a one-unit character or whether it is the first or second unit of a two-unit character. This is a significant improvement over some traditional multi-byte character encodings, where the byte value 0x41 could mean the letter "A" or be the second byte of a two-byte character.</p>
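<p>The surrogate arithmetic described above can be sketched as follows. The class and its two helpers are hypothetical names for illustration; in practice Character.toChars does the same split.</p>

```java
class SurrogateMath {
    // Split a supplementary code point (U+10000..U+10FFFF) into its UTF-16 pair.
    static char[] toSurrogates(int codePoint) {
        if (!Character.isSupplementaryCodePoint(codePoint)) {
            throw new IllegalArgumentException("not a supplementary code point");
        }
        int offset = codePoint - 0x10000;               // 20 significant bits remain
        char high = (char) (0xD800 + (offset >> 10));   // top 10 bits
        char low = (char) (0xDC00 + (offset & 0x3FF));  // bottom 10 bits
        return new char[] {high, low};
    }

    // Classify a single code unit without looking at its neighbours.
    static String classify(char unit) {
        if (Character.isHighSurrogate(unit)) return "first unit of a pair";
        if (Character.isLowSurrogate(unit)) return "second unit of a pair";
        return "one-unit character";
    }

    public static void main(String[] args) {
        char[] pair = toSurrogates(0x10177);
        System.out.println(Integer.toHexString(pair[0]) + " " + Integer.toHexString(pair[1]));
        for (char unit : new char[] {'A', pair[0], pair[1]}) {
            System.out.println(classify(unit));
        }
    }
}
```

<p>Because the surrogate ranges are reserved, classify needs only the code unit itself, which is the property the paragraph above contrasts with older multi-byte encodings.</p>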
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[Java Unicode Support]]></title>
<link>http://krsethur.wordpress.com/2005/12/16/java-unicode-support/</link>
<pubDate>Fri, 16 Dec 2005 09:22:47 +0000</pubDate>
<dc:creator>Krishnamoorthy Sethuraman</dc:creator>
<guid>http://krsethur.wordpress.com/2005/12/16/java-unicode-support/</guid>
<description><![CDATA[Check the JSR-204 for java unicode support : JSR-204 Supplementary Character Support Approach Use th]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p>See JSR-204 for Java Unicode support: <a href="http://jcp.org/aboutJava/communityprocess/first/jsr204/index.html">JSR-204</a></p>
<p>Supplementary Character Support Approach</p>
<ul>
<li>Use the primitive type int to represent code points in low-level APIs, such as the static methods of the Character class.</li>
<li>Interpret char sequences in all forms (char[], implementations of java.lang.CharSequence, implementations of java.text.CharacterIterator) as UTF-16 sequences, and promote their use in higher-level APIs.</li>
<li>Provide APIs to easily convert between various char and code point based representations.</li>
</ul>
<p>A good blog post on Unicode support in J2SE 5: <a href="http://weblogs.java.net/blog/joconner/archive/2004/04/unicode_40_supp.html">John O'Conner's blog</a><br />
Highlights:<br />
# char is a UTF-16 code unit, not a code point<br />
# new low-level APIs use an int to represent a Unicode code point<br />
# high level APIs have been updated to understand surrogate pairs<br />
# a preference towards char sequence APIs instead of char based methods</p>
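<p>A short sketch of the char-sequence and code-point APIs these highlights refer to. The class name and the reverseByCodePoint helper are hypothetical; the String, StringBuilder and Character methods used are the standard J2SE 5 additions.</p>

```java
class CodePointApis {
    // Reverse a string by code point so that surrogate pairs stay in order.
    // Illustrative helper only; StringBuilder.reverse() also keeps pairs intact.
    static String reverseByCodePoint(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = s.length(); i > 0; ) {
            int cp = s.codePointBefore(i);
            out.appendCodePoint(cp);
            i -= Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String s = "a" + new String(Character.toChars(0x10177)) + "b";
        System.out.println("UTF-16 code units : " + s.length());
        System.out.println("code points       : " + s.codePointCount(0, s.length()));
        // char index at which the third code point starts
        System.out.println("offset of 3rd     : " + s.offsetByCodePoints(0, 2));
        String reversed = reverseByCodePoint(s);
        System.out.println("matches reverse() : "
                + reversed.equals(new StringBuilder(s).reverse().toString()));
    }
}
```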
</div>]]></content:encoded>
</item>
<item>
<title><![CDATA[UTF-8 Encoding Rules]]></title>
<link>http://krsethur.wordpress.com/2005/08/25/112495295907319520/</link>
<pubDate>Thu, 25 Aug 2005 12:25:59 +0000</pubDate>
<dc:creator>Krishnamoorthy Sethuraman</dc:creator>
<guid>http://krsethur.wordpress.com/2005/08/25/112495295907319520/</guid>
<description><![CDATA[The UTF-8 encoding rules 1. Characters U+0000 to U+007F (ASCII) are encoded simply as bytes 0&#215;0]]></description>
<content:encoded><![CDATA[<div class='snap_preview'><p><b><u><font size="3">The UTF-8 encoding rules</font></u></b></p>
<p><font face="Times New Roman">1. </font>Characters U+0000 to U+007F (ASCII) are encoded simply as bytes 0x00 to 0x7F (ASCII compatibility). This means that files and strings which contain only 7-bit ASCII characters have the same encoding under both ASCII and UTF-8. A single byte is needed for any of these characters.<br />
<font face="Times New Roman">2. </font>All characters above U+007F are encoded as a sequence of several bytes, each of which has the most significant bit set. Therefore, no ASCII byte (0x00-0x7F) can appear as part of any other character.<br />
<font face="Times New Roman">3. </font>The first byte of a multibyte sequence that represents a non-ASCII character is always in the range 0xC0 to 0xFD and it indicates how many bytes follow for this character. All further bytes in a multibyte sequence are in the range 0x80 to 0xBF. This allows easy resynchronization and makes the encoding stateless and robust against missing bytes.<br />
<font face="Times New Roman">4. </font>All possible 2<sup>31</sup> UCS codes can be encoded.<br />
<font face="Times New Roman">5. </font>UTF-8 encoded characters may theoretically be up to six bytes long; however, 16-bit characters (the ones implicitly supported in the UCS-2 encoding, and by Str Library) are only up to three bytes long. RFC 3629 has since restricted UTF-8 to the range U+0000 to U+10FFFF, so a conforming encoder emits at most four bytes.<br />
<font face="Times New Roman">6. </font>The sorting order of big-endian UCS-4 byte strings is preserved.<br />
<font face="Times New Roman">7. </font>The bytes 0xFE and 0xFF are never used in this encoding.</p>
<p>The following byte sequences are used to represent a character:</p>
<p>U-00000000 &#8211; U-0000007F: 0xxxxxxx<br />
U-00000080 &#8211; U-000007FF: 110xxxxx 10xxxxxx<br />
U-00000800 &#8211; U-0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx<br />
U-00010000 &#8211; U-001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx<br />
U-00200000 &#8211; U-03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx<br />
U-04000000 &#8211; U-7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx</p>
<p>The xxx bit positions are filled with the bits of the character code number in binary representation. The rightmost x bit is the least-significant bit. Only the shortest possible multibyte sequence which can represent the code number of the character can be used. Note that in multibyte sequences, the number of leading 1 bits in the first byte is identical to the number of bytes in the entire sequence.</p>
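<p>The table and the leading-bits rule above can be sketched in Java. This is a hand-written illustration, not a library routine: the class and method names are made up, the encoder covers only the first four rows (up to U+10FFFF, the range Java can represent), and it assumes Java 7 or later for StandardCharsets.</p>

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Utf8Sketch {
    // Encode one code point using the shortest form from the table above.
    // Limited to U+0000..U+10FFFF with the surrogate range excluded.
    static byte[] encode(int cp) {
        if (!Character.isValidCodePoint(cp) || (cp >= 0xD800 && cp <= 0xDFFF)) {
            throw new IllegalArgumentException("cannot encode: " + Integer.toHexString(cp));
        }
        if (cp <= 0x7F) {
            return new byte[] {(byte) cp};
        } else if (cp <= 0x7FF) {
            return new byte[] {
                (byte) (0xC0 | (cp >> 6)),
                (byte) (0x80 | (cp & 0x3F))
            };
        } else if (cp <= 0xFFFF) {
            return new byte[] {
                (byte) (0xE0 | (cp >> 12)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F))
            };
        } else {
            return new byte[] {
                (byte) (0xF0 | (cp >> 18)),
                (byte) (0x80 | ((cp >> 12) & 0x3F)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F))
            };
        }
    }

    // Length of a sequence, read from the leading 1 bits of its first byte.
    static int sequenceLength(byte first) {
        int ones = Integer.numberOfLeadingZeros(~(first << 24));
        if (ones == 0) {
            return 1; // 0xxxxxxx: a single ASCII byte
        }
        if (ones == 1 || ones > 6) {
            throw new IllegalArgumentException("not a valid first byte");
        }
        return ones;
    }

    public static void main(String[] args) {
        for (int cp : new int[] {0x41, 0x7FF, 0xFFFF, 0x10177}) {
            byte[] mine = encode(cp);
            byte[] jdk = new String(Character.toChars(cp)).getBytes(StandardCharsets.UTF_8);
            System.out.println("U+" + Integer.toHexString(cp).toUpperCase()
                    + " length " + sequenceLength(mine[0])
                    + " matches JDK: " + Arrays.equals(mine, jdk));
        }
    }
}
```

<p>Comparing against String.getBytes is a cheap sanity check that the bit layout was copied from the table correctly.</p>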
</div>]]></content:encoded>
</item>

</channel>
</rss>
