Wednesday, April 18, 2012

Kanjidic2 UTF-8 character literal


Today I found that not all characters in the kanjidic2 XML file could be parsed and displayed as UTF-8 – at least not in the fonts available to me in IE.

I have posted a UTF-8 web page listing the characters that did display. I extracted them with Curl's XML path library, reading the content of the "literal" element of each "character" element.

Parsing gave a total of 13,108 character elements when I was expecting fewer than 7,000. Problems arose only in the last thousand or so elements, starting after the first 12,166, and even in that tail end a few did display.
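For anyone who wants to reproduce the count outside Curl, here is a minimal sketch in Python (not the Curl code I used). The SAMPLE string is a made-up stand-in for kanjidic2.xml that keeps only the character/literal nesting, and literal_report is a hypothetical helper name. It counts the character elements and prints the code point of each literal in the tail, flagging whether it lies outside the Basic Multilingual Plane. That is only one thing worth checking, not a confirmed cause of the display problem.

```python
import xml.etree.ElementTree as ET

# Tiny stand-in for kanjidic2.xml (an assumption for illustration):
# the real file uses the same <character>/<literal> nesting with far more entries.
# U+20089 is simply an example of a character outside the BMP.
SAMPLE = """<kanjidic2>
  <character><literal>\u4e9c</literal></character>
  <character><literal>\ufa6a</literal></character>
  <character><literal>\U00020089</literal></character>
</kanjidic2>"""


def literal_report(xml_text, tail=2):
    """Return (total, rows) where rows describe the last `tail` literals.

    Each row is (literal, 'U+XXXX', in_bmp).
    """
    root = ET.fromstring(xml_text)
    literals = [c.findtext("literal") for c in root.iter("character")]
    rows = []
    for lit in literals[-tail:]:
        cp = ord(lit)  # each literal is expected to be a single character
        rows.append((lit, f"U+{cp:04X}", cp <= 0xFFFF))
    return len(literals), rows


total, rows = literal_report(SAMPLE)
print(total)
for row in rows:
    print(row)
```

To run this against the real file, read kanjidic2.xml as text and pass it in with a larger tail value, for example 1000.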

In the last thousand elements, only the following few literal values would display:

匇 匤 咊 增 寬 嵓 德 晥 栁 橫 瀨 炻 甁 皞 礰 竧 綠 緖 荢 薰 譿 賴 郞 鄕 霻 靍 馞 魲 黑 朗 隆 﨏 塚 﨑 﨓 凞 猪 神 祥 福 﨟 蘒 﨡 諸 﨤 都

If you have been parsing the kanjidic XML dictionary and have an idea about the cause, please drop me a line.

UPDATE

Using a tail view of kanjidic2.xml, I can see that the last UCS codepoint is FA6A, which displays correctly in my browser.

I will revise my parse path to pull the codepoint instead of the literal.
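As a rough illustration of pulling the codepoint rather than the literal, here is a short Python sketch (again, not the Curl XML path code). It assumes each character element carries a codepoint element whose cp_value children are tagged with a cp_type attribute, with "ucs" holding the hexadecimal Unicode value. The SAMPLE data, the JIS value in it, and the ucs_codepoints name are illustrative only.

```python
import xml.etree.ElementTree as ET

# Hand-written stand-in for two kanjidic2 entries (illustrative values only).
SAMPLE = """<kanjidic2>
  <character>
    <literal>\u4e9c</literal>
    <codepoint>
      <cp_value cp_type="ucs">4e9c</cp_value>
      <cp_value cp_type="jis208">16-01</cp_value>
    </codepoint>
  </character>
  <character>
    <literal>\ufa6a</literal>
    <codepoint>
      <cp_value cp_type="ucs">fa6a</cp_value>
    </codepoint>
  </character>
</kanjidic2>"""


def ucs_codepoints(xml_text):
    """Return [(code point, character), ...] built from the 'ucs' cp_value."""
    root = ET.fromstring(xml_text)
    result = []
    for character in root.iter("character"):
        # Select only the cp_value whose cp_type attribute is "ucs".
        hex_value = character.findtext("codepoint/cp_value[@cp_type='ucs']")
        if hex_value is None:
            continue  # skip entries without a UCS value
        cp = int(hex_value, 16)
        result.append((cp, chr(cp)))
    return result


for cp, ch in ucs_codepoints(SAMPLE):
    print(f"U+{cp:04X}", ch)
```

Building the character from the numeric codepoint sidesteps whatever happens to the literal text during parsing, so any remaining display failures would point at fonts rather than at the parse.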

