Aule Browser: Kanjidic2 UTF-8 character literal

Wednesday, April 18, 2012

Kanjidic2 UTF-8 character literal

Today I found that not all characters in the kanjidic2 XML file could be parsed and displayed as UTF-8 – at least not in the fonts available to me in IE.

I have posted a UTF-8 web page showing the characters that did display. I extracted them with Curl's XML path library, reading the content of the "literal" element inside each "character" element.
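For anyone who wants to try the same extraction outside Curl, here is a minimal sketch in Python using the standard library's ElementTree. It is not the Curl code I used; the `SAMPLE` string is a tiny stand-in for kanjidic2.xml, and the helper name `literals_from_root` is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Tiny stand-in for kanjidic2.xml (illustrative only). It uses the same
# element names as the real file: <character>, <literal>, <codepoint>.
SAMPLE = """<kanjidic2>
  <character>
    <literal>&#x4E9C;</literal>
    <codepoint><cp_value cp_type="ucs">4e9c</cp_value></codepoint>
  </character>
  <character>
    <literal>&#x5516;</literal>
    <codepoint><cp_value cp_type="ucs">5516</cp_value></codepoint>
  </character>
</kanjidic2>"""


def literals_from_root(root):
    """Return the text of the <literal> child of every <character> element."""
    return [ch.findtext("literal") for ch in root.iter("character")]


root = ET.fromstring(SAMPLE)
literals = literals_from_root(root)

# Print how many literals were found, then the characters themselves.
print(len(literals), "".join(literals))
```

For the real file, `ET.parse("kanjidic2.xml").getroot()` should give a root that can be passed to the same helper, assuming the file is in the working directory.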

Parsing gave a total of 13,108 elements when I was expecting fewer than 7,000. Issues arose only in the last thousand or so elements: problems start after the first 12,166 elements, and even in that tail end a few characters did display.
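One way to see what the tail actually contains is to print the code point of each literal from a given position onward. A small Python sketch, assuming the literals are already in a list with one character per entry; the `describe` helper and the two-item `sample` list are hypothetical.

```python
def describe(literals, start=0):
    """Yield (position, literal, 'U+XXXX') for each literal after the first `start`.

    Positions are 1-based, so start=12166 begins at element 12,167.
    Each literal is assumed to be a single character.
    """
    for pos, lit in enumerate(literals[start:], start=start + 1):
        yield pos, lit, "U+%04X" % ord(lit)


# Illustrative two-item list, not real parse output.
sample = ["\u4e9c", "\ufa6a"]

# Skip the first element and describe the rest.
for row in describe(sample, start=1):
    print(*row)
```

With the full list of parsed literals, `describe(literals, start=12166)` would list only the elements in the problem range.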

In the last thousand elements, only the following few literal values would display:

匇匤咊增寬嵓德晥栁橫瀨炻甁皞礰竧綠緖荢薰譿賴郞鄕霻靍馞魲黑朗隆﨏塚﨑﨓凞猪神祥福﨟蘒﨡諸﨤都

If you have been parsing the kanjidic XML dictionary and have an idea, please drop a line.

UPDATE

Using a tail view of kanjidic2.xml, I can see that the last UCS codepoint is FA6A, which displays correctly in my browser as 頻.

I will revise my parse path to pull the codepoint instead of the literal.
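As a rough idea of what pulling the codepoint could look like, here is a Python sketch with ElementTree rather than Curl. It assumes each "character" element carries a "codepoint" child holding a "cp_value" element with `cp_type="ucs"` and a hexadecimal value. The `SAMPLE` data and the `ucs_codepoints` helper are illustrative only.

```python
import xml.etree.ElementTree as ET

# Illustrative stand-in for kanjidic2.xml, not an excerpt from the real file.
SAMPLE = """<kanjidic2>
  <character>
    <literal>&#x4E9C;</literal>
    <codepoint>
      <cp_value cp_type="ucs">4e9c</cp_value>
      <cp_value cp_type="jis208">16-01</cp_value>
    </codepoint>
  </character>
  <character>
    <literal>&#xFA6A;</literal>
    <codepoint>
      <cp_value cp_type="ucs">fa6a</cp_value>
    </codepoint>
  </character>
</kanjidic2>"""


def ucs_codepoints(root):
    """Return the UCS code point (as an int) of every <character> element."""
    values = []
    for ch in root.iter("character"):
        # Select only the cp_value whose cp_type attribute is "ucs".
        text = ch.findtext("codepoint/cp_value[@cp_type='ucs']")
        if text is not None:
            values.append(int(text, 16))
    return values


codes = ucs_codepoints(ET.fromstring(SAMPLE))

# Print each code point in U+XXXX form next to the character built from it.
for code in codes:
    print("U+%04X" % code, chr(code))
```

Building the character from the number with `chr` sidesteps the literal text, and the hex value is available for inspection even when the glyph will not display.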
