Trapped in codepoints no more! I’m freeing Chinese characters

by Gábor Ugray

Unless you’re one of the 1.5 billion people who learned to write Chinese in school, you probably only know the script as that remarkable thing where you must master thousands of characters to read the daily paper. But those thousands of characters have a structure: they mix, re-mix, combine, shuffle and creatively glue together only a few hundred components. Together, these make up an intricate system where elements combine with other elements, adding a piece of meaning here and hinting at the pronunciation there. The details are often more arcane than the spelling of English words, but once you get the knack of the system, the script as a whole suddenly starts making sense.

The way computers encode Chinese characters erases all of this. There are no components, no shapes, and no system of interlocking parts. It’s all reduced to one code point per character.

I’m showing you how I’m building a dataset of Chinese characters and their parts. I promote SVG shapes to first-class citizens with meaning, sound, and historical background. The knowledge is out there in print books and unstructured digital content, but it’s never been collected in a thought-through, machine-readable format. It’s a long journey: in 2 years I’ve covered 20% of the 9,000 characters in common use today.

I’ll conclude by showing the incredibly cool tools this dataset makes possible, from an interactive two-dimensional graph of every Chinese character to a unique cross-linked character dictionary app.

Gábor is co-founder of memoQ, the tool that brought real-time collaboration to translators before Google Docs was cool. He loves building whimsical language tools and has been known to train neural networks to translate. He blogs at jealousmarkup.xyz and tweets as @twilliability. He was last spotted zooming across Berlin on a sleek red racing bike.