Doc Edward Morbius ⭕​ · @dredmorbius
2082 followers · 14677 posts · Server toot.cat

@charlotte There are certainly contexts in which Unicode unambiguously and demonstrably leads to security weaknesses and issues. See generally homoglyph attacks.
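A minimal sketch of the problem (the spoofed domain here is a hypothetical example): the Cyrillic letter а (U+0430) renders nearly identically to the Latin a (U+0061), so two strings that look the same to a reader are entirely different to software.

```python
import unicodedata

# Hypothetical homoglyph spoof: Cyrillic 'а' (U+0430) standing in
# for Latin 'a' (U+0061). The two strings render near-identically.
genuine = "paypal.com"
spoofed = "p\u0430ypal.com"

print(genuine == spoofed)            # False: different code points
print(unicodedata.name(genuine[1]))  # LATIN SMALL LETTER A
print(unicodedata.name(spoofed[1]))  # CYRILLIC SMALL LETTER A
```

Any comparison, allowlist, or certificate check keyed on the visible string rather than the code points falls to exactly this substitution.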

At the heart of the lie and the damage is a message which appears to say one thing but in fact says something different. It's the very limited nature of 7-bit ASCII, 128 characters in total, which provides its utility here. Yes, this means that texts in other languages must be represented by transliterations and approximations. That's ... simply a necessary trade-off.

We see this in other domains, in which standardisation is adopted to reduce ambiguity and emphasise clarity.

Internationally, air traffic control communications occur in English, and aircraft navigation uses feet for altitude and nautical miles for distance.

Through the early 20th century, the language of diplomacy was French. The language of much scientific discourse, particularly in physics, was German. And for the Catholic Church, Latin was abandoned for mass only in the 1960s.

Trading and maritime cultures tend to create pidgin languages --- common amongst participants, but foreign to all, as distinguished from a creole, an amalgam language with native speakers.

A key problem with computers is that the encodings used to create visual glyphs and the glyphs themselves are two distinct entities, and there can be a tremendous amount of ambiguity and confusion over similarly-appearing characters. Or, in many cases, glyphs cannot be represented at all.
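The encoding/glyph split shows up even within a single script. A sketch of one case: "é" can be a single precomposed code point (U+00E9) or a plain "e" followed by a combining accent (U+0301); both render identically, yet the strings compare unequal until they are normalised.

```python
import unicodedata

precomposed = "caf\u00e9"    # 'é' as one code point, U+00E9
decomposed = "cafe\u0301"    # 'e' + COMBINING ACUTE ACCENT, U+0301

print(precomposed == decomposed)          # False, though they render the same
print(len(precomposed), len(decomposed))  # 4 5

# Normalising both sides to NFC makes the comparison match what a reader sees.
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
```

This is a different failure mode from homoglyphs --- here the characters really are "the same" to any reader, and it's the machine representation that disagrees.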

Where the full expressive value of language is required --- within texts, in descriptive fields, and in local or native contexts --- I'm ... mostly ... open to Unicode (though it can still present problems).

Where broad and universal understanding is the foremost functional requirement, selecting a small, standardised, and widely-recognised character set has tremendous value, and no amount of emotive shaming changes that fact.

As an example, OpenStreetMap generally represents local place names in the local language and character set. This may preserve respect for, or the integrity of, the local culture. As a user of the map who does not know that language or writing system, however, I find it utterly useless. As would, quite frankly, anyone else not specifically literate in that language and writing system.

It's worth considering that the character set and language in question are themselves adoptions and impositions: English was brought into Britain by invaders, and the alphabet used is itself Roman, based on Greek and ultimately Phoenician glyphs. English has adopted or incorporated terms from a huge set of other languages (rendering its own internal consistency ... low ... and making it confusing to learn).

International communications and signage, at airports, on roadways, in public buildings, on electronic devices, aims at small message sets and consistent, widely-recognised symbols, shapes, fonts, and colours. That is a context in which the freedoms of unfettered Unicode adoption are in fact hazardous.

(Yes, many of those symbols now have Unicode code points. It is the symbol set and glyph set which is constrained in public usage.)

And the simple fact is that a widely recognised encoding system will most often reflect some power structure or hierarchy, as that's how these encodings become known --- English, the Roman alphabet, French, German, Latin, etc. Minor powers tend not to find their writing systems widely adopted (yes, there are exceptions: the use of Greek within the Roman Empire, Hindu numerals). Again, exceptions.

#unicode #risk #complexity #constraints #homoglyphs #HomoglyphAttacks

Last updated 3 years ago