Maybe not at super large font sizes. But even lowercase i and l are easy enough to confuse at a glance mid-word in most sans-serif fonts, not to mention uppercase I and lowercase l. You don’t even need “confusable” glyphs to create a domain name that will stand up to a casual visual confirmation from a busy user in a phishing context.
But what about 'Ы'? It looks like 'bl', doesn't it? 'Ы' is one codepoint and one glyph, whereas 'bl' is a sequence of two letters. I believe that the method described will miss such things. Cyrillic also has 'Ю'; I suppose it is possible to design a font that makes it look like 'lO'? Are there any fonts like this in the wild?
An interesting attempt, Claude. However, your prompt is missing an important step to measure effectiveness against humans: wait 40-60 years for your vision to degrade naturally, and check the confusables again, preferably on a small phone screen. Bonus points if you can find someone with visual disabilities from birth. Obviously most attacks aren't pixel-perfect, but that's not the point; all you need to confuse are human eyes.
Things like the Fraktur characters are obvious mismatches in any font I know. I do wonder why they're on the list.
I'm always intrigued by how the German FE-Schrift ("fälschungserschwerende Schrift", "more-difficult-to-forge font") chooses shapes for characters that make it hard for them to be turned into one another (like a 3 into an 8 or so):
As a youth in the DOS era, I was always enamored of fonts like OCR-A. There is some overlap between the problems of "make it easy to distinguish" and "make it hard to maliciously corrupt", although I can imagine some cases where they might be in conflict, especially if adding ink is asymmetrically easier than removing or covering it.
What I have always wondered about with FE-Schrift: they painstakingly made all glyphs distinguishable, but completely f'ed it up with V and Y: the "stalk" of the Y is vertical and so short that they're very easy to confuse. They could have made the "stalk" slanted, or even curved like in lowercase "g", and most people would have still recognized it as a "Y"...
Having 0 and O, or l and I, look the same in a single font is a crime of modern typography.
Also, I remember that the 8x16 VGA font that came with KeyRus had some slight differences between Cyrillic and Latin lookalikes, which brought a strange sense of comfort when reading, and especially when typing the letter c, because its Cyrillic lookalike is located on the same key.
> A domain using only Cyrillic characters that happen to spell a Latin word (like “аpple” in all-Cyrillic) may still render in the address bar’s font and look identical.
that is very interesting.
I imagine the browser could take some context clues and switch rendering to Punycode if the locale of the user is nowhere near a Cyrillic region. But that is only going to patch some edge cases and miss others.
Ideally, the solution is password managers everywhere, which don't have this vulnerability, instead of relying on human eyes, which are vulnerable, to visually recognize web URLs.
I think the lack of exploration of the context around the problem and current mitigations is an issue with the article - it spends a lot of time talking about the possible threat, but very little time on whether the attack is actually practical with modern mitigations.
>> A domain using only Cyrillic characters that happen to spell a Latin word (like “аpple” in all-Cyrillic) may still render in the address bar’s font and look identical
Here you go:
https:// аррlе.соm
(using English "l" and "m" here, Russian м looks different)
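To make the lookalike above concrete, here is a minimal sketch of how such a domain converts to its ASCII (Punycode) form, which is what browsers fall back to when they decide not to display the Unicode name. It uses Python's built-in `idna` codec (IDNA 2003). The exact code points in `spoofed` are an assumption, since the characters in the comment cannot be identified by eye, and `to_ascii` is just an illustrative helper name.

```python
# Assumed spelling of the lookalike: Cyrillic а, р, р, е and с, о,
# with Latin "l" and "m", as the comment describes.
spoofed = "\u0430\u0440\u0440l\u0435.\u0441\u043em"
genuine = "apple.com"


def to_ascii(domain: str) -> str:
    """Encode each label of a domain with Python's built-in IDNA codec."""
    return domain.encode("idna").decode("ascii")


spoofed_ascii = to_ascii(spoofed)
genuine_ascii = to_ascii(genuine)

# The two names may look alike on screen, but their ASCII forms differ:
# every label containing non-ASCII characters gets an "xn--" prefix.
print(spoofed_ascii)
print(genuine_ascii)
print(spoofed_ascii == genuine_ascii)
```

A password manager effectively performs this kind of exact comparison on the underlying name, which is why it is not fooled by how the glyphs render.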
This seems misguided. The fact that 'ρ' isn't a pixel for pixel match for 'p' doesn't mean they're not confusable. The threat model is not being unable to solve a spot-the-difference puzzle. Unless you are familiar with every pixel of your system fonts, and carefully scrutinize every character on your screen, the lack of an exact match in jρmorgan[.]com in a URL is going to do very little for you. There are many English characters that have multiple totally distinct ways to write them, so you can have two 'a' variants that are distinct but equally 'normal' looking. I guess if you get an LLM to write your blog posts they don't have to make much sense to begin with.
This is really cool. I loved the technical breakdown and side by side comparisons. Surprised to hear that Microsoft and macOS default fonts didn't score so well!
Why are all the descending letters truncated in the titles? Not sure if it's a CSS glitch or a terrible font choice. A bit ironic on an article about fonts.
> some patterns of speech are so recognizably LLM, i am convinced that the AI detection startups have a very strong chance to succeed on text.
The problem for them is the market. Those who actually want to buy AI detection tools usually want the impossible - detecting any kind of AI-written text, or even AI-written-human-edited text.
You're right in that many HN articles (not going to comment on this one specifically) are very easy to detect. But that's just because these article writers are too lazy to even use any of the plethora of tools that remove the smells automatically, or tools that write without them in the first place (I've made such a tool myself), or even just adjusting the prompt to write in a different style that avoids them.
Most people who would be interested in paying for AI detection tools want them to detect all of the above cases too, which is of course impossible.
I mean, no shit Sherlock, Cyrillic letters being indistinguishable from English ones is what Russian speakers have been using to get around braindead keyword сеnsоrshір¹ forever, same way kids type "de@th" on TikTok to avoid automoderation.
Most of the added value in this article can be summed up by saying that the Cyrillic glyphs are identical to the similar English ones in the fonts that the author looked at (which isn't true for all fonts), and that the author didn't find many other such examples.
_______
¹ Try matching that word with "censorship" for fun
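Taking the footnote up on its offer, the sketch below shows why a plain keyword match fails on the mixed-script word, and one simple way to flag it. The code points in `mixed` are an assumed spelling (Cyrillic с, е, о, і, р with Latin n, s, r, h), and `scripts` is a hypothetical helper that reads the script from the first word of each character's Unicode name, a rough heuristic rather than a proper script lookup.

```python
import re
import unicodedata

# Assumed spelling of the disguised word from the comment above.
mixed = "\u0441\u0435ns\u043ersh\u0456\u0440"
plain = "censorship"


def scripts(text: str) -> set[str]:
    """Return the scripts used in text, based on Unicode character names."""
    return {unicodedata.name(ch).split()[0] for ch in text if ch.isalpha()}


# A naive keyword filter sees two unrelated strings of the same length.
print(mixed == plain)
print(re.search(plain, mixed))

# Mixing scripts inside one word is the giveaway.
print(sorted(scripts(plain)))
print(sorted(scripts(mixed)))
```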
Maybe not. I checked OP's blog and he seems to be putting up 2-3 longer posts per day. Since it is LLM content, I have no idea whether it's mainly hallucinations or based on facts. So what did I learn from reading the article? Maybe nothing, maybe it's just made up.
Yes, some patterns of speech are recognizable … The "That's LLM generated" pattern is one of those. And while I can understand the motivation behind it, I now find it more irritating than LLM texts, as long as those contain useful information that makes me curious.
This text made me curious, I liked the approach the author has taken. And it made me think how I would do it. My first idea would be to use ImageMagick to render text and then use ImageMagick's https://imagemagick.org/script/compare.php to somehow calculate the risk of confounding glyphs.
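The render-and-compare idea can be sketched without any imaging tools by counting differing pixels between two bitmaps. This is only a stand-in for a real image diff: the 5x7 bitmaps below are hand-drawn and hypothetical, not taken from any font, and the helper names are made up for illustration. A real test would rasterize the glyphs from the font files first.

```python
# Hypothetical 5x7 bitmaps ("#" = ink). Not taken from any real font.
LOWER_L = ["..#.."] * 7
UPPER_I_SANS = ["..#.."] * 7
UPPER_I_SERIF = [".###."] + ["..#.."] * 5 + [".###."]


def differing_pixels(a: list[str], b: list[str]) -> int:
    """Count positions where two equally sized bitmaps disagree."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("bitmaps must have the same dimensions")
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


def difference_ratio(a: list[str], b: list[str]) -> float:
    """Fraction of pixels that differ; 0.0 means pixel-identical."""
    total = len(a) * len(a[0])
    return differing_pixels(a, b) / total


print(difference_ratio(LOWER_L, UPPER_I_SANS))
print(difference_ratio(LOWER_L, UPPER_I_SERIF))
```

As other comments in the thread point out, a low ratio is sufficient but not necessary for confusion: glyphs that differ by several pixels can still fool a reader at a glance, so any threshold on this number is a judgment call.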
https://en.wikipedia.org/wiki/FE-Schrift
https://en.wikipedia.org/wiki/OCR-A
Anyone reading this - please, please, please do not make any assumptions based on the end-user's geography.
Signed, someone who can cross 3 national and 4 language borders within a few hours of driving.
[0]: https://fonts.google.com/specimen/Syne
> "This is not theoretical. It is a measured property of the font files shipping on every Mac."
some patterns of speech are so recognizably LLM, i am convinced that the AI detection startups have a very strong chance to succeed on text.
I don't have a Mac.
So: Don't be snarky? Maybe we need another rule here, to limit comments on "LLM style" https://news.ycombinator.com/newsguidelines.html