ks2048 8 days ago

That's only the tip of the iceberg of hyphen-looking characters.

Here's some more,

  2010 ; 002D ; MA #* ( ‐ → - ) HYPHEN → HYPHEN-MINUS # 
  2011 ; 002D ; MA #* ( ‑ → - ) NON-BREAKING HYPHEN → HYPHEN-MINUS # 
  2012 ; 002D ; MA #* ( ‒ → - ) FIGURE DASH → HYPHEN-MINUS # 
  2013 ; 002D ; MA #* ( – → - ) EN DASH → HYPHEN-MINUS # 
  FE58 ; 002D ; MA #* ( ﹘ → - ) SMALL EM DASH → HYPHEN-MINUS # 
  06D4 ; 002D ; MA #* ( ‎۔‎ → - ) ARABIC FULL STOP → HYPHEN-MINUS # →‐→
  2043 ; 002D ; MA #* ( ⁃ → - ) HYPHEN BULLET → HYPHEN-MINUS # →‐→
  02D7 ; 002D ; MA #* ( ˗ → - ) MODIFIER LETTER MINUS SIGN → HYPHEN-MINUS # 
  2212 ; 002D ; MA #* ( − → - ) MINUS SIGN → HYPHEN-MINUS # 
  2796 ; 002D ; MA #* (  → - ) HEAVY MINUS SIGN → HYPHEN-MINUS # →−→
  2CBA ; 002D ; MA # ( Ⲻ → - ) COPTIC CAPITAL LETTER DIALECT-P NI → HYPHEN-MINUS # →‒→
copied from https://www.unicode.org/Public/security/8.0.0/confusables.tx...
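
A quick way to fold these back to ASCII in Python is a translation table built from that list (a sketch, not exhaustive):

  # Dash-like code points from the confusables list above, all folded to
  # ASCII HYPHEN-MINUS (U+002D). Extend as needed.
  DASH_LIKE = "\u2010\u2011\u2012\u2013\ufe58\u06d4\u2043\u02d7\u2212\u2796\u2cba"
  TO_HYPHEN = str.maketrans({ch: "-" for ch in DASH_LIKE})

  print("\u201210.50".translate(TO_HYPHEN))  # "-10.50"
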
  • renhanxue 8 days ago

    Three Minus Signs for the Mathematicians under the pi,

      2212 MINUS SIGN
      2796 HEAVY MINUS SIGN
      02D7 MODIFIER LETTER MINUS SIGN
    
    Seven Dashes for the Dash-lords in their quotes as shown,

      2012 FIGURE DASH
      2013 EN DASH
      2014 EM DASH
      2015 QUOTATION DASH
      2E3A TWO-EM DASH
      2E3B THREE-EM DASH
      FE58 SMALL EM DASH
    
    Nine Hyphens for Word Breakers, one of them ­,

      00AD SOFT HYPHEN
      058A ARMENIAN HYPHEN
      1400 CANADIAN SYLLABICS HYPHEN
      1806 MONGOLIAN TODO SOFT HYPHEN
      2010 HYPHEN
      2011 NON-BREAKING HYPHEN
      2E17 DOUBLE OBLIQUE HYPHEN
      2E40 DOUBLE HYPHEN
      30A0 KATAKANA-HIRAGANA DOUBLE HYPHEN
    
    One for the Dark Word in the QWERTY zone

    In the land of ASCII where Basic Latin lie.

    One String to rule them all, One String to find them,

    One String to bring them all and in the plain-text, bind them

    In the land of ASCII where Basic Latin lie.

      002D HYPHEN-MINUS
    
    - @FakeUnicode on Twitter, with apologies to J. R. R. Tolkien
  • markus_zhang 8 days ago

    I think it's a good idea to write a plugin for any IDE to highlight those confusing characters.

    • MrJohz 8 days ago

      I know vscode had this feature built in, and it's come in handy a couple of times for me.

    • samatman 8 days ago

      VSCode does this out of the box actually. Ended up putting a few on a whitelist while writing Julia, where it can get kind of ugly (puts a yellow box around them).

    • userbinator 8 days ago

      Using an ASCII-only font automatically shows all characters that IMHO should not be present in source code.

      • makeitdouble 8 days ago

        A note on non-ascii in code: I thought of it as an abomination, until hitting test pattern descriptors.

        On a project targeted at non-English-speaking devs with a strong domain knowledge requirement, writing the test patterns (endless arrays of input -> expected output sequences, interspersed with adjustment code) in the native language saves an incredible amount of time and effort, in particular as we don't need to translate obscure notions into even more obscure English.

        And that had very few downsides as it's not production running code, lining will still raise anything problematic, and the whole thing is easier to get reviewed by non-domain experts.

        We could have made a translation layer to have the content in a spreadsheet and convert it to test code, but that's not any more stable than having unicode names straight into the code.

        • nine_k 8 days ago

          String constants / symbols are one domain; keywords and reserved characters are another. They should be checked for different things. E.g. spell-checking string constants as plain text if they look like plain text is helpful. Checking for non-ASCII quotes / dashes / other punctuation outside quoted strings, where they can only occur by mistake, is also helpful.

          • makeitdouble 7 days ago

            My comment got mistakenly autocorrected (meant "linting" instead of "lining"), which is so on point given the subject.

            I agree, and think a decent linter can deal with these issues, and syntax highlighting as well.

            In particular these kinds of rules tend to get complicated, with many exceptions (down to specific folders needing dedicated rules), so doing it as lint and not at the language level gives a lot of freedom on where and how to apply the rules and raise warnings.

      • keybored 8 days ago

        For every such Unicode problem (which is a data input^W source problem, not a programming source code error) there are fifty problems caused by the anemic ASCII character set like Unix toothpicks and three layers of escaping due to using too uniform delimiters.

        (Granted this is heavily biased since so much source code is ASCII-only so you don’t get many Unicode problems in the first place...)

      • PaulHoule 7 days ago

        It's a very unpopular opinion but I use as much Unicode as I can in source code. In comments for instance I can write

        as well as italic and bold characters (would have demoed but HN filters out Unicode bold & italics) and I can write a test named

           processes中文Characters()
        
        and also write Java that looks like APL, add sigil characters in code generated stubs that will never conflict with other people's code because they're too afraid to use these characters, etc.

        https://github.com/paulhoule/ferocity/blob/main/ferocity-std...

        People will ask "how do you enter those characters?" and I say "I don't know but I can cut and paste them, they get offered by the autocomplete, etc."

        • 1-more 6 days ago

          I had a beautiful vision when programming my keyboard. The style at the time was to write a massive array in C with the keycodes for the various layers. I put commented out box drawing characters between the lines to delineate where the keys are. I wanted to use the C Preprocessor to #define the thin vertical box drawing character as a comma, but somehow that was out of the range of acceptable characters. If I had that, then my source would be 1% more readable to me, the only person who's ever going to use it.

          https://github.com/qmk/qmk_firmware/compare/master...perkee:...

          I still use tons of box drawing characters in comments. I'm actually writing a little doodad to let me edit them fluidly then copy them into my block comments, because a truth table is easy to read!

          Your comment also reminds me of the introduction of type parameters/generics in Go via the Canadian Aboriginal syllabary for "po" and "pa" ("ᐸ" and "ᐳ") https://github.com/vasilevp/aboriginal

        • Arnt 7 days ago

          Hardly unpopular where I live. Lots of source code contains € and much else. Grepping for it in the code I worked on last week, I find non-ASCII characters in dozens of tests, in some scripts that seem to be part of CI, in a comment about a locale-specific bug, and I stopped looking there.

          How to enter them? Well, the keys are on the keyboard.

          • PaulHoule 7 days ago

            If you're in Euro land.

            I have a lot of personal interest in Chinese-language content these days, but I have no idea how to set up and use an "input method", so I either see the text I want in front of me or ask an LLM "How do I write X in Chinese?" and either way cut and paste.

            • sigseg1v 7 days ago

              Chinese speakers enter words using the same type of keyboard you would use in North America. The characters are entered as "pinyin", which is a romanized phonetic method of describing Chinese words. You should be able to enable it on Windows, for example, by turning on Simplified Chinese / pinyin in the language input settings.

            • Arnt 7 days ago

              That's pretty much "type an ASCII representation of a reasonable pronunciation, then pick the right character from the drop-down menu". Details vary but that's the gist.

      • powersnail 8 days ago

        That would make it impossible to edit non-ASCII strings, like text in foreign languages. As far as I know, most editors/IDEs don't support switching fonts for string literals. It is more feasible for a syntax highlighter to highlight non-ASCII characters outside of literals.

        • Someone 8 days ago

          > As far as I know, most editors/IDE don't support switching fonts for string literals

          When asked to render a Unicode character that isn't present in the font, modern OSes will automatically pick a font that has it.

          https://en.wikipedia.org/wiki/Fallback_font: “A fallback font is a reserve typeface containing symbols for as many Unicode characters as possible. When a display system encounters a character that is not part of the repertoire of any of the other available fonts, a symbol from a fallback font is used instead. Typically, a fallback font will contain symbols representative of the various types of Unicode characters.”

          That can be avoided, for example by storing text as “one character per byte”, but I don’t think many editors do that nowadays.

          • powersnail 7 days ago

            But that would not distinguish between chars inside a string literal and chars outside of a string literal.

      • lifthrasiir 8 days ago

        String literals frequently have non-ASCII characters to say the least.

      • oneeyedpigeon 8 days ago

        It depends on whether you count html as "source code", but if so, then non-ASCII characters absolutely should be present!

      • metadat 8 days ago

        Some platforms, such as Python 3, have full UTF-8 support already, so what is the problem?

        • userbinator 8 days ago

          The one shown very clearly by this article.

          • keybored 8 days ago

            The wrong values are from PDF files. Maybe you mean using a system-wide ASCII-only font but you finished your point with “should not be present in source code”. Source code wasn’t the problem here.

            • foobarchu 7 days ago

              It very much is a problem in source code too though. It's unfortunately common in college courses (particularly non-CS courses with programming like bioinformatics) for instructors to distribute sample code as word docs. Cue students who can't run the code and don't know why because Word helpfully converted all double quotes to a "prettier" Unicode equivalent.

              • keybored 7 days ago

                Bizarrely, I have experienced the same thing from LaTeX with its purpose-made code/literal blocks.

                But the most shocking thing is printed learning resources on things like Haskell where the code examples are, on purpose, some kind of typographic printout rather than just the symbols themselves!

          • metadat 8 days ago

            Thanks usrbinator.. guilty grimace smile

            Maybe highlighting isn't such a bad idea :)

  • mjevans 8 days ago

    Also remember to squash 'wide' characters back to the ASCII table where possible, if the data is being processed by normal tools.

    There are honestly so many data-cleaning steps a pipeline could need / have to produce programmatically well-formatted data.
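
    For the fullwidth forms specifically, NFKC compatibility normalization already folds them back to ASCII (a quick Python sketch):

      import unicodedata

      # Fullwidth digits and punctuation carry compatibility decompositions,
      # so NFKC collapses them to their ASCII equivalents.
      print(unicodedata.normalize("NFKC", "\uff11\uff12\uff13\uff0e\uff15"))  # "123.5"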

  • tracker1 8 days ago

    Yeah, quotes and magic quotes are another set... Nothing like discovering MySQL treats magic quotes as ANSI quotes for purposes of SQL (injection)... AddSlashes wasn't enough.

    For what it's worth TFA could still use a regexp, it would just be slightly more complex. But the conditional statement may or may not be faster or easier to reason with.

  • toastal 8 days ago

    And yet all of these serve a different, useful purpose for semantics.

    • account42 7 days ago

      As TFA shows, no they don't. They may have been intended for different semantics, but once humans come into play, if it looks vaguely correct then it's getting used.

amiga386 8 days ago

What's old is new again. People who use the wrong tools produce data in the wrong format.

You used to get people writing web pages in Microsoft Word, a tool designed for human prose, and so has "smart quotes" on by default, hence they write:

    <div class=“a b c d”>
which is parsed as:

    <div class="“a" b="" c="" d”="">
because smart quotes aren't quotes. The author used the wrong tool for composing text. They should have used a text editor.

I also find that even people in text editors sometimes accidentally type some combination that is invisibly wrong, for example Option+Space on macOS is a non-breaking space (U+00A0) rather than regular space (U+0020) and that's quite easy to type accidentally, especially If You're Adding Capitals because shift and option are pretty near each other.

Sometimes people also manage to insert carriage returns and/or linefeeds in what's supposed to be a single-line input value, so regular expressions using "." to match anything don't go beyond the first newline unless you turn on the "dotall" (single-line) flag.
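
In Python, for instance, the flag that lets "." cross newlines is DOTALL (a quick sketch):

    import re

    value = "123\n456"
    print(re.findall(r".+", value))             # ['123', '456'] -- '.' stops at each newline
    print(re.findall(r".+", value, re.DOTALL))  # ['123\n456']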

None of this is unicode specifically, it's just the age old problem of human ingenuity in providing nonstandard data, and whether _you_ do workarounds to fix it, or you make the supplier fix it.

  • oneeyedpigeon 8 days ago

    > The author used the wrong tool for composing text. They should have used a text editor.

    Then you have the opposite problem: most text editors make it non-trivial to work with unicode. I mean, I've taken the time to learn how to type curly quotation marks vs. straight ones, but not everyone has and keyboards don't make it easy.

    • verandaguy 7 days ago

      My mental framework has been:

      - Curly quotes are typographic sugar that's easier on the human eye when reading normal, human-language text. It's reasonable for them to be automatically inserted into your typing in something like a word processor, and depending on which language you're writing in, there may be strong orthographic rules about the use of curly quotes (or their cognates, like « guillemets », etc).

      - Straight quotes belong in code by a combination of convention and practicality; unicode characters should be escaped wherever it's practical to do so (for example, if you must use "→" in your code, prefer to do e.g. "\u2192" instead -- it's clearer for future users which exact unicode arrow you were using there).

    • pavel_lishin 8 days ago

      May I ask why you use curly quotation marks instead of the straight ascii ones?

      • oneeyedpigeon 7 days ago

        In written text, I think they're far more attractive. If I need to put forward some kind of 'objective' argument, then differentiating between open and closed seems to make logical sense. Check out any printed material: 99.9% of the time, it uses curly quotes.

  • euroderf 8 days ago

    Smart quotes are the work of the Devil.

  • kragen 8 days ago

    i have this problem a lot with markdown, because i very much do want my “” smart quotes in the formatted output, but markdown also (optionally) uses "" for link titles. i recently switched to using () for link titles, which i had forgotten was an option

    also i sometimes accidentally replace a " with “ or ”, or a ' with a ‘ or ’, inside of `` or an indented code block

  • TheRealPomax 7 days ago

    nit: at the time they should have used an HTML editor. Those still existed back then.

stouset 8 days ago

This highlights a way I constantly see people misuse regex: they aren’t specific enough. You weren’t bitten by Unicode, you were bitten by lazy and unprincipled parsing. Explicitly and strictly parse every character.

For here, assuming you only have the numeric value as a token, the regex should look like

    / ^ -? [0-9]+ ( \. [0-9]+ )? $ /x
or something similar. Match the beginning and end of the string and everything in between: an optional hyphen, any number of digits, and an optional decimal component. Feel free to adjust the details to match your spec, but any unexpected character will fail to parse.
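
A minimal Python sketch of that approach (the helper name is made up; adjust the pattern to your spec):

    import re

    NUMBER = re.compile(r"-?[0-9]+(\.[0-9]+)?")

    def parse_amount(token):
        # fullmatch anchors both ends, so any stray character fails loudly
        # instead of silently producing the wrong value.
        if not NUMBER.fullmatch(token):
            raise ValueError(f"not a number: {token!r}")
        return float(token)
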
  • bregma 8 days ago

    That should be an optional minus sign, not an optional hyphen. Also, the radix character is locale-dependent, so you should use a character class for it.

    • nine_k 8 days ago

      Locale-dependent parsing is a bit more complicated.

      For instance, you likely want to accept locale-specific numerals, and any of 77 ٧ ৭ 七 match the \d character class and mean "seven", but you likely don't want to accept a string as a valid number if different types of digits are mixed together.

      Also, 1,23,456.78 is fine in an Indian locale, but is likely a typo in the en_US or en_GB locales.
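
      One rough way to check for mixed digit systems in Python is to compare the Unicode name prefixes of the digits (just a sketch):

        import unicodedata

        def digits_from_one_system(s):
            # "DIGIT SEVEN" vs "ARABIC-INDIC DIGIT SEVEN" vs "BENGALI DIGIT SEVEN":
            # more than one prefix means digit systems are being mixed.
            prefixes = {unicodedata.name(ch).rsplit("DIGIT", 1)[0] for ch in s if ch.isdigit()}
            return len(prefixes) <= 1

        print(digits_from_one_system("٧٧٧"))  # True
        print(digits_from_one_system("7٧"))   # False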

    • stouset 7 days ago

      Sure, the details depend on the exact format you're trying to parse. But the point is that you should strictly and explicitly match every component of the string.

    • IsTom 8 days ago

      > your should use a character class

      That depends on locale. Is "1,222" 1222 or 1.222?

      • account42 7 days ago

        But it definitely should not be the global process locale if you are parsing something that doesn't originate from the user's environment (and even then, using something fixed like en_US or the saner en_DK makes sense unless a locale is explicitly requested for the invocation).

Toxygene 8 days ago

Another option would be to detect and/or normalize Unicode input using the recommendations from the Unicode consortium.

https://www.unicode.org/reports/tr39/

Here's the relevant bit from the doc:

> For an input string X, define skeleton(X) to be the following transformation on the string:

    Convert X to NFD format, as described in [UAX15].
    Remove any characters in X that have the property Default_Ignorable_Code_Point.
    Concatenate the prototypes for each character in X according to the specified data, producing a string of exemplar characters.
    Reapply NFD.
The strings X and Y are defined to be confusable if and only if skeleton(X) = skeleton(Y). This is abbreviated as X ≅ Y.

This is obviously talking about comparing two strings to see if they are "confusable", but if you just run the skeleton function on a string, you get a "normalized" version of it.
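
A rough Python sketch of skeleton() driven by the confusables.txt data file (it skips the Default_Ignorable_Code_Point removal step and only handles single-code-point sources):

    import unicodedata

    def load_confusables(path="confusables.txt"):
        # Data lines look like "2010 ; 002D ; MA # comment", code points in hex.
        mapping = {}
        with open(path, encoding="utf-8-sig") as f:
            for line in f:
                line = line.split("#", 1)[0].strip()
                if not line:
                    continue
                source, target, _kind = (field.strip() for field in line.split(";"))
                src = "".join(chr(int(cp, 16)) for cp in source.split())
                proto = "".join(chr(int(cp, 16)) for cp in target.split())
                mapping[src] = proto
        return mapping

    def skeleton(text, confusables):
        # TR39: NFD, replace each character with its prototype, NFD again.
        # (Per-character lookup, so any multi-code-point sources are ignored.)
        nfd = unicodedata.normalize("NFD", text)
        mapped = "".join(confusables.get(ch, ch) for ch in nfd)
        return unicodedata.normalize("NFD", mapped)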

  • jrochkind1 8 days ago

    This was my first thought -- I was specifically thinking the less typically used [K] "compatibility" normalization forms would do it.

    But in fact, none of the unicode normalization forms seem to convert a `HYPHEN` to a `HYPHEN-MINUS`. Try it, you'll see!

    Unicode considers them semantically different characters, and not normalized.

    The default normalization forms NFC and NFD, which are probably the defaults for a "unicode normalize" function, should always result in exactly equivalent glyphs (displayed the same by a given font modulo bugs), just expressed differently in unicode. Like single code point "Latin Small Letter E with Acute" (composed, NFC form); vs two code points "latin small letter e" plus "combining acute accent" (decomposed, NFD form). I would not expect them to change the hyphen characters here -- and they do not.

    The "compatibility" normalizations, abbreviated by "K" since "C" was already taken for "composed", WILL change glyphs. For instance, they will normalize a "Superscript One" `¹` or a "Circled Digit 1" `①` to an ordinary "Digit 1" (ascii 49). (which could also be relevant to this problem, and it's important all platforms expose compatibility normalization too!) NFKC for compatibility plus composed, or NFKD for compatibility plus decomposed. I expected/hoped they would change the unicode `HYPHEN` to the ascii `HYPHEN-MINUS` here.

    But they don't seem to; the Unicode Consortium decided these were not semantically equivalent even at the "compatibility" level.

    Unfortunately! I was hoping compatibility normalization would solve it too! The standard unicode normalization forms will not resolve this problem though.

    (I forget if there are some locale-specific compatibility normalizations? And if so, maybe they would normalize this? I think of compat normalization as usually being like "for search results should it match" (sure you want `1` to match `①`), which can definitely be locale specific)
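
    You can see both behaviours quickly in Python (a small sketch):

      import unicodedata

      print(unicodedata.normalize("NFKC", "\u2460"))         # CIRCLED DIGIT ONE -> "1"
      print(unicodedata.normalize("NFKC", "\u2010") == "-")  # HYPHEN stays HYPHEN: False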

    • Toxygene 7 days ago

      As you correctly observed, step one does not normalize 'HYPHEN' to 'HYPHEN-MINUS'. Instead, that occurs in step three, using the confusables data file [1].

      [1] https://www.unicode.org/Public/security/8.0.0/confusables.tx...

      • jrochkind1 6 days ago

        Aha, thanks!

        So, yeah, that technical report is about security, typically the potential problems of making a username or domain name or other identifier look like another.

        While the OP wasn't about security, it does sound like the mapping potentially has non-security uses too, as in the OP.

        (The term "normalization" with regard to unicode usually means something else, specifically NFC, NFD, NFKC, or NFKD normalization from UAX#15, making this hard to talk about clearly, not sure what word to use for this "confusables" mapping).

        I haven't actually seen this particular algorithm/mapping discussed before. I'm not sure if routines to perform the mapping are available on common languages/platforms (ruby, python, node, java) -- if someone knows how to do it with, say, Java ICU4J library, it would be useful to see an example.

        The confusables.txt file provided does look like it would make it easy to implement the mapping algorithm. I might give it a stab in ruby.

        It's a bit confusing to think about in what non-security contexts it's applicable without removing semantics you'd want.

        In fact, TR39 says "The strings skeleton(X) and skeleton(Y) are not intended for display, storage or transmission," so it's not totally clear if they'd think it was a good idea to use it in the OP's use case?

        If anyone has seen any writing on, or has any thoughts on, how to approach thinking about what non-security use cases and contexts doing this international "confusables" mapping is appropriate vs loss of semantics, I'd love to see it! Like I'm trying to think of whether you might want to map down these "confusables" for search indexing; it also seems like in some cases, especially without locale-specific data, you might be losing semantics you want to keep by doing this.

rdtsc 8 days ago

A bit off-topic but a thing that jumps out is using floats for currency. Good for examples and small demos but beware using it for anything serious.

  • lordmauve 8 days ago

    The finance industry mostly uses floats for currency, up until settlement etc.

    "What would I get for this share?" can be answered with a float.

    "What did I get for selling this share?" should probably be a fixed point value.

    • dotancohen 8 days ago

      Floats are fine for speculation. But they should not be used to record actual transactions.

      I typically use the smallest unit of a currency to store transaction amounts. E.g., for a US transaction of $10, I would store the integer 1000 because that is 1000 cents.

      • zie 7 days ago

        Or just use decimal numbers instead. Decimal libraries abound. Then you can do rounding however your jurisdiction/bank/etc does it too.
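
        For example, with Python's standard decimal module the rounding mode is an explicit choice (a sketch):

          from decimal import Decimal, ROUND_HALF_UP

          price = Decimal("10.00")
          tax = (price * Decimal("0.0825")).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
          print(tax)  # 0.83 -- rounded the way you chose, not wherever the float landed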

    • dspillett 7 days ago

      > The finance industry mostly uses floats for currency, up until settlement etc.

      Having done some work on pensions and insurance point-of-sale illustrations & related in the 00s and very early 10s, I'd say that is not correct, at least not here in the UK. Almost everything was specified as intermediate values to 4dp (so hundredths of a penny) and rounded to 2dp for final results.

      Though it wasn't consistent: one of the problems we experienced was that while actuarial departments were following this rule, others modelling for themselves in Excel or using online calculators (written in JS) were not (as those are all based on IEEE double-precision floats by default, you have to scale and manage scaled ints yourself to get accurate 4dp decimal), so we'd get reports that our calcs were off compared to what the planner's workbook gave (over the length of a pension the rounding errors can compound to quite a noticeable difference).

      I've not worked in that area for well over a decade now, so maybe things have changed towards floats being the default, but that seems odd to me as it isn't an industry that tends to be happy with reducing precision.

  • zie 7 days ago

    I would argue it's not even good for demos or examples :)

rzwitserloot 7 days ago

Isn't "turns out there are _lots_ of look-alikes, often literally pixel-for-pixel identical in most fonts at all sizes, in the unicode tables and that might cause some confusion when parsing text" like.. lesson 101 for unicode?

At any rate, I find the conclusion a bit hilarious: "Ah, the input text uses that symbol that very explicitly DOES NOT MEAN 'minus', it _ONLY_ means hyphen, and would be _the_ unicode solution if for whatever reason you want to render the notion: Hyphen followed by a _positive_ cash amount".. and.. I will completely mess up the whole point and just treat it as a minus sign after all.

What, pray tell, is the point of having all those semantically different but visually identical things in unicode when folks don't even acknowledge that what they are doing is fundamentally at odds with the very text they are reading?

  • HelloNurse 7 days ago

    There might be a social angle: the input-shitters are assumed to be right, and the IT peons have to understand user intent and make the system work. If the boss says so, hyphen means minus.

samatman 8 days ago

Still broken, alas. '−', named MINUS SIGN, U+2212, is an Sm: Symbol, math. Arguably the one which should be used, meaning the risk of actually encountering it, while ε-small, is never 0.

As ks2048 points out, the only thing for it is to collect 'em all.

Which is why (shameless plug) I wrote this: https://github.com/mnemnion/runeset

hgs3 8 days ago

Unicode-conforming regular expression engines are supposed to support the \p or \P property syntax [1], so you should be able to match hyphen characters with \p{Hyphen} or \p{Dash}.

[1] https://www.unicode.org/reports/tr18/#property_syntax

  • gknoy 7 days ago

    Thanks for linking this! I also learned that the `\p{}` syntax isn't supported in the Python `re` library, and they recommend the API-compatible `regex` library, which does support it.
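
    A small sketch with that library (assuming it's installed):

      import regex  # third-party: pip install regex

      dash_like = regex.compile(r"\p{Dash}")
      for ch in "-\u2010\u2212\u2013":
          print(hex(ord(ch)), bool(dash_like.fullmatch(ch)))  # all True: they carry the Dash property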

  • account42 7 days ago

    Very nice for Unicode to provide a solution to the problem Unicode created.

    • samatman 7 days ago

      Unicode did not create the problem of many similar-looking dash-like characters with different meanings and widths.

      It documented it, at most.

kccqzy 8 days ago

Run this:

    >>> unicodedata.category('\N{MINUS SIGN}')
    'Sm'
There you go. No need to thank me for breaking your code.

Also, nobody has yet commented on the fact that the author is also doing PDF text extraction. That's yet another area where a lot of fuzziness needs to be applied. My confidence in the author's product greatly decreased after reading this post.

SomewhatLikely 8 days ago

Where I thought this might be going from the first paragraph:

Negative numbers are sometimes represented with parentheses: (234.58)

Tables sometimes tell you in the description that all numbers are in 1000's or millions.

The dollar sign is used by many currencies, including in Australia and Canada.

I'd probably look around for some other gotchas. Here's one page on prices in general: https://gist.github.com/rgs/6509585 but interestingly it doesn't quite cover the OP's problem or the ones I brought up, though the use cases are slightly different.
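
A sketch of the parentheses convention in Python (the helper is hypothetical, not the OP's code):

    import re

    def parse_accounting_number(raw):
        # "(234.58)" is accounting notation for -234.58; strip currency symbols too.
        s = raw.strip()
        negative = s.startswith("(") and s.endswith(")")
        if negative:
            s = s[1:-1]
        value = float(re.sub(r"[^0-9.\-]", "", s))
        return -value if negative else value

    print(parse_accounting_number("($234.58)"))  # -234.58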

  • oneeyedpigeon 8 days ago

    I was certain that it was going to be a range of numbers that didn't use an en dash.

devit 8 days ago

This fix makes no sense:

    if is_hyphen(value[0]) and value[1] == "$":
        converted_value = float(re.sub(r"[^.0-9]", "", value)) * -1
If the strategy is to delete all non-numeric characters in re.sub, you should instead replace _all_ characters that could be a minus with '-' (and include '-' in what re.sub keeps) before doing the float(re.sub(...)), instead of this bizarre ad-hoc code.

Also "is_hyphen" is wrong since it doesn't handle the Unicode minus sign.

wodenokoto 8 days ago

If your source is not consistent enough to give you a consistent hyphen, there are probably a lot of other weird things slipping through the cracks.

  • tomcam 8 days ago

    That's pretty much all real-world datasets

    • tracker1 8 days ago

      Considering how many real world data sets are based on hand crafted spreadsheets, absolutely. Especially with copy pasta.

      Edit: pasta above was actually meant to be paste, but gesture input is fun. Ironically it's better this way.

  • advisedwang 8 days ago

    Probably true, but unless you are suggesting the author should abandon the product/feature, the author needs to achieve the best they can given the constraints. Stuff like fixing hyphens gets closer. There are probably a lot more such things their code will end up doing.

    • account42 7 days ago

      The author should use a validating parser instead of a simple regular expression and hoping that the result is correct. I.e. the start of the post should have been that the parser errored out rather than that the result was positive.

      • tomcam 6 days ago

        I think I disagree? Should one ever assume a dataset is sanitary?

mwkaufma 8 days ago

"For dollar figures I find a prefixed dollar symbol and convert the number following it into a float."

Bloombug red flag!!

wonnage 8 days ago

I feel like the responsible thing to do here is throw an error if you encounter an unexpected character. Others have already pointed out that there's an actual minus sign character that would break this. This code is dealing with like four different tricky/unpredictable things (parsing PDFs, parsing strings, unicode, money) and the lack of basic exception handling should raise alarm bells.

numpad0 7 days ago

Wow. That's basically what I've heard of as the Kangxi radicals problem. From what I could gather from a 5-minute search, the mechanism is:

PDFs don't use Unicode or ASCII code points, but glyph IDs used by fonts. Therefore all strings are converted to sequences of those glyph IDs. The original Unicode or ASCII text is dropped, or can be linked and embedded for convenience. In many cases, a reverse conversion from glyph ID to Unicode is silently done when text is copy-pasted or extracted from the PDF.

That silent automatic reverse conversion tends to pick the numerically smallest Unicode code point assigned to the glyph (letter shape), and many fonts reuse close-enough glyphs for obscure Unicode characters like ancient Chinese dictionary header symbols and the dozen Unicode siblings of hyphens. Unicode also tends to place those esoteric symbols higher up in the table than commonly used ones.

Therefore, through conversion into glyph IDs and back into Unicode, some simple characters like `角` or `-`, whose glyphs tend to get reused to cover those technicalities, sometimes get converted into those technicalities on the remote end.

1: https://en.wikipedia.org/wiki/Kangxi_radical

2: use TL: https://espresso3389.hatenablog.com/entry/20090526/124332747...

3: use TL: https://github.com/trueroad/tr-NTTtech05

4: use TL: https://anti-rugby.blogspot.com/2020/08/Computer001.html

chithanh 8 days ago

Seems not a good idea to roll your own. What if your software encounters U+2212 "MINUS SIGN" next?

Probably best to just transliterate to ASCII using gettext or unidecode or similar.
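
For example with unidecode (third-party; a sketch, and I believe it folds both the hyphen and the minus sign):

    from unidecode import unidecode  # pip install unidecode

    print(unidecode("\u2010$10.50"))  # expected "-$10.50"
    print(unidecode("\u2212$10.50"))  # expected "-$10.50"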

LegionMammal978 8 days ago

Does anyone here know of any actual Unicode-encoded documents that consistently use U+2010 HYPHEN for their hyphens? Among those documents that do distinguish between dash-like characters, the most common usage I've seen is to use U+002D HYPHEN-MINUS for hyphens and U+2212 MINUS SIGN for minus signs, alongside the rest of U+2013 EN-DASH, U+2014 EM-DASH, etc. U+2010 seems conspicuously absent from everything, even when the 'proper' character usage is otherwise adhered to.

  • lifthrasiir 8 days ago

    I too haven't seen any natural use of U+2010. But some Unicode characters are equally underused, often because they are historical or designed for specialized or internal uses. Here U+2010 can be thought of as a normalized form for U+002D after some processing to filter non-hyphens, which justifies its inclusion even when the character itself might not be used much.

  • red_admiral 8 days ago

    TeX definitely distinguishes between -, -- and --- in text mode (hyphen, en dash, em dash); there are packages for language-specific quotes and hyphenation rules so there may be something out there that does this - ctan/smartmn specifically seems to be dealing with this kind of thing. Mind you, TeX also allows pretty arbitrary remapping of symbols.

    • LegionMammal978 8 days ago

      Of course TeX also distinguishes between its dash-like characters. But I'm not talking about TeX but about Unicode, which is the one with the apparently-unused U+2010 HYPHEN.

  • tracker1 8 days ago

    It will depend on the source for the input. Odds are every variation of minus and hyphen has appeared in every context at some point.

    From a stylistic perspective, it may have been desired for a given appearance even if technically wrong. Just because of a given typeface. I say this as someone who was an artist before learning software programming.

  • mjevans 8 days ago

    If you ever find any, it might be time to ask if a true General AI has been developed. I really doubt most humans bother, and LLMs will copy our mistakes.

    • LegionMammal978 8 days ago

      My point is, there are plenty of documents which bother with minus signs, en-dashes, and em-dashes, including Wikipedia, the Unicode Standard itself, and well-edited online news articles. Yet they still don't bother with U+2010 in particular, which makes me question the character's usefulness.

      • keybored 8 days ago

        For people/text authors who care, hyphen-minus is already hyphen-biased: most hyphen-minuses you encounter from average text authors (who don’t care) are meant to be hyphens. And for people who care it is even more slanted:

        - They will either use `--` or `---` as poor man’s en/em-dash or use the proper symbols

        - They might use the proper minus sign but even if they don’t: post-processing can guess what is meant as “minus” in basic contexts (and even for math-heavy contexts: hyphens aren’t that common)

        Furthermore hyphen-minus is rendered as a hyphen already. Not as minus or a dash.

        It’s like a process of elimination: people who care already treat non-hyphens sufficiently different such that the usage of hyphen-minus is clear: it is just hyphen.

        For me these things are mostly about looks and author intent. Dashes look better than poor man's dashes. Hyphen-minus already looks like a hyphen. And if I use hyphen-minus then I mean hyphen.

        And for me it is less about using the correct character at the expense of possible inter-operation: the hyphen-minus is so widespread that I have no idea if 95% of software will even cope with using the real HYPHEN Unicode scalar. (I very much doubt that!)

        The last thing is keyboard usability economics. I use en-dash/em-dash a few times per paragraph at most. Hyphens can occur several times a sentence. And since I need hyphen-minus as well (see previous point about interoperability) most keyboard setups will probably need to relegate it to some modifier keybind like AltGr-something… and no one has the patience for typing such a common symbol with a modifier combo.

      • adrian_b 8 days ago

        U+2010 has the advantage that it is not ambiguous and its appearance is predictable. You can never know whether a given typeface will display U+002D as a hyphen or as a minus or en-dash.

        The reason why it is seldom used is that all keyboards by default provide only a way to easily type U+002D and the other ASCII characters. The input methods may provide some combination of keys that allows you to enter a minus or an en-dash, but nobody bothers to add an additional key combination for U+2010. The U+002D key could be reconfigured to output U+2010, but this would annoy the programmers who use programming languages where U+002D is used for minus.

        So there is no way out of this mess. In programming languages or spreadsheets U+002D is used for minus, while in documents intended for reading, U+002D is used for hyphen, and the appropriate Unicode characters are used for minus and en-dash.

        An exception among programming languages was COBOL. Originally it used only a hyphen character, which was used to improve the readability of long identifiers. This was possible because the arithmetic operations were written with words, e.g. SUBTRACT, so there was no need for a minus character.

        A few years later (December 1964), when the PL/I language was developed to replace both FORTRAN and COBOL, it introduced the underscore character, replacing the hyphen in long identifiers (where the hyphen had been used to improve readability, as in COBOL), so that the hyphen/minus character could be used to mean minus, as in FORTRAN. This convention has been inherited by most later programming languages, except by most dialects of LISP, which typically use a hyphen character in identifiers and do not use a minus character, except for the sign of numbers.

        • lispm 7 days ago

          > except by most dialects of LISP, which typically use a hyphen character in identifiers and they do not use a minus character, except for the sign of numbers.

          In Common Lisp, there is one character SP10 for Hyphen and Minus: https://www.lispworks.com/documentation/HyperSpec/Body/02_ac...

          It is used

          * in numbers as a sign -> -42

          * in a conditional read macro as a minus operator -> #-ARM64(error "This is no 64bit ARM platform")

          * in functions as a numeric minus operator -> (- 100 2/3) or (1- 100)

          * as a global variable for the currently evaluated REPL expression -> -

          * as a hyphen for symbols (incl. identifiers) -> UPDATE-INSTANCE-FOR-DIFFERENT-CLASS

        • LegionMammal978 8 days ago

          > U+2010 has the advantage that it is not ambiguous and its appearance is predictable. You can never know whether a given typeface will display U+002D as a hyphen or as a minus or en-dash.

          The thing is, I've never found a single non-monospace typeface that displays U+002D as a minus sign or en-dash: it seems to be universally rendered shorter than a U+2212 or U+2013, whenever the latter have their own glyphs in the first place. I also did some testing on my system some time back, and 99% or more of typefaces treated a U+2010 identically to a U+002D. Only one or two displayed it a smidgeon shorter than a U+002D.

          Hence my original question about whether it really is used for that purpose (or any other purpose) in practice.

          Meanwhile, you do make a good point regarding programming languages. Though it would seem mostly coincidental to me that their use cases are almost always 'hyphen' or 'minus', as opposed to any of the other meanings of a 'typewriter dash'.

red_admiral 7 days ago

A safer way to approach any parsing task is to complain if you see a character you don't expect there. If there is a character in front of the dollar sign that is not whitespace, then something is going on and you need to take a look.

eviks 8 days ago

> Inspecting the hyphen.
>
> I pulled in the standard library module unicodedata and starting checking things.

Or you could extend your editor to show the Unicode character name in the status bar and do the inspection in a more immediate way.

  • wonger_ 8 days ago

    Or in vim, `ga` when hovered over a character

jstanley 8 days ago

But if they explicitly wrote a HYPHEN instead of a HYPHEN-MINUS or some other type of minus sign, doesn't that suggest it's actually not a minus sign and the number shouldn't be negative?

  • pornel 8 days ago

    Unicode is not that semantic. It inherited ASCII (with no minus) and a ton of presentational (mis)uses of code points.

    It's so messy that Unicode discourages use of APOSTROPHE for apostrophes, and instead recommends using RIGHT SINGLE QUOTATION MARK for apostrophes.

    • oneeyedpigeon 8 days ago

      > Unicode discourages use of APOSTROPHE

      Blame fonts that render APOSTROPHE as a disgusting straight character.

      • pornel 7 days ago

        Because in ASCII it also plays the role of the left single quote, so you get a geometric compromise.

      • account42 7 days ago

        Surely you mean a pretty straight and symmetric character, the ideal all characters should aspire to.

  • chatmasta 8 days ago

    Sure, misinterpreting user intent could cost a lot of money — $100 or more, if you’re not careful.

TristanBall 7 days ago

So, I guess it's only me who learned from the comments here that there was a difference between em dash and en dash? Or that they might be different from a hyphen or a minus?

(In my defence, I don't work in any of the specialized areas where it matters, and was raised in a poor, ascii only, western household.)

I will point out that spammers and scammers have been having a field day with this kind of character confusion for years now, and a lot of software still hasn't caught up to it.

On the bright side, the very old school database I babysit for work can be convinced to output utf8, including emoji, many of which render quite well in a terminal, allowing me to create bar graphs of poo or love heart characters, which honestly makes it all worth it for me.

lifthrasiir 8 days ago

To be clear, you weren't bitten by Unicode but bitten by bad Unicode usages. Which are prevalent enough that any text processing pipeline has to be fuzzy enough to recognize them. I have seen, for example, many uses of archaic Hangul jamo ㆍ (U+318D) in place of middle dots (U+00B7) or bullets (U+2022), while middle dots and bullets themselves are often confused to each other.

  • riffraff 8 days ago

    Why bad? This is the intended use for this character

    • lifthrasiir 8 days ago

      Hyphen is a distinct (but of course very commonly confused) character from minus, which Unicode separately encodes as U+2212. Though it is also possible that an OCR somehow produced a hyphen out of nowhere.

    • keybored 7 days ago

      Hyphens in front of numbers are not the intended use of hyphen. The PDFs have mangled the symbols.

l72 7 days ago

I wrote a web scraper to scrape products from some Vinyl Record Distributors. It is amazing to me how careless (or clueless) people are with various unicode characters.

I had huge amounts of rules for "unifying" unicode, so I could then run the result through various regular expressions. It wasn't just hyphens, but I'd run into all sorts of weird characters.

It all worked, but was very brittle, and constantly had to be tweaked.

In the end, I used a machine learning model, which I wrote about here[1]

[1] https://blog.line72.net/2024/07/31/the-joys-of-parsing-using...

langsoul-com 8 days ago

If anyone has worked with spreadsheets across Mac, Windows, Linux, and various online ones, they'll know those are also a nightmare.

Some characters are encoded differently based on what system set them, so an if-statement character comparison runs into the same misery as the author did :(

bobbylarrybobby 8 days ago

I'd be concerned about `value[0]` — how does that work in the face of multi byte characters? Is all string indexing in Python O(n)? Does it store whether a given string is ascii-only and switch to constant time lookup if it is?

  • Sniffnoy 8 days ago

    Python 3 actually uses UTF-32, so it's all constant-time. A tradeoff few make, certainly!

    • lifthrasiir 8 days ago

      Or more accurately, behaves as if it is UTF-32. The actual implementation uses multiple internal representations, just like JS engines emulating UCS-2.
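
      You can see the flexible representation (PEP 393) indirectly through string sizes (a sketch):

        import sys

        print(sys.getsizeof("a" * 1000))           # ~1 byte per character
        print(sys.getsizeof("\u2010" * 1000))      # ~2 bytes per character
        print(sys.getsizeof("\U0001F600" * 1000))  # ~4 bytes per character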

      • Sniffnoy 8 days ago

        Huh! I was unaware (of both of those), thanks.

  • jerf 8 days ago

    The cost of string indexing isn't relevant for a hard-coded zero index. It affects what you might get back but that's O(1) regardless of implementation.

userbinator 8 days ago

In my current font, that hyphen looks very slightly different from the normal ASCII one - it's just a pixel shorter and located a pixel lower. If I force the charset to CP1252 then I get ‐ which is very obviously not a hyphen.

lynx23 8 days ago

And don't forget to check if your .startswith takes a regex, because -$ will give you unexpected headaches even without the multitude of hyphens.

riffraff 8 days ago

FWIW, Python regexes support checking for Unicode properties via the \p{SOME NAME} syntax, but as people said there are a lot more weird edge cases. Btw, it looks like the code may also have a couple of lurking bugs (parsing floats vs decimals, implicit locale number formatting).

I feel like all the "import data from multiple sources" code I've seen in my life grew through repeated application of edge-case handling.

jay-barronville 7 days ago

I don’t think fully relying on the Pd Unicode category is ideal though. For example, I don’t think you’d want U+2E17 to be matched too.

I think the best solution would be to match by some specific code points, and then throw an error when a strange code point is encountered.

I think it’s a mistake to try to handle every edge case in this particular case.

ReleaseCandidat 8 days ago

That's why you don't want to use regexes to parse something, but an actual tokenizer which uses Unicode _k_ompatibility normalisation (NFKC or NFKD) for comparisons. Although I'm not sure if that works with the Dollar emoji (which HN doesn't like to display)

  • riffraff 8 days ago

    Regexes can handle Unicode categories just fine; if he'd written a tokenizer it would still have failed, which in fact it did when he removed the regex.

    • ReleaseCandidat 8 days ago

      It's not about categories (they don't help with such problems), but about comparison using compatibility normalisation (either NFKC or NFKD). Using that, e.g. 1 compares equal to I and i (the Roman numeral literals), the 1 in a circle, and all the other Unicode code points which have the same meaning.

      • riffraff 8 days ago

        but that's not about using a tokenizer vs a regex, it's about using a normalization step, which would also work with the regex.

        • ReleaseCandidat 8 days ago

          Yes, that's true. Except it isn't (well, doesn't have to be) an extra step in the tokenizer. Most of the time you do not want to run the whole string or part of it through the kompatibility normalization, but just some code points (like the sign of a float). Which could of course be done with a match group of a regexp too. I have just made the observation through the last two decades that it's easier to not forget about such cases when not using regexps.

makach 8 days ago

*Bitten by regex

evOve 8 days ago

They should simply drop HYPHEN (U+2010). I don't see the purpose of having the extra identity.

bluecalm 8 days ago

Reading my code from some years ago I can see I was very frustrated by a similar problem when parsing .csv files from some financial institutions:

    # converts idiotic number format containing random junk into normal
    # represantion of a number
    def junkno_to_normal(s):
    (...)
There are so many random characters you can insert into dates or numbers. Not only hyphens but also all kinds of white space or invisible characters. It's always a warm feeling when you import a text document and not only is it encoded in UTF-8 but you see the YYYY-MM-DD date format. You know it's going to be safe from there. Unfortunately that's still very rare in my experience (even the UTF-8 bit).

cooolbear 7 days ago

> One product of mine takes reports that come in as a table that’s been exported to PDF

Here's the first problem!

I can't believe actual businesses think that a report is anything other than for human eyes to look at. Typesetting (and report generation) is for presentation, and otherwise data should be treated like data.

I mean it's a different story if the product is like "we can help process your historical, improperly-formatted documents", but if it's from someone continually generating reports, somebody really should step in and make things more... computational.

Retr0id 8 days ago

Fun fact, ISO 8601 says you should use U+2212 MINUS to express timestamps with negative timezone offsets. At least, I think it does, I'm going off the Wikipedia description: https://en.wikipedia.org/wiki/ISO_8601#Other_time_offset_spe...

  • lifthrasiir 8 days ago

    In my understanding that is a misunderstanding. I previously commented about that [1], but in short: a combined hyphen-minus character should be used for any charset based on ISO/IEC 646, which includes Unicode.

    [1] https://news.ycombinator.com/item?id=37346702