"R.Wieser" <***@not.available>
| > Unicode - 2-byte characters as used in Windows,
|
| Which I remember as "Wide character" (as the name used in the conversion
| function).
|
Yes. That one confuses me. It took me a long
time to figure out which was wide and which was multi.
Dual and multi would have made more sense. I don't
find a spatial description of bytes to be intuitive. A
2-byte character is not "fat".
| > which may not be the same as all unicode 16 and
|
| Did I already mention I'm hazy with those names ? Well, stuff like that
| (two-byte unicode != unicode 16) certainly does that to me. :-)
|
For a long time I had no awareness of anything other
than "unicode", which was the 16-bit, 2-bytes-per-character
encoding that Win32 uses internally. In VB and COM those
strings (BSTRs) are notable not only for the double
byte characters but also for the prepended, 4-byte length
indicator that allows for embedded nulls.
The term "UTF-16" is only made necessary by the existence
of UTF-32 and UTF-8. If everyone would just speak English like normal
people we wouldn't have this mess. :)
I always heard/read Windows programming people talking
about simply "unicode". My assumption is that at the time it
was thought that, to paraphrase the Gatester, "64,000
characters should be enough for anyone". And anyway, no
one actually used unicode, except *maybe* if they were writing
software for Asians, Africans, Israelis, etc. So basically it was
preparation for the future.
My text files are all ASCII/ANSI to this day.
Not all software even recognizes unicode. Then UTF-8 brings
in further complication because an encoding indicator (a BOM) is
discouraged. So Notepad can see a file as plain text unless
there's a BOM at the beginning, in which case it's unicode.
But how does Notepad or anything else recognize UTF-8?
If I save a file as UTF-8 in Notepad it will be prepended with
EF BB BF, but webpages don't have that. So it ends up creating
a politically correct culture war: We shouldn't use ANSI because
it's language-specific. We should use UTF-8, even if it screws
things up, because UTF-8 respects "diversity".
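To make the BOM point concrete, here is a minimal sketch in Python
(not VB6, only because it is easy to run and check). The helper name
sniff_bom is made up for illustration. It only looks at the first few
bytes, so a BOM-less UTF-8 webpage comes back as None, which is
exactly the problem described above.

```python
import codecs

# Hypothetical lookup table: leading byte-order mark -> encoding name.
# The UTF-32 marks are checked first because FF FE 00 00 (UTF-32 LE)
# begins with FF FE, which is also the UTF-16 LE mark.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8-sig"),      # EF BB BF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
]

def sniff_bom(data: bytes):
    """Return an encoding name if data starts with a known BOM, else None."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None
```

With no BOM present the caller still has to guess between ANSI and
UTF-8 by inspecting the bytes themselves.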
| Yep, I do know. And for some reason I got the idea that UTF-8 was referring
| to that encoding scheme. I normally refer to it as "multi byte" (again,
| from the conversion function).
I suppose it is multi-byte. And there is a UTF-8 codepage (65001).
But it's unicode insofar as it assigns unique numbers for
all characters. So it's not really a codepage in the ANSI sense
of detailing what characters bytes 128-255 should map to.
| Then again, I seem to vaguely remember that UTF-16 (two bytes per character)
| could do the same "multi byte" encoding ...
|
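Technically it can. UTF-16 uses surrogate pairs (two 16-bit
units, so 4 bytes) for anything above U+FFFF. But for ordinary
text on Windows it's 2 bytes per character in practice.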
| > and if they used UTF-8 it would potentially change
| > the number of characters when rendered as ANSI
|
| Yep. Which I would/do not find strange in any way. The same happens with
| C strings, in which you have to escape certain characters (gave me quite a
| puzzle the first time I encountered it). :-)
|
Not strange, but problematic. In the world of the late
90s and early 00s, when people were mostly only thinking
about Euro languages, where there was either one byte
or 2 bytes per character, it wasn't too hard to convert
between ANSI and unicode. The high byte was nearly always 0. :)
But if real world ANSI usage were actually multibyte then
it would quickly get complicated to deal with text. Of
course it is complicated now, in theory, but mostly only
for Asians, in practice.
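A quick illustration of that ANSI/unicode relationship, sketched in
Python. I'm assuming codepage 1252 for "ANSI" and little-endian
UTF-16 as Windows stores it. The helper name utf16_bytes is made up.
For most Western European characters the UTF-16 form is just the
ANSI byte followed by a zero byte. The exceptions are the characters
that cp1252 puts in 128-159, such as the euro sign.

```python
def utf16_bytes(text: str) -> list:
    """Return the UTF-16 LE encoding of text as a list of hex byte strings."""
    return [f"{b:02X}" for b in text.encode("utf-16-le")]

# e-acute: one byte in cp1252, same value plus a zero high byte in UTF-16.
e_acute = "\u00e9"
print(e_acute.encode("cp1252").hex().upper(), utf16_bytes(e_acute))

# Euro sign: byte 0x80 in cp1252, but U+20AC in unicode, so no zero byte.
euro = "\u20ac"
print(euro.encode("cp1252").hex().upper(), utf16_bytes(euro))
```

So "pad with a zero" works for ASCII and most accented letters, but a
real conversion still needs the codepage table.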
I ended up writing a VB6 function for my HTML editor to
check for UTF-8. I find it takes less than 15 ms to check
up to 100KB of data, so it's an almost instant ID, which
allows me to support UTF-8 seamlessly. I open the file
and inspect the bytes before loading it into the RichEdit,
at which point I have to tell the RichEdit how to load it.
But there are still complications. This is for HTML so it
assumes an ANSI-type file. In other words, not UTF-16
and without a BOM. It only searches until it finds, or
doesn't find, a byte combination invalid in UTF-8.
Public Function IsItUTF8(sFile As String) As Boolean
  Dim bFile() As Byte
  Dim iB As Long, SizFile As Long, LenF As Long
  Dim FF As Integer
  Dim BooU8 As Boolean, BooU8Char As Boolean
    IsItUTF8 = False
    On Error Resume Next
    FF = FreeFile()
    Open sFile For Binary As #FF
    LenF = LOF(FF)
    If LenF > 100000 Then
      ReDim bFile(100000) As Byte
    Else
      ReDim bFile(LenF) As Byte
    End If
    Get #FF, , bFile()
    Close #FF
      '--just quit and call it ansi if there's an error opening file.
    If Err.Number <> 0 Then Exit Function
    SizFile = UBound(bFile) - 3
    If SizFile < 10 Then Exit Function '-- don't go negative for a tiny file.
    BooU8Char = False
    BooU8 = True
    iB = 0
      '-- UTF-8 characters will be: 240+/128+/128+/128+  224+/128+/128+  194+/128+
      '-- Anything not fitting that pattern will not be a UTF-8 character. So
      '-- a single byte over 127, a byte of 240 or over not followed by 3 bytes
      '-- in the range 128-191, etc.
      '-- Most functions like this are designed to default to UTF-8: If it's not
      '-- *faulty* UTF-8 then it's UTF-8. This function does it the other way:
      '-- If it's faulty UTF-8 or if it's ASCII then it's not UTF-8.
      '-- Note: the narrower second-byte ranges after 224, 237, 240 and 244
      '-- are not checked, so a few invalid sequences can still pass.
    Do While iB < SizFile
      Select Case bFile(iB)
        Case Is < 128 'ascii range
          iB = iB + 1
        Case Is < 194, Is > 244
            '-- 128-191 can only appear as continuation bytes.
            '-- 192, 193 and 245 to 255 are invalid in utf-8.
          BooU8 = False
          Exit Do
        Case Is > 239 '4-byte character
          If ((bFile(iB + 1) < 128) Or (bFile(iB + 1) > 191)) _
              Or ((bFile(iB + 2) < 128) Or (bFile(iB + 2) > 191)) _
              Or ((bFile(iB + 3) < 128) Or (bFile(iB + 3) > 191)) Then
            BooU8 = False
            Exit Do
          Else
            BooU8Char = True
          End If
          iB = iB + 4
        Case Is > 223 '3-byte character
          If ((bFile(iB + 1) < 128) Or (bFile(iB + 1) > 191)) _
              Or ((bFile(iB + 2) < 128) Or (bFile(iB + 2) > 191)) Then
            BooU8 = False
            Exit Do
          Else
            BooU8Char = True
          End If
          iB = iB + 3
        Case Else ' > 193 and < 224, 2-byte character
          If (bFile(iB + 1) < 128) Or (bFile(iB + 1) > 191) Then
            BooU8 = False
            Exit Do
          Else
            BooU8Char = True
          End If
          iB = iB + 2
      End Select
    Loop
    If BooU8 = False Or BooU8Char = False Then
      IsItUTF8 = False
    Else
      IsItUTF8 = True
    End If
End Function
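
For comparison, the same policy (ASCII-only or faulty UTF-8 means
"not UTF-8") can be sketched in a few lines of Python by leaning on
the built-in strict decoder. This is not a port of the VB6 above. The
name is_it_utf8 and the limit parameter are made up for illustration,
it takes bytes rather than a file name, and it has no tiny-file
cutoff. Python's decoder is also stricter than my byte checks: it
rejects overlong forms and surrogates too.

```python
import codecs

def is_it_utf8(data: bytes, limit: int = 100_000) -> bool:
    """True only if the sample is valid UTF-8 AND has a non-ASCII character."""
    sample = data[:limit]
    if sample.isascii():
        return False  # pure ASCII (or empty): call it ANSI
    decoder = codecs.getincrementaldecoder("utf-8")(errors="strict")
    try:
        # If the sample was cut off at limit, a multi-byte character may be
        # split at the end, so only treat the input as final when it was
        # not truncated.
        decoder.decode(sample, final=len(data) <= limit)
    except UnicodeDecodeError:
        return False  # invalid byte combination: not UTF-8
    return True
```

Usage would be something like is_it_utf8(open(path, "rb").read()),
then picking the load mode from the result.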