file.ReadAll - another quirk

R.Wieser

2019-11-20 11:54:24 UTC

Hello all,

Just a heads-up:

Some time ago there was a discussion about file.ReadAll having a problem
with files that have binary contents.

Just today I found that it also has problems with absolutely nothing. That
is, when reading an empty file. :-) It then throws a "reading past the end
of file" error.

In short, there are more reasons to avoid using 'ReadAll' ...

Regards,
Rudy Wieser

Mayayana

2019-11-20 13:39:57 UTC

"R.Wieser" <***@not.available> wrote

| Just today I found that it also has problems with absolutely nothing. That
| is, when reading an empty file. :-) It then throws a "reading past the end
| of file" error.
|

Thanks. I never noticed that before. I guess it shows
once more how seriously Microsoft took IT people in 1998.
They apparently hired a handful of college students to
write scrrun.dll and assumed that no one would ever need
it for more than parsing a log file.

JJ

2019-11-21 17:00:14 UTC

Post by R.Wieser
Hello all,
Some time ago there was a discussion about file.ReadAll having a problem
with files that have binary contents.
Just today I found that it also has problems with absolutely nothing. That
is, when reading an empty file. :-) It then throws a "reading past the end
of file" error.
In short, there are more reasons to avoid using 'ReadAll' ...
Regards,
Rudy Wieser

IMO, that's not a quirk. ReadAll() does what its description says:

"Returns all characters from an input stream."

It doesn't say "if any".

So, if you make it read an empty file, it'll assume that there's data in the
file, but there is none. Thus the triggered exception.

Mayayana

2019-11-21 17:50:42 UTC

"JJ" <***@vfemail.net> wrote

| So, if you make it read an empty file, it'll assume that there's data in the
| file, but there is none. Thus the triggered exception.

I'd expect a zero-length string. But to be honest,
I've never really given it any thought. Typically I
add "on error resume next" to my scripts after they're
all done and polished. The reason for that is just this
kind of thing -- a silly error that's probably not going to
affect the result, and I don't want it to stop the
whole script.

In real world usage such errors are actually common.
For instance, if I want to clean my TEMP folders I'll
get errors about files in use. But if I use a script to
do it and add error trapping then it completes smoothly
and deletes everything that it can delete.

I don't think it ever occurred to me to do
something like, s = TS.ReadAll: If Len(s) > 0 then...

But with error trapping I guess the result is the same.
I typically wrap it in a small function. s = ReadFile(path).
So with error trapping in that function, for a blank
file I'll get back "".

R.Wieser

2019-11-21 19:39:52 UTC

JJ,
Nope.

Post by JJ
"Returns all characters from an input stream."

Correct. That would normally mean that the result would be an empty string,
not a "reading past the end of file" error.

Besides, even the error is wrong: If there is nothing to read why is it then
(forcefully) trying to do so anyway?

Post by JJ
It doesn't say "if any".

Please, do *not* go the "they have not explicitly said that ..." road. You
see, the reverse is also true, with the end result being that pretty-much
anything goes, even random results that have nothing to do with the provided
filename.

Heck, it would even include deleting the file or converting its contents
into runes.

Post by JJ
So, if you make it read an empty file, it'll assume that there's data in
the file, but there is none. Thus the triggered exception.

Then explain to me why it can stop exactly at the end of /any/ contents, no
matter how short or long, but is too stupid to detect a string-length of
zero. That simply does not make any sense.

No kiddo, as far as I can tell you're currently just trolling.

Regards,
Rudy Wieser

Mayayana

2019-11-21 20:04:25 UTC

"R.Wieser" <***@not.available> wrote

| Then explain to me why it can stop exactly at the end of /any/ contents, no
| matter how short or long, but is too stupid to detect a string-length of
| zero. That simply does not make any sense.
|

Actually, as we've found before, it doesn't seem to
do that. It assumes content without checking the file
size. Then it reads until it hits a null. I assume it's
using API like CreateFile/ReadFile (or maybe VC++
functions) to get the file into a string. Then it sees
the end of the string as the first null, unless given
a specific string length to return.

I think all you can do is say, "Thanks, Scripting Guys!"

As I understand it, they're the bright bulbs who were
tasked with creating scrrun *and wrote it in VC6++!*.
So it's a Windows system library but it has to load the
VC6++ runtime.

R.Wieser

2019-11-21 20:46:59 UTC

Mayayana,

Post by Mayayana
Actually, as we've found before, it doesn't seem to
do that. It assumes content without checking the file
size.

Not quite. Remember that it returns a string the size of the file, but
that its contents are buggy?

Post by Mayayana
Then it reads until it hits a null.

Again, not quite. It's the copying that is done after reallocating for a
bigger blob that is the culprit. Yes, I disassembled the involved code.
:-)

And just print out all the bytes in the string one-by-one and take a look at
the ones beyond the last whole multiple of ... 260 IIRC. Compare with the
last part of the file. You will see that they match. In other words, it
does read the whole thing.

Post by Mayayana
I think all you can do is say, "Thanks, Scripting Guys!"

/That/ I do agree with you. :-)

Regards,
Rudy Wieser

Mayayana

2019-11-21 21:31:32 UTC

"R.Wieser" <***@not.available> wrote

| > Then it reads until it hits a null.
|
| Again, not quite. It's the copying which is done after reallocating for a
| bigger blob which is the culprit. Yes, I disassembled the involved code.
| :-)
|

Ah. I never noticed that. It does actually get the entire file.
So what's different? If I ReadAll a GIF or Read(filelen) a GIF
I apparently get the same bytes. But the former will stop at the
first null when I try to read it while the latter will not. I don't
understand what you mean by "copying after reallocating".

R.Wieser

2019-11-22 08:23:46 UTC

Mayayana,

Post by Mayayana
But the former will stop at the first null when I try to read

I'm not sure what you mean by "when I try to read" there. The whole file
gets processed, as is proven by the last bytes of the resulting malformed
string matching the last bytes of the file.

Post by Mayayana
If I ReadAll a GIF or Read(filelen) a GIF I apparently get the same bytes.

That fully depends on the size of the file. The readall code internally
works in increments of 260 bytes. Take a file of at least that size and the
corruption will take place.

Post by Mayayana
But the former will stop at the first null when I try to read it
while the latter will not.

Again, /it doesn't stop/.

Post by Mayayana
I don't understand what you mean by "copying after reallocating".

The "readall" method goes (simplified) like this:

- while not at EOF
  - try to read 260 bytes of data
  - convert from utf-8 to widestring
  - allocate a new blob of memory the size of the old one plus the size of
    the wide string
  - copy the contents of the old blob into the new blob <=== !!
  - copy the widestring into the new blob (at an offset equal to the old
    blob's size)
- wend
- return contents of the "new blob"

It goes wrong at the "copy the contents of the old blob into the new blob"
point. That copy routine stops at the first zero (leaving the remainder of
the "new blob" undefined) when it should just have copied the whole
specified block (copy X bytes from Y to Z).
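
The steps above can be sketched as a small Python model. This is only an illustration of the described behaviour, not the actual scrrun.dll code. The names CHUNK, FILLER, buggy_copy and simulated_readall are made up for this sketch, one list element stands in for one wide character, the conversion to a wide string is replaced by simply widening each byte, and the "undefined" part of the new blob is modelled with a fixed filler value.

```python
CHUNK = 260        # read size reported in the thread
FILLER = 0xFFFF    # assumption: stands in for "undefined" memory


def buggy_copy(dst, src):
    """Copy src into dst, but stop at the first zero, like a C string copy."""
    for i, unit in enumerate(src):
        if unit == 0:
            break
        dst[i] = unit


def simulated_readall(data):
    """Model of the chunked read loop; returns a list of wide-character values."""
    blob = []
    for start in range(0, len(data), CHUNK):
        wide = list(data[start:start + CHUNK])    # stand-in for the wide-string conversion
        new_blob = [FILLER] * (len(blob) + len(wide))
        buggy_copy(new_blob, blob)                # <=== the faulty step
        new_blob[len(blob):] = wide               # append the new chunk at the old size
        blob = new_blob
    return blob


if __name__ == "__main__":
    # 769 bytes with a single zero at index 2
    data = bytes([65, 66, 0, 67]) + bytes(range(1, 256)) * 3
    result = simulated_readall(data)
    tail = len(data) % CHUNK
    print(len(result) == len(data))               # length still equals the file size
    print(result[:2] == list(data[:2]))           # content before the zero survives
    print(result[2] == FILLER)                    # content after the zero is lost
    print(result[-tail:] == list(data[-tail:]))   # the last partial chunk matches the file
```

In this model a file shorter than one chunk comes back intact, embedded zeroes included, because the faulty copy only ever runs on an empty old blob.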

Regards,
Rudy Wieser

Mayayana

2019-11-22 14:14:33 UTC

"R.Wieser" <***@not.available> wrote

| It goes wrong at the "copy the contents of the old blob into the new blob"
| point. That copy routine stops at the first zero (leaving the remainder of
| the "new blob" undefined) when it should just have copied the whole
| specified block (copy X bytes from Y to Z).
|

I see. Thanks. I probably never would have guessed it
was multiple reads. I've seen that kind of activity before
in Filemon logs but don't understand why a file would be
read so inefficiently. Maybe it's a relic of ReadFile API?

Initially I'd had problems with file content
seeming to stop at the first null. I didn't look further because
the method was clearly unusable. But I also assumed it
was a problem of assuming string content and thus FSO looking
for a null as end of string, rather than reading a specified
number of bytes. From your description it appears that is
the problem, but in a more complicated way than I imagined.

R.Wieser

2019-11-22 14:58:24 UTC

Mayayana,

Post by Mayayana
but don't understand why a file would be read so inefficiently.
Maybe it's a relic of ReadFile API?

It's possibly related to yesteryear's low-memory 'puters (in comparison to
the current ones). Reading everything and only then converting costs three
times the size of the file (the data itself plus the resulting wide-string
output, which is twice as large). Then again, the rather inefficient memory
management (adding bite-sized parts) can't be good either ...

Post by Mayayana
From your description it appears that is the problem, but in a
more complicated way than I imagined.

:-) Yep.

Regards,
Rudy Wieser

JJ

2019-11-22 12:13:56 UTC

Post by R.Wieser
Mayayana,

Post by Mayayana
Actually, as we've found before, it doesn't seem to
do that. It assumes content without checking the file
size.

Not quite. Remember that it returns a string the size of the file, but
that its contents are buggy ?

Post by Mayayana
Then it reads until it hits a null.

Again, not quite. It's the copying which is done after reallocating for a
bigger blob which is the culprit. Yes, I disassembled the involved code.
:-)

That's a character translation problem which is related to character
encoding. It's an entirely different (buggy) matter.

R.Wieser

2019-11-22 14:47:36 UTC

JJ,

Post by JJ

Post by R.Wieser
Again, not quite. It's the copying which is done after reallocating for a
bigger blob which is the culprit. Yes, I disassembled the involved code.

That's a character translation problem which is related to
character encoding. It's an entirely different (buggy) matter.

Nope, it's not. The block-copying code doesn't translate anything.

Besides, the problem doesn't appear for small (< 260 chars) files, or for
the last few bytes (filesize modulo 260) - and that includes any embedded
zeroes. If the encoding had been the problem, that would not have
been possible.

Regards,
Rudy Wieser

Mayayana

2019-11-22 15:10:48 UTC

"R.Wieser" <***@not.available> wrote

| > That's a character translation problem which is related to
| > character encoding. It's an entirely different (buggy) matter.
|
| Nope, it's not. The block-copying code doesn't translate anything.
|

That brings up another issue. UTF-8 was pretty much
unknown, maybe non-existent, when scrrun came out.
What FSO does is, I think, what VB6 does: Externally
it defaults to codepage ANSI while internally it stores
strings as unicode. (The help file says it defaults to ASCII,
but actually it's ANSI.)

That explains what you saw. Bulking up to a unicode
string as it reads in. But I very much doubt something
like 80 CE 32 would be translated to the u-16 equivalent.
It just comes through (fortunately) as 3 ANSI characters.
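
To make the 80 CE 32 example concrete, here is a small Python sketch of what a code-page decode does with those three bytes. It assumes Windows-1252 as the ANSI code page (an assumption; the thread does not say which code page was active) and it says nothing about which code page scrrun.dll actually uses.

```python
raw = bytes.fromhex("80CE32")

# Decoded as Windows-1252 (assumed ANSI code page): three bytes, three characters.
ansi_text = raw.decode("cp1252")
print(len(ansi_text), [hex(ord(c)) for c in ansi_text])

# Encoding back with the same code page returns the original bytes.
print(ansi_text.encode("cp1252") == raw)

# Decoded as UTF-8 the same bytes are not valid: 0x80 cannot start a sequence.
try:
    raw.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False
print(valid_utf8)
```

Under that assumption the character count stays at three and the round trip is lossless, even though byte 0x80 is stored internally as the wide character U+20AC rather than as 0x0080.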

R.Wieser

2019-11-22 17:37:18 UTC

Mayayana,

Post by Mayayana
That brings up another issue. UTF-8 was pretty much
unknown, maybe non-existent, when scrrun came out.

Well, you got me there I'm afraid. I'm still a bit hazy about the names of
the different multi-byte encodings.

The "readall" code pulls the read bytes through the MultiByteToWideChar
kernel32 function, and stores the two-bytes-per-character result.

Post by Mayayana
But I very much doubt something like 80 CE 32 would be
translated to the u-16 equivalent. It just comes through
(fortunately) as 3 ANSI characters.

I thought that the above MultiByteToWideChar call would take care of that.
Though there is a possibility that the "readall" code checks for a UTF-8
header (EF BB BF) before setting a flag to do so. (I really should
re-examine the disassembled code some time ...)

Regards,
Rudy Wieser

Mayayana

2019-11-22 23:40:29 UTC

"R.Wieser" <***@not.available> wrote

| I'm still a bit hazy about the names of
| the different multi-byte encodings.

I'm using the Windows terms, which are not always
the same as what other people use.

ASCII - bytes 0-127, which are always the same,
in any encoding, but are paired with nulls in unicode.

ANSI - bytes 0-255, in which 128+ are rendered
according to the local codepage while 0-127 match
ASCII. So English speakers (and I think most
Europeans) get specific symbols for 128+, but
Russians and Turks, for example, get characters in
their language.
The Asian multi-byte languages are the only
exception. They can have more than 1 byte per
character in their codepage.
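
A short Python sketch of the point about code pages: the same byte above 127 decodes to a different character depending on the code page, and a multi-byte code page can use two bytes for one character. The code pages used below (cp1252 Western European, cp1251 Cyrillic, cp932 Japanese) are picked as examples and are not from the thread.

```python
byte = b"\xE0"
western = byte.decode("cp1252")     # Western European code page
cyrillic = byte.decode("cp1251")    # Cyrillic code page
print(hex(ord(western)), hex(ord(cyrillic)))    # same byte, different code points

# Bytes 0-127 decode identically in both code pages.
print(b"plain ASCII".decode("cp1252") == b"plain ASCII".decode("cp1251"))

# In a multi-byte code page such as cp932, two bytes can form one character.
hiragana_a = b"\x82\xA0".decode("cp932")
print(len(hiragana_a))
```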

That's how I ended up using FSO for VBS binary
operations. If it's handled carefully, and you're not
using an Asian multibyte codepage, then it works.
It really doesn't matter whether Windows thinks the
byte represents a dollar sign or an Arabic character.

Unicode - 2-byte characters as used in Windows,
which may not be the same as all unicode 16 and
is not the same as unicode-32.

As far as I know, in Windows generally, only ANSI
and unicode are relevant. Win32 is using unicode
under the covers but provides ANSI as default for
VB, VBS, older versions of notepad, etc. (As you may
know, in VB it's actually not easy to access the unicode
version. One must use the string pointer directly because
when the variable is referenced there's an automatic
conversion to/from unicode.)

It gets confusing because
"multi-byte" sounds like unicode but instead refers
to ANSI encoding which *could* use multiple bytes.
(You probably know that, too, but I'm not sure how
many others do.)

UTF-8 - That one is fairly new to me. I've heard
it's now the standard for plain text on Linux. And it's
become the standard for plain text in webpages. For
obvious reasons: The vast majority of webpages are
valid ASCII, anyway. And ASCII matches the 0-127
in UTF-8. So there's no upset in switching, except for
the people who want to do things like use curly braces
in UTF-8, which render as gibberish in ANSI.

I added UTF-8 support to my own HTML editor since
it's now standard. The editor uses a RichEdit window.
But interestingly, support for UTF-8 in RichEdit seems
to be new and is almost entirely undocumented. I just
happened to come across a note somewhere. It wasn't
listed in the official docs. But I tried the sample code
I found, to load a file as UTF-8, and it worked.

Of course, that's only partially useful. If the chosen font
is not unicode it makes no difference! And I think the only
unicode font I have is MS Arial. I like Verdana for coding.
So I don't get the benefit of my own UTF-8 support. :)

| > But I very much doubt something like 80 CE 32 would be
| > translated to the u-16 equivalent. It just comes through
| > (fortunately) as 3 ANSI characters.
|
| I thought that the above MultiByteToWideChar call would take care of that.
| Though there is a possibility that the "readall" code checks for a UTF-8
| header (EF BB BF) before setting a flag to do so. (I really should
| re-examine the disassembled code some time ...)
|

First parameter is codepage. Surprisingly, UTF-8
is one possibility there. But I'd guess they're using
ANSI codepage. That's the way it seems to come through
and if they used UTF-8 it would potentially change
the number of characters when rendered as ANSI
(or what the help is calling ASCII.)

R.Wieser

2019-11-23 08:19:37 UTC

Mayayana,

Post by Mayayana
I'm using the Windows terms, which are not always
the same as what other people use.

That (different names for the same encoding) might well be part of my
problem ...

I of course know ASCII, and later on ANSI.

Post by Mayayana
Unicode - 2-byte characters as used in Windows,

Which I remember as "Wide character" (as the name used in the conversion
function).

Post by Mayayana
which may not be the same as all unicode 16 and

Did I already mention I'm hazy with those names? Well, stuff like that
(two-byte unicode != unicode 16) certainly does that to me. :-)

Post by Mayayana
It gets confusing because "multi-byte" sounds like
unicode but instead refers to ANSI encoding which
*could* use multiple bytes.
(You probably know that, too, but I'm not sure how
many others do.)

Yep, I do know. And for some reason I got the idea that UTF-8 was referring
to that encoding scheme. I normally refer to it as "multi byte" (again,
from the conversion function).

Then again, I seem to vaguely remember that UTF-16 (two bytes per character)
could do the same "multi byte" encoding ...

Post by Mayayana
and if they used UTF-8 it would potentially change
the number of characters when rendered as ANSI

Yep. Which I would/do not find strange in any way. The same happens with
C strings, in which you have to escape certain characters (gave me quite a
puzzle the first time I encountered it). :-)

Regards,
Rudy Wieser

Mayayana

2019-11-23 15:09:26 UTC

"R.Wieser" <***@not.available>

| > Unicode - 2-byte characters as used in Windows,
|
| Which I remember as "Wide character" (as the name used in the conversion
| function).
|

Yes. That one confuses me. It took me a long
time to figure out which was wide and which was multi.
Dual and multi would have made more sense. I don't
find a spatial description of bytes to be intuitive. A
2-byte character is not "fat".

| > which may not be the same as all unicode 16 and
|
| Did I already mention I'm hazy with those names? Well, stuff like that
| (two-byte unicode != unicode 16) certainly does that to me. :-)
|

For a long time I had no awareness of anything other
than "unicode", which was the 16-bit, 2-bytes-per-character
that Win32 uses internally, notable not only for the double
byte characters but also for the prepended, 4-byte length
indicator that allowed for embedded nulls.

The term "unicode 16" is only made necessary by the invention
of unicode 32. If everyone would just speak English like normal
people we wouldn't have this mess. :)

I always heard/read Windows programming people talking
about simply "unicode". My assumption is that at the time it
was thought that, to paraphrase the Gatester, "64,000
characters should be enough for anyone". And anyway, no
one actually used unicode, except *maybe* if they were writing
software for Asians, Africans, Israelis, etc. So basically it was
preparation for the future.
My text files are all ASCII/ANSI to this day.
Not all software even recognizes unicode. Then UTF-8 brings
in further complication because an encoding indicator is
discouraged. So Notepad can see a file as plain text unless
there's a BOM at the beginning, in which case it's unicode.
But how does Notepad or anything else recognize UTF-8?
If I save a file as UTF-8 in Notepad it will be prepended with
EF BB BF, but webpages don't have that. So it ends up creating
a politically correct culture war: We shouldn't use ANSI because
it's language-specific. We should use UTF-8, even if it screws
things up, because UTF-8 respects "diversity".

| Yep, I do know. And for some reason I got the idea that UTF-8 was referring
| to that encoding scheme. I normally refer to it as "multi byte" (again,
| from the conversion function).

I suppose it is multi-byte. And there is a UTF-8 codepage.
But it's unicode insofar as it assigns unique numbers for
all characters. So it's not really a codepage in the ANSI sense
of detailing what characters bytes 128-255 should map to.

| Then again, I seem to vaguely remember that UTF-16 (two bytes per character)
| could do the same "multi byte" encoding ...
|
I don't think so. Not on Windows.
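
For reference, UTF-16 as specified does have a variable-length case: code points above U+FFFF are stored as a surrogate pair of two 16-bit units. A minimal Python sketch follows, using U+1F600 purely as an example code point (it is not from the thread). Whether any given Windows component handles such pairs correctly is a separate question that this does not answer.

```python
bmp_char = "\u20ac"           # inside the 16-bit range
astral_char = "\U0001F600"    # above U+FFFF

print(len(bmp_char.encode("utf-16-le")))    # bytes used by a single 16-bit unit

encoded = astral_char.encode("utf-16-le")
print(len(encoded), encoded.hex())          # two 16-bit units: a surrogate pair
```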

| > and if they used UTF-8 it would potentially change
| > the number of characters when rendered as ANSI
|
| Yep. Which I would/do not find strange in any way. The same happens with
| C strings, in which you have to escape certain characters (gave me quite a
| puzzle the first time I encountered it). :-)
|
Not strange, but problematic. In the world of late
90s, early 00s, when people were mostly only thinking
about Euro languages, where there was either one byte
or 2 bytes per character, it's not too hard to convert
between ANSI and unicode. The first byte was always 0. :)
But if real world ANSI usage were actually multibyte then
it would quickly get complicated to deal with text. Of
course it is complicated now, in theory, but mostly only
for Asians, in practice.

I ended up writing a VB6 function for my HTML editor to
check for UTF-8. I find it takes less than 15 ms to check
up to 100KB of data, so it's an almost instant ID, which
allows me to support UTF-8 seamlessly. I open the file
and inspect the bytes before loading it into the RichEdit,
at which point I have to tell the RichEdit how to load it.
But there are still complications. This is for HTML so it
assumes an ANSI-type file. In other words, not unicode-16
and without a BOM. It only searches until it finds, or
doesn't find, a byte combination invalid in UTF-8.

Public Function IsItUTF8(sFile As String) As Boolean
    Dim bFile() As Byte
    Dim iB As Long, SizFile As Long, LenF As Long
    Dim FF As Integer
    Dim BooU8 As Boolean, BooU8Char As Boolean

    IsItUTF8 = False
    On Error Resume Next
    FF = FreeFile()
    Open sFile For Binary As #FF
    LenF = LOF(FF)
    If LenF > 100000 Then
        ReDim bFile(100000) As Byte
    Else
        ReDim bFile(LenF) As Byte
    End If
    Get #FF, , bFile()
    Close #FF

    '--just quit and call it ansi if there's an error opening file.
    If Err.Number <> 0 Then Exit Function
    SizFile = UBound(bFile) - 3
    If SizFile < 10 Then Exit Function '-- don't go negative for a tiny file.
    BooU8Char = False
    BooU8 = True
    iB = 0
    '-- UTF-8 characters will be: 240+/128+/128+/128+ 224+/128+/128+ 192+/128+
    '-- anything not fitting that pattern will not be a UTF-8 character. So
    '-- a single byte over 127, a byte over 240 not followed by 3 bytes over 127, etc.
    '-- Most functions like this are designed to default to UTF-8: If it's not
    '-- *faulty* UTF-8 then it's UTF-8. This function does it the other way:
    '-- If it's faulty UTF-8 or if it's ASCII then it's not UTF-8.
    Do While iB < SizFile
        Select Case bFile(iB)
            Case Is < 128 'ascii range
                iB = iB + 1

            Case Is < 194, Is > 244 '128-191 can only appear as continuation bytes.
                BooU8 = False '245 to 255 are invalid in utf-8. 192, 193 are invalid.
                Exit Do

            Case Is > 239
                If ((bFile(iB + 1) < 128) Or (bFile(iB + 1) > 191)) _
                   Or ((bFile(iB + 2) < 128) Or (bFile(iB + 2) > 191)) _
                   Or ((bFile(iB + 3) < 128) Or (bFile(iB + 3) > 191)) Then
                    BooU8 = False
                    Exit Do
                Else
                    BooU8Char = True
                End If
                iB = iB + 4

            Case Is > 223
                If ((bFile(iB + 1) < 128) Or (bFile(iB + 1) > 191)) _
                   Or ((bFile(iB + 2) < 128) Or (bFile(iB + 2) > 191)) Then
                    BooU8 = False
                    Exit Do
                Else
                    BooU8Char = True
                End If
                iB = iB + 3

            Case Else ' > 193 and < 224
                If (bFile(iB + 1) < 128) Or (bFile(iB + 1) > 191) Then
                    BooU8 = False
                    Exit Do
                Else
                    BooU8Char = True
                End If
                iB = iB + 2

        End Select
    Loop

    If BooU8 = False Or BooU8Char = False Then
        IsItUTF8 = False
    Else
        IsItUTF8 = True
    End If
End Function
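
For comparison, a minimal Python sketch of roughly the same decision rule (valid UTF-8 with at least one multi-byte character, otherwise treat the file as ANSI). It swaps the hand-written byte scan for Python's built-in strict UTF-8 decoder, which is somewhat stricter than the VB6 routine above (it also rejects overlong forms and surrogates), and it does not copy the VB6 routine's early exit for tiny files. The function name looks_like_utf8 and the limit parameter are made up for this sketch.

```python
import codecs


def looks_like_utf8(data, limit=100000):
    """Return True only for valid UTF-8 that contains at least one non-ASCII byte."""
    sample = data[:limit]
    if all(b < 128 for b in sample):
        return False                      # pure ASCII: call it ANSI
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        # final=False tolerates a multi-byte sequence cut off by the size limit;
        # for a file that was read completely the check is strict to the end.
        decoder.decode(sample, final=len(data) <= limit)
    except UnicodeDecodeError:
        return False
    return True


if __name__ == "__main__":
    print(looks_like_utf8(b"plain text"))
    print(looks_like_utf8("caf\u00e9".encode("utf-8")))
    print(looks_like_utf8(b"caf\xe9"))    # the same word saved as Windows-1252
```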

R.Wieser

2019-11-23 16:36:12 UTC

Mayayana,

Post by Mayayana
notable not only for the double byte characters but also
for the prepended, 4-byte length indicator that allowed
for embedded nulls.

The prepending of the length of the string is not universal throughout Windows.
For instance, all of the OS's DLLs I know of use zero-terminated strings.
As for the script engine? As far as I know, those length-prepended strings
are referred to there as BStr (no idea where that "B" comes from, by the
way).
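
The difference between the two string layouts can be sketched in Python. The helper names make_bstr_like, read_length_prefixed and read_zero_terminated are made up for this illustration. The layout modelled (a 4-byte byte-count prefix, UTF-16LE data, a two-byte zero terminator) follows the documented BSTR format, but the code only manipulates byte strings and does not call any Windows API.

```python
import struct


def make_bstr_like(text):
    """Build: 4-byte little-endian byte count + UTF-16LE data + 2-byte terminator."""
    payload = text.encode("utf-16-le")
    return struct.pack("<I", len(payload)) + payload + b"\x00\x00"


def read_length_prefixed(buf):
    """Trust the length prefix, so embedded nulls survive."""
    (byte_count,) = struct.unpack_from("<I", buf, 0)
    return buf[4:4 + byte_count].decode("utf-16-le")


def read_zero_terminated(buf):
    """Ignore the prefix and stop at the first zero character, like a C wide string."""
    units = []
    for offset in range(4, len(buf) - 1, 2):
        (unit,) = struct.unpack_from("<H", buf, offset)
        if unit == 0:
            break
        units.append(unit)
    return "".join(chr(u) for u in units)


buf = make_bstr_like("AB\x00CD")
print(repr(read_length_prefixed(buf)))    # all five characters, null included
print(repr(read_zero_terminated(buf)))    # cut off at the embedded null
```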

Post by Mayayana
The term "unicode 16" is only made necessary by the invention
of unicode 32. If everyone would just speak English like normal
people we wouldn't have this mess. :)

Shucks. And I have always been wondering why babies all over the world get
fed all those strange languages instead of just being allowed to speak Dutch,
the language they are born with (guess which nationality I have). :-p

Post by Mayayana
My text files are all ASCII/ANSI to this day.

Other than "back then" when the graphical single and double-line wall
characters were considered default I can't remember having used anything
other than straight ASCII.

Post by Mayayana
If I save a file as UTF-8 in Notepad it will be prepended
with EF BB BF, but webpages don't have that.

Don't be too sure about that last part. I'm in the habit of saving
webpages with information I think could disappear and then cleaning them up,
and now and again I have to remove such a BOM.

Post by Mayayana
| could do the same "multi byte" encoding ...
|
I don't think so. Not on Windows.

I should have stressed the "could" a bit more. I have not yet seen it being
used anywhere either. Possibly in one of the (Asian) countries where a
/lot/ of characters (many more than our meager ASCII set) are the norm.

Post by Mayayana
Public Function IsItUTF8(sFile As String) As Boolean

[snip code]

I've done something similar (in VBScript), but for the purpose of converting
multi-byte sequences (in the above-mentioned saved HTML pages) into &#????;
ones - the latter normally display something readable, while the former
do not. Though I masked the upper bits of the first byte and checked,
after which I OR-ed the following bytes together, and only then compared with
128. Not sure if that would be faster or slower though.
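
A comparable conversion can be sketched in Python, which has a built-in error handler for it. This is not the VBScript routine described above (that one decodes the UTF-8 sequences by hand); it only shows the same end result. The function name utf8_to_entities is made up for this sketch.

```python
def utf8_to_entities(raw):
    """Decode UTF-8 bytes and replace every non-ASCII character with a &#NNN; reference."""
    text = raw.decode("utf-8")
    return text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")


page = "<p>caf\u00e9 \u20ac5</p>".encode("utf-8")
print(utf8_to_entities(page))
```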

Regards,
Rudy Wieser

JJ

2019-11-22 12:10:58 UTC

Post by R.Wieser
JJ,
Nope.

Post by JJ
"Returns all characters from an input stream."

Correct. That would normally mean that the result would be an empty string,
not a "reading past the end of file" error.
Besides, even the error is wrong: If there is nothing to read why is it then
(forcefully) trying to do so anyway?

Post by JJ
It doesn't say "if any".

Please, do *not* go the "they have not explicitly said that ..." road. You
see, the reverse is also true, with the end result being that pretty-much
anything goes, even random results that have nothing to do with the provided
filename.
Heck, it would even include deleting the file or converting its contents
into runes.

Post by JJ
So, if you make it read an empty file, it'll assume that there's data in
the file, but there is none. Thus the triggered exception.

Then explain to me why it can stop exactly at the end of /any/ contents, no
matter how short or long, but is too stupid to detect a string-length of
zero. That simply does not make any sense.
No kiddo, as far as I can tell you're currently just trolling.
Regards,
Rudy Wieser

Oh, I remember...
It depends on whether the source file is a character or block/disk device.
The length of a block device is known, while a character device's is
unknown.
The exception will be thrown when the data length is known and is zero.