Discussion:
Textstream Write
Mayayana
2019-04-29 12:55:57 UTC
Very weird problem. I've been working on an HTA that
allows me to touch-up edit webpages, in WYSIWYG manner.
It starts by loading what's in the BODY into a DIV, then
storing the header and end. When I save changes it takes
the innerHTML from the DIV, adds back the header and end,
and writes to disk with normal Textstream operation.

The whole thing works fine but I was testing it on various
files and came across one (an article from theregister.co.uk
that has no particular odd qualities) that won't work.

I load the page fine but saving changes results in a blank file.
Empty. If I show the 3 parts in msgbox they all look right. If
I write each part to disk, the BODY section is empty but both
ends write OK. When I write the whole file it writes the blank
file and then errors with "invalid procedure call or argument".
That seems to be at Write, though I'm not certain because error
line numbers can be off. The code is in an external VBS.

sMid = MainDiv.innerHTML
sContent = sBeg & sMid & sEnd

Then it calls the Write routine in a class. All very vanilla:

Public Sub WriteFile(sPath, sContent)
    'on error resume next
    If FSOcc.FileExists(sPath) = True Then FSOcc.DeleteFile sPath, True
    Set TScc = FSOcc.CreateTextFile(sPath, True)
    TScc.Write sContent
    TScc.Close
    Set TScc = Nothing
End Sub
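[Editor's note: as the thread goes on to establish, TextStream's ANSI mode raises "invalid procedure call or argument" when the string contains characters outside the system codepage. A Unicode-safe variant can be sketched with ADODB.Stream instead of the FileSystemObject; the routine and names below (WriteFileUtf8, oStm) are illustrative, not code from the thread.]

Public Sub WriteFileUtf8(sPath, sContent)
    'Sketch of a Unicode-safe alternative to WriteFile above, using
    'ADODB.Stream. Note: SaveToFile with charset "utf-8" writes a BOM;
    'use charset "unicode" instead for UTF-16LE output.
    Dim oStm
    Set oStm = CreateObject("ADODB.Stream")
    oStm.Type = 2              'adTypeText
    oStm.Charset = "utf-8"     'encode on save instead of failing
    oStm.Open
    oStm.WriteText sContent
    oStm.SaveToFile sPath, 2   'adSaveCreateOverWrite
    oStm.Close
    Set oStm = Nothing
End Sub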

The only thing I can even imagine causing trouble would
be a null, but I don't see how a null could get into the text.
Even then, the file should write up to the null. And this happens
even if I've made no edits to the page content. (Though IE
does make edits.) I've also tried removing script before loading.
Nothing works. It always comes out a zero length file.

Here's the webpage, for what it's worth:
https://www.theregister.co.uk/2019/04/26/windows_10_storage/

I just picked it randomly, downloading only the HTML.
JJ
2019-04-30 10:45:11 UTC
R.Wieser
2019-04-30 11:21:18 UTC
JJ, Mayayana,
Post by JJ
Interesting problem.
The initial cause is because of the "▼" (code point 0x25BC; black
down-pointing triangle) character which follows the "SHARE" link
With this explanation I realized I bumped into the very same when I used the
"Microsoft.XMLHTTP" (and other, similar ones) to download a webpage and save
the contents.

Although I could display (wscript.echo, MsgBox) the contents with no
problems, there was no way I could write it to a UTF-8 file. The only
solution was to write it in Windows wide-character format. :-(

In other words: Although the object read a multi-byte character webpage with
no problems, it could not be saved as the same. :-(

The "solution" I came up with was to replace all non-ASCII characters with
HTML encodings representing their values:

function WSToHtml(sSource)
    dim sTemp, i, sChar

    sTemp = sSource
    i = 1
    while i <= len(sTemp)   'use <= so the last character is checked too
        sChar = mid(sTemp, i, 1)
        if ascw(sChar) < 0 or ascw(sChar) > 127 then
            sTemp = replace(sTemp, sChar, "&#x" & hex(ascw(sChar)) & ";")
        end if
        i = i + 1
    wend

    WSToHtml = sTemp
end function

(mind the "ascw(sChar)<0" in there)
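[Editor's note: that "ascw(sChar)<0" guard is needed because VBScript's AscW returns a signed 16-bit value, so code points at or above &H8000 come back negative. A small helper can normalize the result; the function name below is illustrative, not from the thread.]

Function CodePointW(sChar)
    'AscW is signed 16-bit in VBScript; normalize to the 0-65535 range
    'so the value matches the actual Unicode code point.
    Dim n
    n = AscW(sChar)
    If n < 0 Then n = n + 65536
    CodePointW = n
End Function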

Hope that helps.

Regards,
Rudy Wieser
Mayayana
2019-04-30 12:51:14 UTC
"Mayayana" <***@invalid.nospam> wrote
| Very weird problem.

Thanks to you both. I've never seen anything like this and
I don't think I ever would have guessed it. You've saved me
a lot of hair pulling. 25 BC, for instance, should just write
as % and 1/4 sign in English ANSI. Notepad will do that. Though
I see in the page code that it's not actually UTF-8. Rather
it's inserted as &#9660;. The bytes in the file are also that. All
ASCII range. Apparently IE is sending the converted version
and Textstream is balking at it. Which implies TS recognizes
UTF-8 but won't handle it! I suspect this may be yet another
child safety feature added by the scrrun authors.

I realize now that some time ago I wrote a conversion script
that loads HTML into IE, sets the document.charset to UTF-8,
then uses ADODB.Stream to convert it to windows-1252. Maybe
I'll play around with that and see what I get.
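[Editor's note: the kind of charset conversion described here can also be approximated with ADODB.Stream alone, no IE involved: read the file as UTF-8 text, write it back as windows-1252. This is a hedged sketch, not the poster's actual script; the file paths are hypothetical, and characters with no windows-1252 equivalent come out as "?".]

Dim oIn, oOut, sText
Set oIn = CreateObject("ADODB.Stream")
oIn.Type = 2                 'adTypeText
oIn.Charset = "utf-8"        'decode the source as UTF-8
oIn.Open
oIn.LoadFromFile "page.html"
sText = oIn.ReadText
oIn.Close

Set oOut = CreateObject("ADODB.Stream")
oOut.Type = 2
oOut.Charset = "windows-1252"          're-encode as ANSI on save
oOut.Open
oOut.WriteText sText
oOut.SaveToFile "page-ansi.html", 2    'adSaveCreateOverWrite
oOut.Close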
R.Wieser
2019-04-30 16:22:10 UTC
Mayayana,
25 BC, for instance, should just write as % and 1/4
sign in English ANSI
Not quite. If you write it that way into a file you will never be able to
display the character the value originally represented. A character in
multi-byte encoding /always/ starts with a byte with the highest bit set.
More set bits follow, depending on the length of the full value.

Regards,
Rudy Wieser
Mayayana
2019-04-30 17:01:57 UTC
"R.Wieser" <***@not.available>

| > 25 BC, for instance, should just write as % and 1/4
| > sign in English ANSI
|
| Not quite. If you write it that way into a file you will never be able to
| display the character the value originally represented.

I understand. What I meant was that all 256 values
correspond to ANSI characters. They should show as
ANSI characters. Not as a downward arrow, but as %
and 1/4. But in this case it's not UTF-8 multi-byte that's
the problem. Rather &#9660; was used and IE is not
sending that through as UTF-8.

When I try to use Asc or AscW or AscB to see what IE
thinks is there, they all fail.

So I think the issue here is not a problem with UTF-8
in Textstream but rather I need to filter out &#[over 255];
before giving it to IE in the first place.
R.Wieser
2019-04-30 19:37:08 UTC
Mayayana,
Post by Mayayana
I understand. What I meant was that all 256 values
correspond to ANSI characters.
:-) But that's the problem: You have /way/ more than 256 possible values in
that string. You're using VBScript, and it internally stores its text in
widechars (two bytes apiece), meaning you've got not (just) 256, but 65536
values on your hands.
Post by Mayayana
When I try to use Asc or AscW or AscB to see what
IE thinks is there, they all fail.
That's odd. Could you show the code you're using to display the Asc values
of your webpage string ? And could you retry with a fully ASCII string
(just to see if it's the code or the string that's causing the problems) ?

And you could also try to pull your string through that function I posted -
just to see if it has got the same problem.
Post by Mayayana
So I think the issue here is not a problem with
UTF-8 in Textstream but rather I need to filter
out &#[over 255]; before giving it to IE in the first place.
Don't forget the under-zero ones. You're working with signed ints. (It's
what bit me in the behind the first time :-) ).

And I'm not quite sure I follow. Wasn't the problem with saving to a
textfile ?

Regards,
Rudy Wieser
Mayayana
2019-04-30 21:06:09 UTC
"R.Wieser" <***@not.available> wrote

| :-) But that's the problem: You have /way/ more than 256 possible values in
| that string. You're using VBScript, and it internally stores its text in
| widechars (two bytes apiece), meaning you've got not (just) 256, but 65536
| values on your hands.
|
That's not relevant. It's only internal. A normal string
is single byte 1-255, whether it's ANSI or UTF-8. The UTF-8
is just interpreted differently.

| > When I try to use Asc or AscW or AscB to see what
| > IE thinks is there, they all fail.
|
| That's odd. Could you show the code you're using to display the Asc values
| of your webpage string ? And could you retry with a fully ASCII string
| (just to see if it's the code or the string that's causing the problems) ?
|
I didn't save the code, but basically it was
just a loop:

for i = 1 to len(s)
    s2 = s2 & CStr(Asc(Mid(s, i, 1)))
next

It works on a normal string but none of them work
if I have &#9660;, give it to IE, then ask for it back
again. IE has apparently gone unicode-16 with it.
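[Editor's note: a character dump that doesn't choke on wide characters can use AscW plus Hex instead of Asc. The loop below is a diagnostic sketch, not code from the thread; `s` is assumed to hold the suspect string retrieved from IE.]

Dim i, n, s2
s2 = ""
For i = 1 To Len(s)
    'AscW handles the full 16-bit range, unlike Asc, which fails on
    'characters outside the current ANSI codepage.
    n = AscW(Mid(s, i, 1))
    If n < 0 Then n = n + 65536    'AscW is signed; normalize to 0-65535
    s2 = s2 & "|" & Hex(n)
Next
MsgBox s2    '&#9660; stored as a character would show up here as |25BC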

| Don't forget the under-zero ones. You're working with signed ints. (It's
| what bit me in the behind the first time :-) ).
|
Negative numbers for &#? I've never seen it.
https://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references

| And I'm not quite sure I follow. Wasn't the problem with saving to a
| textfile ?
|
Yes, but saving it from IE. As you probably know,
when you ask IE for DOM data you get the IE
version. For instance, if you ask for the innerHTML
of a SPAN where you used B and /B you might get
back <STRONG> </STRONG>. IE apparently converts
the HTML to its own object hierarchy when it's loaded
into the DOM.

So the problem here is that a webpage with something
like &#9660; seems to be getting the string converted,
part or all, to unicode. But I can't see what it's doing
because all attempts to look at the bytes fail.

I load a webpage, get the BODY content, put that
into a DIV in the HTA, edit as desired, then ask IE
for the innerHTML of the DIV where I put the BODY
content. I then add back the header and end and write
it all to disk. That one 9660 seems to infect the whole
thing. If I change it to &#160; for a non-breaking space,
it loads and saves just fine.
R.Wieser
2019-05-01 08:01:01 UTC
Mayayana,
Post by Mayayana
That's not relevant. It's only internal
But it's that "only internal" representation that VBS/IE is working with, and
why the ANSI-only output method throws up its hands when you ask it to write
such content to file. ...
Post by Mayayana
for i = 1 to len(s)
s2 = s2 & CStr(Asc(Mid(s, i, 1)))
next
Suggestion: Check what the output of that CStr is in every loop step there.
Provide it the problem char plus a few before and after and see what the
"s2" string becomes.

You might well see that "s2" builds normally up to the problem char, and
then disappears altogether (as far as I remember, adding a null to a string
doesn't work that well). Also try to see what you get when you replace
that CStr( with "|" & hex(
Post by Mayayana
IE has apparently gone unicode-16 with it.
That's what I said: VBS/IE internally uses wide-characters. :-)
Post by Mayayana
Negative numbers for &#? I've never seen it.
No, negative numbers when you AscW a character.
Post by Mayayana
As you probably know, when you ask IE for
DOM data you get the IE version.
No, I didn't. I've not used IE for quite a while now, and have never
bothered to look at its DOM.

But in that case, why don't you try to WriteFile that converted string
/before/ you give it to IE ? You know, trying to eliminate possible problem
points. If you're right the resulting file should be OK. But if that
file then again (still) is empty, then the problem lies elsewhere ...
Post by Mayayana
That one 9660 seems to infect the whole thing.
Or any of the gazillion other multi-byte characters. Like forward and
backward single and double quotes, dashes of different lengths, a triplet of
dots, etc.

I often save, using FireFox, webpages (documentation) to file. Quite a few
of them show, when loaded into an ANSI-only editor, funny groups of symbols
sprinkled throughout the text. Yes, those multi-byte characters. In other
words, delivering webpages as multi-byte (as opposed to ANSI with HTML
entities or encodings for the "special" characters) is not uncommon at all.

Regards,
Rudy Wieser
Mayayana
2019-05-01 12:37:55 UTC
R.Wieser
2019-05-01 14:00:59 UTC
Mayayana,
But I've got a solution now. I just remove any HTML
entities over 255.
And possibly change the meaning of what is on the page ? Just imagine all
kinds of double-quotes disappearing (signifying where people talk about what
others said). :-|
I can't do that because the whole thing is based
on giving it to IE.
I did not mean permanently. Just as a test, so you can tell if the problem
already exists /before/ giving it to IE - and as such rule out IE as being
the culprit (as you presume).
No. You seem to be missing what I've been saying.
IE is turning &#9660; into something else.
What "&#9660;" Please ?

When I save that page you talked about in your first message and look at the
spot JJ indicated I see a 0xE2, 0x96, 0xBC sequence (which definitely looks
like a multi-byte encoded character), and no HTML entity like that (and no
value like 9660 either!) anywhere on the page.

In other words, I have no idea what you are talking about there - and I have
the idea you are rather confused about what happens where or how. Pardon
me the brutish honesty.


But yes, that above three-byte sequence gets converted into a single
widechar (value ranging from 0 to 65535) and stored /for internal use/
by IE.

Part of the problem seems to be that you think that what gets stored in
the DOM is exactly the same as what got loaded as HTML. Why should it
be ? Just to convert the above three-byte sequence or that "&#9660;" HTML
entity into a character every time it needs to redraw the page ? That
would only slow down the whole thing, for no reason whatsoever.

Also, HTML elements are parsed and stored as DOM elements. That also means
that if your HTML element contained superfluous spaces (between its
attributes) they will be gone when you "innerHTML" the DOM.


If you want the editor to save the HTML page exactly as it came from the
other side (forgetting your changes to it for the moment) then you /cannot/
use the DOM to edit it, and you will need to load the HTML page into a buffer
of your own, edit /that/, and only then provide it to IE (to display) -
possibly, if you want to keep the full WYSIWYG effect, on every character
you type/remove.

... I almost wrote "can we get back to the problem please" (finding it and
then determining how to solve it), but you already mentioned you would just
throw any non-ANSI characters away. A bad move though (as I explained
above), but it's not up to me.

Though it's too bad that you didn't take a peek at that VBS function I
posted, as that would replace the problematic chars with HTML entities
(just before writing it to file), thereby keeping the integrity of your
webpage intact. But again, that's not up to me.

Regards,
Rudy Wieser
R.Wieser
2019-05-01 14:12:54 UTC
Post by R.Wieser
In other words, I have no idea what you are talking about there - and I
have the idea you are rather confused about what happens where or how.
Pardon me the brutish honesty.
Ah. I just realized: You are not loading the page itself, but are letting
IE do that for you. Only after that do you take a peek at the DOM. In that
case IE has already done its conversions (like from multi-byte to
wide-character), so you can't save it /before/ giving it to IE (as it's IE
doing the loading).

But that just means that you have to work with whatever IE gives you when
you do your "innerHTML" extraction of your DIV (which still is a perfect
representation of what you saw in it!). Not much choice there, right ?
:-)

But do yourself a favour and /convert/ those non-ANSI characters (instead of
just dropping them). If you don't you /will/ get into problems with pages
that will have gotten a different meaning than the original. (just imagine
removing all end-of-line colons from a few lines of text ....)

Regards,
Rudy Wieser
Mayayana
2019-05-01 14:47:45 UTC
"R.Wieser" <***@not.available> wrote

| Ah. I just realized: You are not loading the page itself, but are letting
| IE do that for you. Only after that do you take a peek at the DOM. In that
| case IE has already done its conversions (like from multi-byte to
| wide-character), so you can't save it /before/ giving it to IE (as it's IE
| doing the loading).
|

Yes.

| But that just means that you have to work with whatever IE gives you when
| you do your "innerHTML" extraction of your DIV (which still is a perfect
| representation of what you saw in it!). Not much choice there, right ?
| :-)
|

Right.

| But do yourself a favour and /convert/ those non-ANSI characters (instead
| of just dropping them). If you don't you /will/ get into problems with
| pages that will have gotten a different meaning than the original. (just
| imagine removing all end-of-line colons from a few lines of text ....)
|
All bytes are ANSI characters. You don't seem to be distinguishing
between UTF-8 bytes and 2-byte unicode. There's no problem loading
and saving with things like curly quotes, though they can show corrupted.
The problem is only with HTML entities over 255. If you look at the
original page code you'll see after SHARE is &#9660;. IE is converting
that to the unicode character (or something more quirky). If "&#9660;"
is removed the page works fine. I'm not worried about losing upside
down triangle characters. :)

But I am actually converting much of the UTF-8, as I mentioned above.
I'm converting curly quotes to ", etc.
R.Wieser
2019-05-01 17:09:35 UTC
Mayayana,
Post by Mayayana
All bytes are ANSI characters.
Sigh .... /which/ "bytes" please ? The ones used by IE, or the ones in
your file ?

For the latter ? Well, you rammed them down a method which does not accept
anything else, so I believe you. As for the former ? You're wrong. As I
already said, IE's DOM uses wide-chars, 16 bits a piece.

Also, you /still/ have no clue what multi-byte characters are (or why IE
converts them to wide-character ones), and seemingly could not care less.
If you had read my explanation, my previous message, or even just
googled, then you would have known that they "map" to the same range as your
ANSI characters, but should be looked at differently. So no, not all bytes
are ANSI characters.
Post by Mayayana
You don't seem to be distinguishing
between UTF-8 bytes and 2-byte unicode.
No, you're fully right. /Of course/ I have zero clue what either of those
is. And that code I posted ? That's doing absolutely nothing. Just ignore
it. As you obviously already did. Also ignore my suggestions of doing a
few simple tests on a short string containing such a problematic character,
cause you know much better what is going on than anyone else, right ?

Currently I ask myself why I even thought it would be a good idea to try to
help you. :-((

Do not bother to respond. This thread will be set to 'ignore'. Go
find someone else.

Regards,
Rudy Wieser
Mayayana
2019-05-04 19:26:47 UTC
I got this PDF to HTML converter polished up, in case anyone
is curious:

https://www.jsware.net/jsware/scrfiles.php5#p2h

It uses the Poppler pdftohtml.exe tool to do an initial
conversion, then cleans up the result. There's also an editor
tool, to do minor edits of the HTML directly in a browser
window. The two together are working nicely to convert
text from PDF to a webpage with a left-side index in which
one can easily change font, paragraph width, etc.

Thanks to JJ for solving the strange puzzle of the unicode
booby trap.
JJ
2019-05-05 13:15:09 UTC
Post by Mayayana
Thanks to JJ for solving the strange puzzle of the unicode
booby trap.
The problem is still unexplained, tho. :(

And that problem also means that with Unicode stream mode, it's still
impossible to write a UCS2-encoded file containing those unwritable
characters.
Mayayana
2019-05-05 14:19:04 UTC
"JJ" <***@vfemail.net> wrote

| The problem is still unexplained, tho. :(
|
| And that problem also means that with Unicode stream mode, it's still
| impossible to write a UCS2-encoded file containing those unwritable
| characters.

It is an intriguing problem. Since IE converts the
actual text to its own object model, I'm guessing
the actual format of the data it sends is a hybrid
and not just plain text. Slightly analogous to RTF.
If you copy RTF or HTML to the clipboard and
paste into Notepad, you'll get only the text part.
Maybe IE is designed with the assumption that
when you ask for DIV1.innerHTML you intend to
use it within the context of the DOM.... Just a
guess. Since it's not a legitimate string I don't
know how to inspect it. It does return 8 (string)
when tested with VarType. In a test I got 56 for Len
and 112 for LenB. That seems to indicate unicode.
Maybe it could be handled in VB. I've written VBS to
convert unicode to ANSI, but it's quirky. (It's very
handy, though. I discovered that Windows has a lot
of intelligence built into the conversion. For instance,
if I convert Sanskrit S with what looks like an accent
I get S in ANSI. Nice. For an English speaker all those
technical marks are just noise, anyway. So I can
use my script to convert academic Sanskrit to popular
English book version of Sanskrit easily.)
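[Editor's note: the 56/112 figures above fit how VBScript stores strings. Len counts characters and LenB counts bytes of the internal UTF-16 buffer, so LenB is always twice Len, whatever the characters are. A minimal illustration:]

Dim s
s = "abc" & ChrW(9660)    'three ASCII chars plus the black triangle
MsgBox Len(s)             'shows 4: Len counts characters
MsgBox LenB(s)            'shows 8: LenB counts internal bytes (2 per char),
                          'the same 2:1 ratio as the 56/112 result above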

In any case, IE's quirk is not a problem for my purposes,
now that I know about it. I don't want unicode files. I
also don't have a problem with dropping out special
characters like an upside down triangle. I'm already
converting common UTF-8 characters like curly quotes,
non-breaking spaces, funky dashes, and o with umlaut
to ANSI equivalents. It's too much trouble to be switching
between encodings and it's completely unnecessary in
English.

That's the first I've heard of UCS-2. I had to look it up.
It appears to be an outdated term. According to Wikipedia,
Windows unicode was derived from UCS-2 but is not UCS-2.
That gets confusing. It used to be that there was ANSI and
unicode in the Windows world. Now with the popularity
of UTF-8 there are lots of unicodes. But it seems to be
safe to refer to Windows unicode as unicode-16.
