Discussion:
Why does a loop using instr() and mid() get slower on larger data ?
R.Wieser
2021-01-02 11:41:16 UTC
Hello all,

I'm using InStr() and Mid() in a loop on large data strings. What I've
noticed is that the larger the datastring, the slower the process seems to
become. The time difference between results being returned for a few KByte
and even 5 MByte of data is rather noticeable (and I've got a 25 MByte
file...).

Does anyone know what is causing it (and perhaps how to fix it) ?

Regards,
Rudy Wieser
Mayayana
2021-01-02 13:39:23 UTC
"R.Wieser" <***@not.available> wrote
|
| I'm using InStr() and Mid() in a loop on large data strings. What I've
| noticed is that the larger the datastring, the slower the process seems to
| become. The time difference between results being returned for a few KByte
| and even 5 MByte of data is rather noticeable (and I've got a 25 MByte
| file...).
|

For concatenation or adding to a string:

Memory allocation. When you add a character to a 25 MB
string, new memory of 25 MB + 1 character has to be allocated
and all of the old contents copied into it.
According to Matthew Curland, VB6 has a string cache to
work from, up to a point. After that point it gets slow.
Presumably WScript and other programs also set aside a minor
cache, but I don't know about that.
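
To make that concrete, the slow pattern is the plain
concatenation loop (a minimal sketch, not anyone's actual code):

s = ""
For i = 1 To 100000
    ' each pass allocates a new, slightly larger string and
    ' copies all of the old characters into it
    s = s & "x"
Next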

People end up doing things like writing string builder classes
to get around the problem. The fastest solution I know of in
VB is to allocate a string bigger than might be needed, then
point an array at it so that it can be treated as an array for
reading. Using the Mid *statement* then allows writing to
the string pretty much as fast as CopyMemory.

VBS doesn't have the Mid statement, only the function.
But it is extremely fast to do Join. So whenever I need to
build a string I'll do it that way:

Dim A(10000)   ' pre-sized array; element 0 just stays empty
For i = 1 To 10000
    s = "something" & i & "something else"
    A(i) = s
Next

s = Join(A, "")   ' one big allocation instead of thousands

For InStr:

If you see a lag just reading
a string or doing InStr then I don't know the problem,
assuming that you're not counting the time to allocate the
string. VBS is just generally slow. But I do find that searching
case-insensitive is much slower. So much so that if you have
a lot to do it's worth allocating a second string s2 = UCase(s)
and then searching that: InStr(pt1, s2, "SOMETHING", 0)

I just tried loading a 100 MB text file and looking for a
string near the end. It took .14 seconds. I'd expect a few
ms in VB, but VBS does a lot of clunky wrapping. Everything
has to be converted to/from variants. Direct memory access
is wrapped in layers. I just
assume anything big will be slow. On the other hand, if
you need to deal with 25 MB files enough to matter then
maybe something else is wrong. No *typical* text file should
need to be that big.

I have found a big difference with my server logs. I parse
server logs to extract all entries with a particular
date, so that I can create one file for each day. That's thousands
of InStr calls for each day. With a 5 MB file I can extract
and write files for 5 days in about 1-2 seconds. With a 300
MB server log I have to leave it to work. Probably more than
a minute for each day. When the logs get big I use InStrRev,
but it's still very slow. It's possible WScript is not keeping
that string in RAM. I don't know.
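
In outline, one way to do that kind of per-day extraction (the
date stamp and names are made up, and my real script differs):

datestamp = "02/Jan/2021"
logLines = Split(logText, vbCrLf)
ReDim hits(UBound(logLines))
n = 0
For i = 0 To UBound(logLines)
    ' 0 = binary compare; the stamp's casing is fixed anyway
    If InStr(1, logLines(i), datestamp, 0) > 0 Then
        hits(n) = logLines(i)
        n = n + 1
    End If
Next
If n > 0 Then
    ReDim Preserve hits(n - 1)
    dayText = Join(hits, vbCrLf)   ' build the day's output in one shot
End If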
R.Wieser
2021-01-02 15:46:03 UTC
Mayayana,
I know that that can get rather slow quickly. But I'm not adding anything to
the original data, just using InStr() to find something and Mid() to copy a
small part following it (and then work on it).
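
In outline it is no more than this (tag and length made up):

pos = 1
Do
    pt = InStr(pos, data, "<node")
    If pt = 0 Then Exit Do
    chunk = Mid(data, pt, 200)   ' copy a small part following the hit
    ' ... work on chunk ...
    pos = pt + 1
Loop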
Post by Mayayana
VBS doesn't have the Mid statement, only the function.
:-) Yup. Nice, ain't it ? You have to grab-and-store the (huge) strings
before and after what you want to replace and then glue them together with
the new string in between.
Post by Mayayana
If you see a lag just reading
That's what I have, just while reading.

And the odd thing is that just finding a string in the first few lines slows
down depending on the size of the to-be-searched string. I have no
explanation for that, other than that it does /something/ involving the size
of the string. I would not know what though, as variant strings are
size-prefixed (they can have embedded NULs).
Post by Mayayana
But I do find that searching case-insensitive is much slower.
That's to be expected. Thanks for reminding me though. I don't think that
the file contents are case-sensitive, so I could just LCase() the lot of it
and have some speed gain.
Post by Mayayana
I just tried loading a 100 MB text file and looking
for a string near the end. It took .14 seconds
:-) Too many variables. Loading and searching together ? ReadLine or
ReadAll ?
Post by Mayayana
On the other hand, if you need to deal with 25 MB files
enough to matter then maybe something else is wrong.
No *typical* text file should need to be that big.
It definitely is a typical textfile ... for an OpenStreetMap .OSM datafile
that is. :-)

I'm trying to convert such a dataset into a usable image. The first step
was to go thru all the "node" elements and plot their coordinates.

Just to check I rewrote the script to read the file line-by-line and see
what would happen. The end result is a few times faster than reading all of
it into memory and then having a go at it. I did not quite expect that ...

Regards,
Rudy Wieser
Mayayana
2021-01-02 20:26:36 UTC
"R.Wieser" <***@not.available> wrote

| > I just tried loading a 100 MB text file and looking
| > for a string near the end. It took .14 seconds
|
| :-) Too many variables. Loading and searching together ? ReadLine or
| ReadAll ?
|

No, dummy. :) I loaded the whole file as a string.
Then I put a Timer call on either side of InStr, so that
I'd just be testing the actual InStr time.
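
In other words, roughly (names made up):

s = TS.ReadAll              ' TS being the TextStream; load outside the timing
t1 = Timer
pt = InStr(1, s, "needle", 0)
t2 = Timer
WScript.Echo pt & "  " & (t2 - t1) & " sec"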

| I'm trying to convert such a dataset into a usable image. The first step
| was to go thru all the "node" elements and plot their coordinates.
|
| Just to check I rewrote the script to read the file line-by-line and see
| what would happen. The end result is a few times faster than reading all of
| it into memory and then having a go at it. I did not quite expect that ...
|

That is surprising. I rarely read by line. But I
rarely deal with big files.

Open Street Map... I
don't suppose you've done anything with weather?
The US service, NOAA, made a big mess of their
radar maps in trying to update them. I've been puttering
around with options. I can easily get a forecast in JSON,
but can't find a source of simple radar maps that show
where the rain is right now, and how hard it's coming
down.
R.Wieser
2021-01-02 21:11:19 UTC
Mayayana,
Post by Mayayana
No, dummy. :)
Hey ! Watch who you're calling dummy, dummy. :-)
Post by Mayayana
I loaded the whole file as a string.
Then I put a Timer call on either side of InStr,
so that I'd just be testing the actual InStr time.
OK, understood.

And I noticed you mentioned 0.14 seconds on a 100MB file (I first thought
you said 14 sec). Here a repeated search on just a 5MB file gave me
between 2 and 5 results a second, or between 200 and 500 mSec. I would
have expected that with a much larger datastring it would become a multiple
of that.
Post by Mayayana
I don't suppose you've done anything with weather?
Other than viewing a rain probability image or two ? No, not really. Sorry.

Regards,
Rudy Wieser
Mayayana
2021-01-02 22:27:47 UTC
"R.Wieser" <***@not.available> wrote

| And I noticed you mentioned 0.14 seconds on a 100MB file (I first thought
| you said 14 sec). Here a repeated search on just a 5MB file gave me
| between 2 and 5 results a second, or between 200 and 500 mSec. I would
| have expected that with a much larger datastring it would become a multiple
| of that.
|

I don't know. InStr is incredibly fast. I suspect the
wrapper bloat is much of the time. You have to tell
WScript to do it. WScript has to run the operation.
The string must be converted to BSTR. It all adds up.
And I suppose our systems may vary.

I just tried loading the Bible. 4.29 MB of plain text.
I added "wordyyy" near the end and looked for that.
Position: 4,497,580 Time: 0
It didn't register as being enough time to measure.
But if you're doing multiple searches that introduces lots of other
variables.

So I tried this loop with the same Bible text:

pt2 = 1
pt1 = 1

x = 0
t1 = Timer
Do Until pt1 = 0
    pt1 = InStr(pt2, s1, "John", 0)   ' binary compare
    pt2 = pt1 + 1
    x = x + 1
Loop

t2 = Timer
' x overshoots by 1: the final, failed search also bumps it
MsgBox x - 1 & vbCrLf & (t2 - t1)

It returns 140 instances, found in 0 seconds. Interesting.
So once all the overhead slop is dealt with, the looping
code seems to take almost no time at all.
R.Wieser
2021-01-03 07:58:33 UTC
Mayayana,
Post by Mayayana
I don't know. InStr is incredibly fast. I suspect the
wrapper bloat is much of the time.
Yesterday, of course just after I called it a night and switched my 'puter
off, I got a brainfart. Just now I tested it.

You mentioned caseless searching, and that is what I have been doing (and
forgot to mention, sorry). Changing that last InStr argument back to
case-sensitive searching sped the whole process up considerably.

What I think happens is that InStr() first converts both strings to
either upper or lower case, and only then does the search. While that
isn't a problem for relatively short strings, the conversion does take some
time when you throw 25 MB at it ...

... especially when you do that in a loop which finds over 40,000 matches.

Hence reading the file line-by-line and then doing the caseless InStr()
sped the process up.
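
Which suggests the obvious workaround: pay the conversion once,
up front, instead of on every call. A sketch (names made up):

' slow: with vbTextCompare (1), every call apparently pays a
' case-folding cost that grows with the haystack
pt = InStr(pos, data, "<node", 1)

' faster: fold once, then use binary compare (0)
lcData = LCase(data)
pt = InStr(pos, lcData, "<node", 0)   ' keep the needle lowercase to match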

The whole process using the 25 MB string still takes several seconds to
complete though. Oh well.

Regards,
Rudy Wieser
Mayayana
2021-01-03 13:54:19 UTC
"R.Wieser" <***@not.available> wrote

| The whole process using the 25 MB string still takes several seconds to
| complete though. Oh well.
|

Did you try a UCase or LCase first and then
ReadAll? I don't see the point of reading by line
unless you expect to be quitting near the
beginning.

I don't know how the caseless search works internally. It wouldn't be
difficult to just search for "65 or 97" (A or a).
But either way it's a more complex search.

I imagine it might also depend on the length
of the search string and how many similarities
there are. If you have 25 MB of "orange is the
new black" and one instance of "orange is the
new green" near the end, I would expect all those
near matches to bog things down. But I've never
actually tested it.
R.Wieser
2021-01-03 15:01:10 UTC
Mayayana,
Post by Mayayana
| The whole process using the 25 MB string still takes several
| seconds to complete though. Oh well.
Did you try a UCase or LCase first and then ReadAll?
Whut ? How do you suppose I would be able to do that ? I would think I
would need to read the data before being able to act upon it ....
Post by Mayayana
I don't see the point of reading by line
The drawback of reading everything into memory and doing an initial LCase or
UCase is that you have to assume that none of the file's data is in a
mixed-case form that's important to the user. In my case the OSM file
contains names of streets, etc. Those I would like to keep the casing of.
Post by Mayayana
I imagine it might also depend on the length
of the search string and how many similarities
there are.
Of course! And as such not something I would think twice about.

Regards,
Rudy Wieser
Mayayana
2021-01-03 17:51:38 UTC
"R.Wieser" <***@not.available> wrote

| > Did you try a UCase or LCase first and then ReadAll?
|
| Whut ? How do you suppose I would be able to do that ? I would think I
| would need to read the data before being able to act upon it ....
|

I just meant using both of those methods as one
option to test.
s = TS.ReadAll
s = UCase(s)
pt1 = InStr(s, "FINDSTRING")   ' needle must be uppercase too, or it can never match

| > I don't see the point of reading by line
|
| The drawback of reading everything into memory and doing an initial LCase or
| UCase is that you have to assume that none of the file's data is in a
| mixed-case form that's important to the user. In my case the OSM file
| contains names of streets, etc. Those I would like to keep the casing of.
|

I've run into that before. In that case I just
have two:

s2 = UCase(s)
pt1 = InStr(s2, "MAPLE ST.")
If pt1 > 0 Then x = Mid(s, pt1, 9)   ' extract from the original s, so casing is preserved

The speed difference is worth it. I've written a lot of
scripts that do intensive work, such as converting
an HXS to CHM and heavily editing every HTML file
before doing the CHM compile.
I've found the two things that make a very notable
difference in speed are concatenating with arrays
and case-sensitive search. When I'm doing it for 100
files, and rebuilding each file, that makes a big difference.
R.Wieser
2021-01-03 18:26:04 UTC
Mayayana,
Post by Mayayana
I just meant using both of those methods as one
option to test.
s = TS.ReadAll
s = UCase(s)
I would do the same. It's just that it looked like you suggested the
reverse order ...
Post by Mayayana
I've run into that before. In that case I just
I was also considering using that. Didn't like to have a duplicate of the
same, rather large string though. But that may still be my "640 KB is
enough for everyone" memories talking. Programming (small)
microcontrollers might also have something to do with it. :-)
Post by Mayayana
The speed difference is worth it.
I will most likely try it out, if only to see the difference.
Post by Mayayana
I've found the two things that make a very notable
difference in speed are concatenating with arrays
and case-sensitive search
:-) I've just run into the latter, but can't say I've ever used the former.

Thanks.

Regards,
Rudy Wieser

Mayayana
2021-01-02 22:43:01 UTC
One other note: You probably know this, but
time measurement is not so accurate at very low levels.
In my experience the VBScript Timer is pretty good. But
if you're going to measure small things you need to
take multiple readings.
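
E.g., loop the call and divide (a rough sketch):

t1 = Timer
For run = 1 To 100
    pt = InStr(1, s, "wordyyy", 0)
Next
t2 = Timer
WScript.Echo "avg per call: " & ((t2 - t1) / 100) & " sec"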

To mix things up even more, I just tried loading my
100 MB file, which is a list of IP addresses and various
other numbers. I did the same loop for "200". It found
641105 instances in 1.109375 seconds! But the whole
operation, including loading the file, was more like 5
or 6 seconds. Nevertheless, the actual InStr loops
seem to be something like 1.7 microseconds each, and
that includes reassigning the value of pt2, as well as
adding 1 to x. Wowza!
JJ
2021-01-02 19:47:18 UTC
Post by R.Wieser
Hello all,
I'm using InStr() and Mid() in a loop on large data strings. What I've
noticed is that the larger the datastring, the slower the process seems to
become. The time difference between results being returned for a few KByte
and even 5 MByte of data is rather noticeable (and I've got a 25 MByte
file...).
Does anyone know what is causing it (and perhaps how to fix it) ?
Regards,
Rudy Wieser
Because InStr() is basically a loop which enumerates the characters within
the source string until the matching substring is found.

The process will be fast if the matching substring happens to be at the
start or near the start of the source string. Otherwise, it can be a slow
process depending on the source string length.

If the contents of the source string are plain text, string indexing may be
used to speed up the substring search. But it won't be noticeably beneficial
if the substring search needs to be done fewer than 3 times (for each source
string), considering that the string index generation requires processing
time also.
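
For example, one way to read "string indexing" (names made up):
scan the source once, record where each anchor occurs, and let
later searches walk the short position list instead of the string.

Dim nodePos()
ReDim nodePos(1023)
cnt = 0
pt = InStr(1, src, "<node", 0)
Do While pt > 0
    If cnt > UBound(nodePos) Then
        ReDim Preserve nodePos(UBound(nodePos) * 2 + 1)   ' grow geometrically
    End If
    nodePos(cnt) = pt
    cnt = cnt + 1
    pt = InStr(pt + 1, src, "<node", 0)
Loop
If cnt > 0 Then ReDim Preserve nodePos(cnt - 1)
' subsequent lookups iterate nodePos() instead of re-scanning src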
R.Wieser
2021-01-02 20:47:55 UTC
JJ,
Post by JJ
The process will be fast if the matching substring happens to
be at the start or near the start of the source string.
:-) I'm aware of that.

I ran the same InStr() in a loop over a small and a large datastring, and I
got (lots of) results back at different speeds - for the same data (the
larger datastring just contained more of it).

Regards,
Rudy Wieser