Post by Rod on Apr 23, 2021 18:46:25 GMT
Given that this is a directory dump from Windows, there should be NO duplicate files. I did not want to raise that complication just yet, as I see from the real data that there are indeed duplicate file names.
f1$(16)="%filename%.accurip"
f1$(17)="%filename%.accurip"
f1$(18)="%filename%.accurip"
These should not really exist. I was hoping testing would bring them to the fore. I am assuming that we have not run multiple directories together in one file list.
Post by tsh73 on Apr 23, 2021 19:37:09 GMT
As I understand it, this list is a compilation from DIFFERENT nested folders, so duplication is possible.
Post by toughdiamond on Apr 23, 2021 20:10:53 GMT
It is indeed common for real hard drives to contain files of the same name that reside in different folders. Your example looks to me as if the code has given the correct result - by adding a dot to one of a number of files called "C", there is now one unique file called "C." in one list and one unique file called "C" in the other. I guess the only thing the program doesn't tell us is the path of the unique file, which would be needed - though I think I could put that right without a lot of trouble. I've already got the paths in a 3rd column of my arrays, so it would be just a matter of adding that to the code that prints the name of the unique file. Have I understood your point?
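Something like this, I think (an untested sketch with made-up data - the assumption is that my arrays hold the filename in column 1 and the path in column 3):
'untested sketch: one made-up record, laid out the way I think my arrays are
dim f1$(1, 3)
i1 = 1
f1$(i1, 1) = "DSCF1130.JPG"        'filename
f1$(i1, 3) = "\Photos\2021\April"  'its folder path
'the ladder's "unique" print line would then become something like:
print "Unique to 1 "; f1$(i1, 1); "  (in "; f1$(i1, 3); ")"
end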
Meanwhile, I've whittled down the 16,000-filename list to a more manageable 36 per list, which still causes problems for the uniques detector.
Here's list 1:
1oice only.docx DSCF1108.JPG DSCF1109.JPG DSCF1110.JPG DSCF1111.JPG DSCF1112.JPG DSCF1115.jpg DSCF1115x.jpg DSCF1116.jpg DSCF1116x.jpg DSCF1119.jpg DSCF1119x.jpg DSCF1121.JPG DSCF1121x.jpg DSCF1122.JPG DSCF1123.JPG DSCF1124.JPG DSCF1125.jpg DSCF1125x.jpg DSCF1126.jpg DSCF1126x.jpg DSCF1127.jpg DSCF1127x.jpg DSCF1130.JPG DSCF1130.jpg DSCF1131.jpg DSCF1131.JPG DSCF1135.JPG DSCF1135.jpg DSCF1138.jpg DSCF1138.JPG DSCF1141.JPG DSCF1141.jpg DSCF1145.jpg DSCF1145.JPG VoiceAmerPsy2017.pdf
And here's list 2:
DSCF1108.JPG DSCF1109.JPG DSCF1110.JPG DSCF1111.JPG DSCF1112.JPG DSCF1115.jpg DSCF1115x.jpg DSCF1116.jpg DSCF1116x.jpg DSCF1119.jpg DSCF1119x.jpg DSCF1121.JPG DSCF1121x.jpg DSCF1122.JPG DSCF1123.JPG DSCF1124.JPG DSCF1125.jpg DSCF1125x.jpg DSCF1126.jpg DSCF1126x.jpg DSCF1127.jpg DSCF1127x.jpg DSCF1130.jpg DSCF1130.JPG DSCF1131.jpg DSCF1131.JPG DSCF1135.jpg DSCF1135.JPG DSCF1138.jpg DSCF1138.JPG DSCF1141.jpg DSCF1141.JPG DSCF1145.JPG DSCF1145.jpg voice only.docx VoiceAmerPsy2017.pdf
I made just the one change - I altered "voice only.docx" to "1oice only.docx" in the first list. In my hands, not only does it give false positives, it also appears to send the program into an infinite loop. That's a rare event as judged by the 20 or so tests I've done on various directory dumps, but it does happen. Most of the tests showed false positives, some came out perfect.
Anyway, see if you can reproduce the fault. That will tell us a lot.
Post by toughdiamond on Apr 23, 2021 20:17:04 GMT
Ah, we cross-posted. As luck would have it, I don't see any duplicated filenames in those lists I've just posted - there are a few that are the same if case is ignored, but case isn't being ignored. So we can worry about duplicated names later.
Post by toughdiamond on Apr 23, 2021 20:30:25 GMT
As I understand it, this list is a compilation from DIFFERENT nested folders, so duplication is possible.

Yes, the original 16,000-plus directory dump is actually the entire contents of my computer's internal data partition, which contains lots of nested folders and probably lots of duplicated filenames - most of them will be different files with the same name, and some may be identical files. That's why I eventually want to add the option of including the dates and sizes in the uniques-finding code, though with a bit of luck that won't be hard. I had that feature in my original program (the one that took hours to run) and it seemed to work. The directory dumps created by the batch file initially include the dates and sizes on the same line as the filename, so it was just a matter of not removing them during the parsing process.
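Just to make the idea concrete for myself (an untested sketch - the three fields and their values are only placeholders for whatever my parsing step produces), including the date and size would simply mean sorting and comparing on a combined key rather than on the bare filename:
'untested sketch: placeholders standing in for fields split out of one dump line
name$ = "DSCF1130.JPG"
size$ = "1234567"
date$ = "23/04/2021 18:46"
key$ = name$ + "|" + size$ + "|" + date$    'the string the ladder would sort and compare
print key$
end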
Post by Rod on Apr 23, 2021 20:50:37 GMT
Right, a little more clarity. It is the dictionary-style sorting that is causing the problem. The ladder needs the records to be in strict order, and with the mixed case we are getting the sort order slightly wrong.
There are probably two parts to the solution. If you are doing your own sort, make it an ASCII sort. If you use the Just BASIC sort, make the comparison case insensitive by forcing a lower-case compare. Either way the records should fall into a consistent order, or else we simply ignore case.
Try this.
dim f1$(37)
dim f2$(37)

f1$(1)="1oice only.docx"
f1$(2)="DSCF1108.JPG"
f1$(3)="DSCF1109.JPG"
f1$(4)="DSCF1110.JPG"
f1$(5)="DSCF1111.JPG"
f1$(6)="DSCF1112.JPG"
f1$(7)="DSCF1115.jpg"
f1$(8)="DSCF1115x.jpg"
f1$(9)="DSCF1116.jpg"
f1$(10)="DSCF1116x.jpg"
f1$(11)="DSCF1119.jpg"
f1$(12)="DSCF1119x.jpg"
f1$(13)="DSCF1121.JPG"
f1$(14)="DSCF1121x.jpg"
f1$(15)="DSCF1122.JPG"
f1$(16)="DSCF1123.JPG"
f1$(17)="DSCF1124.JPG"
f1$(18)="DSCF1125.jpg"
f1$(19)="DSCF1125x.jpg"
f1$(20)="DSCF1126.jpg"
f1$(21)="DSCF1126x.jpg"
f1$(22)="DSCF1127.jpg"
f1$(23)="DSCF1127x.jpg"
f1$(24)="DSCF1130.JPG"
f1$(25)="DSCF1130.jpg"
f1$(26)="DSCF1131.jpg"
f1$(27)="DSCF1131.JPG"
f1$(28)="DSCF1135.JPG"
f1$(29)="DSCF1135.jpg"
f1$(30)="DSCF1138.jpg"
f1$(31)="DSCF1138.JPG"
f1$(32)="DSCF1141.JPG"
f1$(33)="DSCF1141.jpg"
f1$(34)="DSCF1145.jpg"
f1$(35)="DSCF1145.JPG"
f1$(36)="VoiceAmerPsy2017.pdf"

f2$(1)="DSCF1108.JPG"
f2$(2)="DSCF1109.JPG"
f2$(3)="DSCF1110.JPG"
f2$(4)="DSCF1111.JPG"
f2$(5)="DSCF1112.JPG"
f2$(6)="DSCF1115.jpg"
f2$(7)="DSCF1115x.jpg"
f2$(8)="DSCF1116.jpg"
f2$(9)="DSCF1116x.jpg"
f2$(10)="DSCF1119.jpg"
f2$(11)="DSCF1119x.jpg"
f2$(12)="DSCF1121.JPG"
f2$(13)="DSCF1121x.jpg"
f2$(14)="DSCF1122.JPG"
f2$(15)="DSCF1123.JPG"
f2$(16)="DSCF1124.JPG"
f2$(17)="DSCF1125.jpg"
f2$(18)="DSCF1125x.jpg"
f2$(19)="DSCF1126.jpg"
f2$(20)="DSCF1126x.jpg"
f2$(21)="DSCF1127.jpg"
f2$(22)="DSCF1127x.jpg"
f2$(23)="DSCF1130.jpg"
f2$(24)="DSCF1130.JPG"
f2$(25)="DSCF1131.jpg"
f2$(26)="DSCF1131.JPG"
f2$(27)="DSCF1135.jpg"
f2$(28)="DSCF1135.JPG"
f2$(29)="DSCF1138.jpg"
f2$(30)="DSCF1138.JPG"
f2$(31)="DSCF1141.jpg"
f2$(32)="DSCF1141.JPG"
f2$(33)="DSCF1145.JPG"
f2$(34)="DSCF1145.jpg"
f2$(35)="voice only.docx"
f2$(36)="VoiceAmerPsy2017.pdf"

sort f1$(,1,37
sort f2$(,1,37

max1=36    'number of records in array 1
max2=36    'number of records in array 2
i1=1       'record index for file 1
i2=1       'record index for file 2

while i1<max1 and i2<max2
    while lower$(f1$(i1))<lower$(f2$(i2)) and i1<=max1
        print "Unique to 1 ";f1$(i1)
        i1=i1+1
    wend
    while lower$(f1$(i1))=lower$(f2$(i2))
        i1=i1+1
        i2=i2+1
        'did we reach the end
        if i1>max1 or i2>max2 then exit while
    wend
    if i1>max1 or i2>max2 then exit while
    while lower$(f1$(i1))>lower$(f2$(i2)) and i2<=max2
        print "Unique to 2 ";f2$(i2)
        i2=i2+1
    wend
wend

while i1<=max1
    print "Unique to 1 ";f1$(i1)
    i1=i1+1
wend

while i2<=max2
    print "Unique to 2 ";f2$(i2)
    i2=i2+1
wend

end
Notice the additional array item, the sorting of that item and the forcing of lower case during all comparisons.
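A quick way to see what the lower$() is doing (just an illustrative snippet, using two of the names from your list):
'case-sensitive compare: "DSCF1130.JPG" and "DSCF1130.jpg" are different strings,
'and the dictionary sort can leave them in a different relative order in each list.
'case-folded compare: the ladder treats them as the same record.
if lower$("DSCF1130.JPG") = lower$("DSCF1130.jpg") then print "equal once case is folded"
end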
Post by Rod on Apr 23, 2021 21:12:51 GMT
To be clear on duplicates, my code simply matches them and discounts them. Only if there is an extra one in either list will it get listed as an exception. The original task asked for unique files.
Post by toughdiamond on Apr 23, 2021 23:17:43 GMT
To be clear on duplicates, my code simply matches them and discounts them. Only if there is an extra one in either list will it get listed as an exception. The original task asked for unique files.

Sure, and nobody could have anticipated the problem (if there turns out to be one) of multiple files of the same name in different folders on the same volume when we started out. My original request was for a way of speeding up my original program, and it became clear that the best way of doing that is to switch to the ladder method. Happy to confirm that your modification works, at least on the lists I posted. I use Just BASIC to do the sort - what I'm not clear about is whether your inclusion of lower$ in the new code means that forcing lower case elsewhere is still required. I would have thought your revision would be enough to nail that, but I thought I'd better ask. I'll run some more tests, and with a bit of luck it'll work a lot better now.

As for the problem of multiple instances of the same filename, I think the solution might be for me to simply include the pathnames with the filenames, which isn't hard to do. Otherwise we're asking the code to discriminate between identical strings (the plain filenames), which is impossible. So detecting strings unique to one or other volume should be all that the ladder is required to do, and with a bit of luck we're already there. It'll be quite an achievement if it works, because very few duplicate finders bother with uniques, and the ones I've seen that do are rather slow and give false positives.
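In other words (just a sketch of the idea with made-up names, not my actual parsing code), each record fed to the sort and the ladder would become the path plus the filename rather than the bare filename:
'untested sketch: the two literals stand in for a parsed folder path and filename
path$ = "\Photos\2021\April"
name$ = "DSCF1130.JPG"
record$ = path$ + "\" + name$    'this combined string is what gets sorted and compared
print record$
end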
Post by toughdiamond on Apr 24, 2021 7:24:24 GMT
Well, I've tried the new code on the 16,057-filename volume, with a few altered names and a few deleted ones, and so far it's worked fine. I tried several of the scenarios that were causing those false positives yesterday, and they all sailed through perfectly. Thanks for your help, folks, especially Rod of course. If it passes a few more tests, which I think it will, I'll take a look at this "same filename in different folders" notion, and with a bit of luck that'll work out well too.
Post by toughdiamond on Apr 29, 2021 21:01:46 GMT
Tested on volumes containing 58,143 files, which is the biggest collection I have so far. Renamed some, deleted others, unique-detection worked perfectly every time. I couldn't be happier with the results. Very grateful for the help.
So, "all" that remains is to try running it on paths+filenames instead of just paths, to see whether that will show which one of several files of the same name is missing. That and a bit of tidying up my own part of the program.
Post by carlgundel on Apr 30, 2021 16:43:01 GMT
Tested on volumes containing 58,143 files, which is the biggest collection I have so far. Renamed some, deleted others, unique-detection worked perfectly every time. I couldn't be happier with the results. Very grateful for the help. So, "all" that remains is to try running it on paths+filenames instead of just filenames, to see whether that will show which one of several files of the same name is missing. That and a bit of tidying up my own part of the program.

You managed to make it work? That's great. What sort of performance improvement do you see over your original code? -Carl
Post by toughdiamond on Apr 30, 2021 17:46:32 GMT
You managed to make it work? That's great. What sort of performance improvement do you see over your original code? -Carl

Well, most of the credit is due to the other folks here; I just adapted their "ladder" code to fit my program. It's hard to give a figure for the performance improvement, as the time my original program took grew roughly with the square of the number of files on the drives being analysed, whereas the new version should scale close to linearly. For that large volume I just tested, the new program took maybe a couple of minutes (most of that time was spent loading the data), which is more than acceptable. I wouldn't dare try my old program on such a large volume - it would take many hours, and the CPU would be running pretty hot, which isn't good for the life expectancy of the computer.

There's some doubt about how big a volume could be processed before JB ran out of RAM. The exact limit seems imponderable because it depends on the number of files on the volume (the smaller the files are, the more of them can be on there) and on the lengths of the filenames (and pathnames, when I add those to the program). But I have high hopes that all will be well on pretty much any volume up to at least 2TB.
Post by Rod on Apr 30, 2021 18:33:54 GMT
60k files averaging, say, 256 bytes of path/filename is only about 15 MB, well within Just BASIC's 256 MB memory limit. Even then the job could be broken into chunks.
Three minutes vs three hours is pretty stunning.
I know this because in olden times I used to run an overnight update that finished somewhere between mid-morning and lunchtime. Folks were stunned one day, arriving at work to find the update was already done and dusted! Yep, I had gone and read a book instead of inventing the process myself.
Post by carlgundel on May 1, 2021 3:37:17 GMT
60k files averaging, say, 256 bytes of path/filename is only about 15 MB, well within Just BASIC's 256 MB memory limit. Even then the job could be broken into chunks. Three minutes vs three hours is pretty stunning. I know this because in olden times I used to run an overnight update that finished somewhere between mid-morning and lunchtime. Folks were stunned one day, arriving at work to find the update was already done and dusted! Yep, I had gone and read a book instead of inventing the process myself.

Three mins vs. three hours is a 60x speed improvement. Would anyone like to hazard a short description comparing the two approaches to implementation?
Post by tsh73 on May 1, 2021 6:29:52 GMT
How about going from an O(n^2) comparison of the unordered lists to an O(n log n) sort plus an O(n) comparison of the sorted arrays?
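Roughly speaking (this is only a toy sketch, not either of the actual programs), the old approach is a nested loop - every name in list 1 checked against every name in list 2 - while the new one sorts both lists once and then walks them in step, as in Rod's ladder code above.
'toy sketch of the old O(n^2) idea, with made-up three-item lists
dim a$(3)
dim b$(3)
a$(1)="apple.txt"
a$(2)="berry.txt"
a$(3)="cherry.txt"
b$(1)="berry.txt"
b$(2)="cherry.txt"
b$(3)="date.txt"
for i = 1 to 3
    found = 0
    for j = 1 to 3          'inner loop runs n times for every pass of the outer loop
        if a$(i) = b$(j) then found = 1
    next j
    if found = 0 then print "Unique to 1 "; a$(i)
next i
end
Run the same loop the other way round to get the names unique to list 2 - and that doubling is where the n^2 comparisons really start to bite on 16,000-name lists.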