Post by Rod on Apr 23, 2021 18:46:25 GMT
Given that this is a directory dump from Windows, there should be NO duplicate files. I did not want to raise that complication just yet, as I see from the real data that there are indeed duplicate file names.
f1$(16)="%filename%.accurip"
f1$(17)="%filename%.accurip"
f1$(18)="%filename%.accurip"
These should not really exist. I was hoping testing would bring them to the fore. I am assuming that we have not run multiple directories together in one file list.
Post by tsh73 on Apr 23, 2021 19:37:09 GMT
As I understand it, this list is a compilation from DIFFERENT nested folders, so duplication is possible.
Post by toughdiamond on Apr 23, 2021 20:10:53 GMT
It is indeed common for real hard drives to contain files of the same name that reside in different folders. Your example looks to me as if the code has given the correct result - by adding a dot to one of a number of files called "C", there is now one unique file called "C." in one list and one unique file called "C" in the other. I guess the only thing the program doesn't tell us is the path of the unique file, which would be needed - though I think I could put that right without a lot of trouble. I've already got the paths in a 3rd column of my arrays, so it would be just a matter of adding that to the code that prints the name of the unique file. Have I understood your point?
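Something like this, I think (an untested sketch with made-up data - the assumption is that my arrays hold the filename in column 1 and the path in column 3):
'untested sketch: one made-up record, laid out the way I think my arrays are
dim f1$(1, 3)
i1 = 1
f1$(i1, 1) = "DSCF1130.JPG"        'filename
f1$(i1, 3) = "\Photos\2021\April"  'its folder path
'the ladder's "unique" print line would then become something like:
print "Unique to 1 "; f1$(i1, 1); "  (in "; f1$(i1, 3); ")"
end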
Meanwhile, I've whittled down the 16,000-filename list to a more manageable 36 per list, which still causes problems for the uniques detector.
Here's list 1:
1oice only.docx DSCF1108.JPG DSCF1109.JPG DSCF1110.JPG DSCF1111.JPG DSCF1112.JPG DSCF1115.jpg DSCF1115x.jpg DSCF1116.jpg DSCF1116x.jpg DSCF1119.jpg DSCF1119x.jpg DSCF1121.JPG DSCF1121x.jpg DSCF1122.JPG DSCF1123.JPG DSCF1124.JPG DSCF1125.jpg DSCF1125x.jpg DSCF1126.jpg DSCF1126x.jpg DSCF1127.jpg DSCF1127x.jpg DSCF1130.JPG DSCF1130.jpg DSCF1131.jpg DSCF1131.JPG DSCF1135.JPG DSCF1135.jpg DSCF1138.jpg DSCF1138.JPG DSCF1141.JPG DSCF1141.jpg DSCF1145.jpg DSCF1145.JPG VoiceAmerPsy2017.pdf
And here's list 2:
DSCF1108.JPG DSCF1109.JPG DSCF1110.JPG DSCF1111.JPG DSCF1112.JPG DSCF1115.jpg DSCF1115x.jpg DSCF1116.jpg DSCF1116x.jpg DSCF1119.jpg DSCF1119x.jpg DSCF1121.JPG DSCF1121x.jpg DSCF1122.JPG DSCF1123.JPG DSCF1124.JPG DSCF1125.jpg DSCF1125x.jpg DSCF1126.jpg DSCF1126x.jpg DSCF1127.jpg DSCF1127x.jpg DSCF1130.jpg DSCF1130.JPG DSCF1131.jpg DSCF1131.JPG DSCF1135.jpg DSCF1135.JPG DSCF1138.jpg DSCF1138.JPG DSCF1141.jpg DSCF1141.JPG DSCF1145.JPG DSCF1145.jpg voice only.docx VoiceAmerPsy2017.pdf
I made just the one change - I altered "voice only.docx" to "1oice only.docx" in the first list. In my hands, not only does it give false positives, it also appears to send the program into an infinite loop. That's a rare event as judged by the 20 or so tests I've done on various directory dumps, but it does happen. Most of the tests showed false positives, some came out perfect.
Anyway, see if you can reproduce the fault. That will tell us a lot.
Post by toughdiamond on Apr 23, 2021 20:17:04 GMT
Ah, we cross-posted. As luck would have it, I don't see any duplicated filenames in those lists I've just posted - there are a few that are the same if case is ignored, but case isn't being ignored. So we can worry about duplicated names later.
Post by toughdiamond on Apr 23, 2021 20:30:25 GMT
As I understand it, this list is a compilation from DIFFERENT nested folders, so duplication is possible.

Yes, the original 16,000-plus directory dump is actually the entire contents of my computer's internal data partition, which contains lots of nested folders and probably lots of duplicated filenames - most of them will be different files with the same name, and some may be identical files. That's why I eventually want to add the option of including the dates and sizes in the uniques-finding code, though with a bit of luck that won't be hard. I had that feature in my original program (the one that took hours to run) and it seemed to work. The directory dumps created by the batch file initially include the dates and sizes on the same line as the filename, so it was just a matter of not removing them during the parsing process.
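Just to make the idea concrete for myself (an untested sketch - the three fields and their values are only placeholders for whatever my parsing step produces), including the date and size would simply mean sorting and comparing on a combined key rather than on the bare filename:
'untested sketch: placeholders standing in for fields split out of one dump line
name$ = "DSCF1130.JPG"
size$ = "1234567"
date$ = "23/04/2021 18:46"
key$ = name$ + "|" + size$ + "|" + date$    'the string the ladder would sort and compare
print key$
end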
Post by Rod on Apr 23, 2021 20:50:37 GMT
Right, a little more clarity. It is the dictionary-style sorting that is causing the problem. The ladder needs the records to be in strict order, and with the mixed case we are getting the sort order slightly wrong.
There are probably two parts to the solution. If you are doing your own sort, make it an ASCII sort. If you use the Just BASIC sort, make the comparison case insensitive by forcing a lower-case compare. Either way the records should fall into a consistent order, or else we simply ignore case.
Try this.
dim f1$(37)
dim f2$(37)

f1$(1)="1oice only.docx"
f1$(2)="DSCF1108.JPG"
f1$(3)="DSCF1109.JPG"
f1$(4)="DSCF1110.JPG"
f1$(5)="DSCF1111.JPG"
f1$(6)="DSCF1112.JPG"
f1$(7)="DSCF1115.jpg"
f1$(8)="DSCF1115x.jpg"
f1$(9)="DSCF1116.jpg"
f1$(10)="DSCF1116x.jpg"
f1$(11)="DSCF1119.jpg"
f1$(12)="DSCF1119x.jpg"
f1$(13)="DSCF1121.JPG"
f1$(14)="DSCF1121x.jpg"
f1$(15)="DSCF1122.JPG"
f1$(16)="DSCF1123.JPG"
f1$(17)="DSCF1124.JPG"
f1$(18)="DSCF1125.jpg"
f1$(19)="DSCF1125x.jpg"
f1$(20)="DSCF1126.jpg"
f1$(21)="DSCF1126x.jpg"
f1$(22)="DSCF1127.jpg"
f1$(23)="DSCF1127x.jpg"
f1$(24)="DSCF1130.JPG"
f1$(25)="DSCF1130.jpg"
f1$(26)="DSCF1131.jpg"
f1$(27)="DSCF1131.JPG"
f1$(28)="DSCF1135.JPG"
f1$(29)="DSCF1135.jpg"
f1$(30)="DSCF1138.jpg"
f1$(31)="DSCF1138.JPG"
f1$(32)="DSCF1141.JPG"
f1$(33)="DSCF1141.jpg"
f1$(34)="DSCF1145.jpg"
f1$(35)="DSCF1145.JPG"
f1$(36)="VoiceAmerPsy2017.pdf"

f2$(1)="DSCF1108.JPG"
f2$(2)="DSCF1109.JPG"
f2$(3)="DSCF1110.JPG"
f2$(4)="DSCF1111.JPG"
f2$(5)="DSCF1112.JPG"
f2$(6)="DSCF1115.jpg"
f2$(7)="DSCF1115x.jpg"
f2$(8)="DSCF1116.jpg"
f2$(9)="DSCF1116x.jpg"
f2$(10)="DSCF1119.jpg"
f2$(11)="DSCF1119x.jpg"
f2$(12)="DSCF1121.JPG"
f2$(13)="DSCF1121x.jpg"
f2$(14)="DSCF1122.JPG"
f2$(15)="DSCF1123.JPG"
f2$(16)="DSCF1124.JPG"
f2$(17)="DSCF1125.jpg"
f2$(18)="DSCF1125x.jpg"
f2$(19)="DSCF1126.jpg"
f2$(20)="DSCF1126x.jpg"
f2$(21)="DSCF1127.jpg"
f2$(22)="DSCF1127x.jpg"
f2$(23)="DSCF1130.jpg"
f2$(24)="DSCF1130.JPG"
f2$(25)="DSCF1131.jpg"
f2$(26)="DSCF1131.JPG"
f2$(27)="DSCF1135.jpg"
f2$(28)="DSCF1135.JPG"
f2$(29)="DSCF1138.jpg"
f2$(30)="DSCF1138.JPG"
f2$(31)="DSCF1141.jpg"
f2$(32)="DSCF1141.JPG"
f2$(33)="DSCF1145.JPG"
f2$(34)="DSCF1145.jpg"
f2$(35)="voice only.docx"
f2$(36)="VoiceAmerPsy2017.pdf"

sort f1$(,1,37
sort f2$(,1,37

max1=36    'number of records in array 1
max2=36    'number of records in array 2
i1=1       'record index for file 1
i2=1       'record index for file 2

while i1<max1 and i2<max2
    while lower$(f1$(i1))<lower$(f2$(i2)) and i1<=max1
        print "Unique to 1 ";f1$(i1)
        i1=i1+1
    wend
    while lower$(f1$(i1))=lower$(f2$(i2))
        i1=i1+1
        i2=i2+1
        'did we reach the end
        if i1>max1 or i2>max2 then exit while
    wend
    if i1>max1 or i2>max2 then exit while
    while lower$(f1$(i1))>lower$(f2$(i2)) and i2<=max2
        print "Unique to 2 ";f2$(i2)
        i2=i2+1
    wend
wend

while i1<=max1
    print "Unique to 1 ";f1$(i1)
    i1=i1+1
wend

while i2<=max2
    print "Unique to 2 ";f2$(i2)
    i2=i2+1
wend

end
Notice the additional array item, the sorting of that item and the forcing of lower case during all comparisons.
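A quick way to see what the lower$() is doing (just an illustrative snippet, using two of the names from your list):
'case-sensitive compare: "DSCF1130.JPG" and "DSCF1130.jpg" are different strings,
'and the dictionary sort can leave them in a different relative order in each list.
'case-folded compare: the ladder treats them as the same record.
if lower$("DSCF1130.JPG") = lower$("DSCF1130.jpg") then print "equal once case is folded"
end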
Post by Rod on Apr 23, 2021 21:12:51 GMT
To be clear on duplicates, my code simply matches them and discounts them. Only if there is an extra one in either list will it get listed as an exception. The original task asked for unique files.
Post by toughdiamond on Apr 23, 2021 23:17:43 GMT
To be clear on duplicates, my code simply matches them and discounts them. Only if there is an extra one in either list will it get listed as an exception. The original task asked for unique files.

Sure, and nobody could have anticipated the problem (if there turns out to be one) of multiple files of the same name in different folders on the same volume when we started out. My original request was for a way of speeding up my original program, and it became clear that the best way of doing that is to switch to the ladder method. Happy to confirm that your modification works, at least on the lists I posted. I use Just BASIC to do the sort - what I'm not clear about is whether your inclusion of lower$ in the new code means that forcing lower case elsewhere is still required. I would have thought your revision would be enough to nail that, but I thought I'd better ask. I'll run some more tests, and with a bit of luck it'll work a lot better now.

As for the problem of multiple instances of the same filename, I think the solution might be for me to simply include the pathnames with the filenames, which isn't hard to do. Otherwise we're asking the code to discriminate between identical strings (the plain filenames), which is impossible. So detecting strings unique to one or other volume should be all that the ladder is required to do, and with a bit of luck we're already there. It'll be quite an achievement if it works, because very few duplicate finders bother with uniques, and the ones I've seen that do are rather slow and give false positives.
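In other words (just a sketch of the idea with made-up names, not my actual parsing code), each record fed to the sort and the ladder would become the path plus the filename rather than the bare filename:
'untested sketch: the two literals stand in for a parsed folder path and filename
path$ = "\Photos\2021\April"
name$ = "DSCF1130.JPG"
record$ = path$ + "\" + name$    'this combined string is what gets sorted and compared
print record$
end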
Post by toughdiamond on Apr 24, 2021 7:24:24 GMT
Well, I've tried the new code on the 16,057-filename volume, with a few altered names and a few deleted ones, and so far it's worked fine. I tried several of the scenarios that were causing those false positives yesterday, and they all sailed through perfectly. Thanks for your help, folks, especially Rod of course. If it passes a few more tests, which I think it will, I'll take a look at this "same filename in different folders" notion, and with a bit of luck that'll work out well too.
Post by toughdiamond on Apr 29, 2021 21:01:46 GMT
Tested on volumes containing 58,143 files, which is the biggest collection I have so far. Renamed some, deleted others, unique-detection worked perfectly every time. I couldn't be happier with the results. Very grateful for the help.
So, "all" that remains is to try running it on paths+filenames instead of just paths, to see whether that will show which one of several files of the same name is missing. That and a bit of tidying up my own part of the program.
Post by carlgundel on Apr 30, 2021 16:43:01 GMT
Tested on volumes containing 58,143 files, which is the biggest collection I have so far. Renamed some, deleted others, unique-detection worked perfectly every time. I couldn't be happier with the results. Very grateful for the help. So, "all" that remains is to try running it on paths+filenames instead of just filenames, to see whether that will show which one of several files of the same name is missing. That and a bit of tidying up my own part of the program.

You managed to make it work? That's great. What sort of performance improvement do you see over your original code? -Carl
Post by toughdiamond on Apr 30, 2021 17:46:32 GMT
You managed to make it work? That's great. What sort of performance improvement do you see over your original code? -Carl

Well, most of the credit is due to the other folks here; I just adapted their "ladder" code to fit my program. It's hard to give a figure for the performance improvement, as the time my original program took grew roughly with the square of the number of files on the drives being analysed, whereas the new version should scale close to linearly. For that large volume I just tested, the new program took maybe a couple of minutes (most of that time was spent loading the data), which is more than acceptable. I wouldn't dare try my old program on such a large volume - it would take many hours, and the CPU would be running pretty hot, which isn't good for the life expectancy of the computer.

There's some doubt about how big a volume could be processed before JB ran out of RAM. The exact limit seems imponderable because it depends on the number of files on the volume (the smaller the files are, the more of them can be on there) and on the lengths of the filenames (and pathnames, when I add those to the program). But I have high hopes that all will be well on pretty much any volume up to at least 2TB.
Post by Rod on Apr 30, 2021 18:33:54 GMT
60k files averaging, say, 256 bytes of path/filename is only about 15 MB, well within Just BASIC's 256 MB memory limit. Even then the job could be broken into chunks.
Three minutes vs three hours is pretty stunning.
I know this because in olden times I used to run an overnight update that finished somewhere between mid-morning and lunchtime. Folks were stunned one day, arriving at work to find the update was already done and dusted! Yep, I had gone and read a book instead of inventing the process myself.
Post by carlgundel on May 1, 2021 3:37:17 GMT
60k files averaging, say, 256 bytes of path/filename is only about 15 MB, well within Just BASIC's 256 MB memory limit. Even then the job could be broken into chunks. Three minutes vs three hours is pretty stunning. I know this because in olden times I used to run an overnight update that finished somewhere between mid-morning and lunchtime. Folks were stunned one day, arriving at work to find the update was already done and dusted! Yep, I had gone and read a book instead of inventing the process myself.

Three mins vs. three hours is a 60x speed improvement. Would anyone like to hazard a short description comparing the two approaches to implementation?
Post by tsh73 on May 1, 2021 6:29:52 GMT
How about going from an O(n^2) comparison of the unordered lists to an O(n log n) sort plus an O(n) comparison of the sorted arrays?
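Roughly speaking (this is only a toy sketch, not either of the actual programs), the old approach is a nested loop - every name in list 1 checked against every name in list 2 - while the new one sorts both lists once and then walks them in step, as in Rod's ladder code above.
'toy sketch of the old O(n^2) idea, with made-up three-item lists
dim a$(3)
dim b$(3)
a$(1)="apple.txt"
a$(2)="berry.txt"
a$(3)="cherry.txt"
b$(1)="berry.txt"
b$(2)="cherry.txt"
b$(3)="date.txt"
for i = 1 to 3
    found = 0
    for j = 1 to 3          'inner loop runs n times for every pass of the outer loop
        if a$(i) = b$(j) then found = 1
    next j
    if found = 0 then print "Unique to 1 "; a$(i)
next i
end
Run the same loop the other way round to get the names unique to list 2 - and that doubling is where the n^2 comparisons really start to bite on 16,000-name lists.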