|
Post by Rod on May 1, 2021 7:45:36 GMT
Yes, less cryptic. The key to the improvement was a change of technique. We had been taking the two lists unsorted. We then took the first item from list one and compared it to all items in list two. Then item two and so on. We did 60000*60000 comparisons. The new technique sorted both lists. Now the first entry in list one should match the first entry in list two. If the lists had been perfect we would make only 60000 comparisons. A stunning improvement in speed.
The complication is that the lists may not be perfect, so we had to cater for list one being ahead of list two, behind list two, or in step with list two. It is fairly easy to check for these three conditions as we move through the file.
The overarching lesson is sorting. Many database tutorials and data-handling examples show how to add, delete, or amend records. A few go on to show how to find a record, but none show beginners the power of sorting. The classic example will simply trawl through the whole database to find or select records. In the real world this is unsustainable, and sorting is an essential skill to acquire.
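For anyone who wants to see the shape of that in code, here is a minimal sketch of the sorted-walk idea in Just BASIC style. The array names and sample filenames are made up for illustration, and it is not the actual program from this thread, just the three conditions (behind, in step, ahead) in their simplest form:

dim list1$(10)
dim list2$(10)
list1$(1) = "alpha.txt" : list1$(2) = "bravo.txt"   : list1$(3) = "delta.txt"
list2$(1) = "alpha.txt" : list2$(2) = "charlie.txt" : list2$(3) = "delta.txt"
n1 = 3
n2 = 3
i = 1
j = 1
while i <= n1 and j <= n2
    if list1$(i) = list2$(j) then
        'in step: the entries match, so advance both lists
        i = i + 1
        j = j + 1
    else
        if list1$(i) < list2$(j) then
            'list one is behind: its entry has no partner in list two
            print "only in list 1: "; list1$(i)
            i = i + 1
        else
            'list one is ahead: list two's entry has no partner in list one
            print "only in list 2: "; list2$(j)
            j = j + 1
        end if
    end if
wend
'whatever is left over in either list is also unmatched
while i <= n1
    print "only in list 1: "; list1$(i)
    i = i + 1
wend
while j <= n2
    print "only in list 2: "; list2$(j)
    j = j + 1
wend
end

Each pass through the loop settles one entry, so two sorted 60000-item lists need on the order of 120000 comparisons rather than 60000*60000.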
|
|
|
Post by carlgundel on May 1, 2021 16:16:47 GMT
Yes, the algorithm is really important. LB Pro has a profiler which counts up how many milliseconds of time are spent on different lines of code. This is useful, but it's too easy to approach performance with blinders on, so that you think the only thing that can be done is to try and tweak what the profiler indicates is slow. Sometimes you have to throw away most of the code and just do it in a completely different way.
|
|
|
Post by toughdiamond on May 1, 2021 19:45:05 GMT
How does that SORT command manage to work so much faster than a bubblesort?
|
|
|
Post by tsh73 on May 1, 2021 21:17:06 GMT
Bubble sort is a really naive way of sorting, taking n^2 time. Read up on QuickSort; it and a few others have a smaller running time (n*log(n)). JB has a quicksort example (qsort.bas) in the example folder, and it really does work faster.
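The bundled qsort.bas is the thing to run, but for readers who just want the gist, here is a small stand-alone quicksort sketch in Just BASIC style; the array name item$() and the sample data are made up for illustration, and this is not the code from qsort.bas:

dim item$(7)
item$(1) = "pear"  : item$(2) = "apple" : item$(3) = "orange"
item$(4) = "kiwi"  : item$(5) = "banana": item$(6) = "plum"
item$(7) = "cherry"

call qsort 1, 7

for i = 1 to 7
    print item$(i)
next i
end

sub qsort lo, hi
    'sort item$(lo..hi) by partitioning around a middle pivot value,
    'then sorting each side recursively
    if lo < hi then
        pivot$ = item$(int((lo + hi) / 2))
        i = lo
        j = hi
        while i <= j
            while item$(i) < pivot$
                i = i + 1
            wend
            while item$(j) > pivot$
                j = j - 1
            wend
            if i <= j then
                temp$ = item$(i)
                item$(i) = item$(j)
                item$(j) = temp$
                i = i + 1
                j = j - 1
            end if
        wend
        call qsort lo, j
        call qsort i, hi
    end if
end sub

In this sketch the recursion stops by itself: each call only works on strictly smaller pieces, and a piece of one item (lo >= hi) is left untouched, so there is no need to decide a number of passes in advance.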
|
|
|
Post by toughdiamond on May 2, 2021 2:00:03 GMT
OK, I took a look. I'm somewhat more familiar with the way QuickSort works now, though I can't say I know all about it. Judging by the numerous attempts on the Web to explain it, I'm not the only one who finds it a hard concept to get. I couldn't find anything about how anybody might decide how many reiterations to perform in order to be sure the sorting was complete (apparently the algorithm itself doesn't test whether or not the job is complete). Is QuickSort particularly similar to JB's SORT command?
|
|
|
Post by Rod on May 2, 2021 6:43:20 GMT
First and foremost, the sort is run at near machine-code speed, so a lot faster than our interpreted BASIC code. It is highly likely to be a quicksort routine. Since we compare higher or lower and there are two piles, these piles exclude more and more comparisons as the sort progresses. So as it gets the list in order it has to make fewer and fewer comparisons to insert the item. That's where the speed comes from: reducing the comparisons.
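To put rough numbers on that (back-of-envelope figures, assuming the 60000-item lists mentioned earlier): an n^2 method like bubble sort would need on the order of 60000*60000 = 3,600,000,000 comparisons, while an n*log(n) method like quicksort needs roughly 60000*16, or about 1,000,000, since log2(60000) is about 16. That is a factor of a few thousand before the near machine-code speed of the built-in routine is even counted.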
|
|
|
Post by B+ on May 2, 2021 16:23:31 GMT
|
|
|
Post by toughdiamond on May 2, 2021 23:08:11 GMT
They're interesting. I think I'd have to slow them down a little to keep up. I also noticed some salient ideas here: Further down that page is this gif illustration, which again was somewhat helpful, though again I'd have preferred it to run slower:
qph.fs.quoracdn.net/main-qimg-d4e5d0a778dba725091d8317e6bac939
This video is almost slow enough for me to grasp with one viewing, though it stops short of showing the entire sort.
I suspect it's one of those things where there's no sudden moment where the point is entirely understood, but rather a gradual continuum of getting increasingly used to it.
|
|
|
Post by toughdiamond on May 10, 2021 1:07:08 GMT
Well, the addition of paths seems to work. To recap, problems were anticipated in the first working version of the program, exemplified by the following scenarios:
(1) The two volumes presented for comparing each contain more than one file of the same name (residing in different subfolders), and one of those files is absent from one volume.
(2) As above, but also another of those files (residing in a different subfolder) is absent from the other volume.
The original program would correctly report one unique filename in scenario (1), but naturally it wouldn't report which subfolder it was missing from, because it deals exclusively with filenames, not paths. More of a concern is that in scenario (2), the program would report no unique files.
The solution was to add the corresponding path to each filename in the array, for both volumes. After that, the unique files get reported correctly in both scenarios, along with the relevant paths, which is obviously quite handy.
A minor complication is that the two volumes could have different drive letters, which would spoil everything by causing all the files to be reported as unique. That can be fixed by removing the drive letter from the path, easily done with mid$. That works fine when the volumes are two drives, such as a duplicated archive, but it can happen that a volume is contained in a folder on a drive, rather than being the sole occupant of the drive. For example, a flash drive's contents might get backed up onto a larger drive which already contained other files, in which case the files would normally be backed up into a new folder on the destination drive. In such cases it's not enough to remove just the drive letter.
To get the program to work under those circumstances, what I did was to find the length of the pathname of the volume's containing folder, and to remove that number of characters from the left of every pathname in the array, using mid$. For example, if I have a flash drive backup in this containing folder:
F:\SundryBackups\128_GB_FlashDriveNo4\
The length of that is 38, so I remove the first 38 characters from each entry in the array. The flash drive itself might be:
G:\
Its length is 3, so I remove 3 characters from each entry in that array. The remainder of the path name is now identical for corresponding entries across each volume to be compared.
It was pretty easy to get the program to do all that automatically, because the directory dumps I use always contain the path of the "root" folder first, so the first element in the array that holds the paths contains the string that needs removing. It's just a matter of doing this:
length=len(f1$(1))
for k=1 to max1
    f1$(k)=mid$(f1$(k),length)
next k
and then the same for f2$ and max2.
I've given it a few tests, on drives and on folders, and so far the results have been perfect. It's hard to anticipate all possible real-life scenarios for which the program might be used. In its current state it will only show files that are completely missing, but it should be possible to include file sizes and dates/times, in case there's ever a need to detect alterations to files as well as omissions.
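In case it helps to see that stripping step on its own, here is a tiny self-contained version using the same len/mid$ trick; the folder and file names are made up for illustration:

dim f1$(3)
f1$(1) = "F:\SundryBackups\128_GB_FlashDriveNo4\"
f1$(2) = "F:\SundryBackups\128_GB_FlashDriveNo4\Docs\notes.txt"
f1$(3) = "F:\SundryBackups\128_GB_FlashDriveNo4\Photos\cat.jpg"
max1 = 3

'the first entry holds the root path, so its length tells us how much to strip
length = len(f1$(1))
for k = 1 to max1
    f1$(k) = mid$(f1$(k), length)
next k

for k = 1 to max1
    'prints "\", "\Docs\notes.txt" and "\Photos\cat.jpg"
    print f1$(k)
next k
end

Run the same loop over the second volume's array and the corresponding entries line up regardless of where each volume's root lives.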
|
|
|
Post by toughdiamond on May 24, 2021 23:50:53 GMT
It's all going pretty well - I've got a version of the program that allows me to choose whether I want to run it on just the filenames, or to include the paths, dates, times, or sizes. It worked very nicely, except that occasionally a pair of identical elements would get reported as unique to both volumes. Increasing the values of max1 and max2 (the number of elements) to be 1 greater than the total numbers of elements fixed it in all the cases where I'd seen it happen, as I'd rather expected it would, but caused another slight problem - in some cases the program reported an empty string as being a unique element in each volume. If you run the example program below you'll see that happen. It's "fixable" by adding
if f1$(i1)<>"" then
and
if f2$(i2)<>"" then
to the front of the lines that report the uniques.
I've included those as commented-out lines in the program - there are 4 of them. Substituting them for the existing lines naturally gets rid of the trouble, and can't do any harm, but I was just wondering if I've done it the best way. It's all working in every scenario I've tested, but it would be nice to know whether there's a better way or whether the fixes are likely to do the job fine as they are. Thanks in advance.
dim f1$(500000)
dim f2$(500000)

f1$(1)=" 0 1.rtf"
f1$(2)=" 51 1.bat"
f1$(3)=" 51 1.bat.backup"
f1$(4)=" 689 ProposedNewAdaption.txt"
f1$(5)=" 719 OldNon-workingAdaption.txt"
f1$(6)=" 996 ZroposedNewAdaption.rtf"
f1$(7)=" 1,022 OldNon-workingAdaption.rtf"
f1$(8)=" 6,832 uniques-AlphabeticallyOrdered.baj"
f1$(9)=" 11,769,262 SavedAlpha.rtf"

f2$(1)=" 0 2.rtf"
f2$(2)=" 51 1.bat.backup"
f2$(3)=" 689 ProposedNewAdaption.txt"
f2$(4)=" 719 OldNon-workingAdaption.txt"
f2$(5)=" 996 ProposedNewAdaption.rtf"
f2$(6)=" 1,022 OldNon-workingAdaption.rtf"
f2$(7)=" 6,837 uniques-AlphabeticallyOrdered.baj"
f2$(8)=" 11,769,262 AavedAlpha.rtf"

max1=10
max2=9

unq=0
i1=1 'record index for file 1
i2=1 'record index for file 2

while i1<max1 and i2<max2
    while lower$(f1$(i1))<lower$(f2$(i2)) and i1<=max1
        print "Unique to 1 (";i1;") ";f1$(i1):unq=unq+1
'        if f1$(i1)<>"" then print "Unique to 1 ";i1;" ";f1$(i1):unq=unq+1
        i1=i1+1
    wend
    while lower$(f1$(i1))=lower$(f2$(i2))
        i1=i1+1
        i2=i2+1
        'did we reach the end
        if i1>max1 or i2>max2 then exit while
    wend
    if i1>max1 or i2>max2 then exit while
    while lower$(f1$(i1))>lower$(f2$(i2)) and i2<=max2
        print "Unique to 2 (";i2;") ";f2$(i2):unq=unq+1
'        if f2$(i2)<>"" then print "Unique to 2 ";i2;" ";f2$(i2):unq=unq+1
        i2=i2+1
    wend
wend
while i1<=max1
    print "Unique to 1 (";i1;") ";f1$(i1):unq=unq+1
'    if f1$(i1)<>"" then print "Unique to 1 ";i1;" ";f1$(i1):unq=unq+1
    i1=i1+1
wend
while i2<=max2
    print "Unique to 2 (";i2;") ";f2$(i2):unq=unq+1
'    if f2$(i2)<>"" then print "Unique to 2 ";i2;" ";f2$(i2):unq=unq+1
    i2=i2+1
wend
print
print "reported ";unq;" unique filenames"
end
|
|
|
Post by Rod on May 25, 2021 10:03:39 GMT
If it works and passes your tests, it works; it doesn't matter how you fixed it.
|
|
|
Post by toughdiamond on May 26, 2021 20:18:27 GMT
OK, if there's no theoretical way of knowing whether or not it'll work every time, then trial and error is all I have. I suppose the algorithm is just too complex to be certain it'd always work. Looking around for information about Quicksort (not directly the issue here of course, because the elements appear to be correctly sorted in my example), it seems that attempts to formally verify it have been made, though I don't know whether anybody has succeeded:
stackoverflow.com/questions/66752412/can-spark-be-used-to-prove-that-quicksort-actually-sorts
Anyway, I guess it's all rather academic. It's already impossible to be sure of anything on a Windows computer. I'll run a few more test jobs and call it done and dusted if they work. And don't get me wrong, I'm very pleased with it.
|
|