|
Post by toughdiamond on Apr 15, 2021 0:16:52 GMT
I've written a rudimentary JB program that finds the names and paths of files that are unique to one or the other of two disk drives. Nothing fancy - it simply searches through two directory dumps line by line and reports anything that's in one but not the other. The feature is available in one or two proprietary "duplicate finders", but when I tried those, they gave too many false positives or proved too cumbersome to use for that purpose, so I thought I'd try writing one myself.
It works well enough, but if I try to use it to analyse the contents of large (i.e. 2TB) drives, which is the main purpose of the program, it takes many hours to complete the task. That's hardly surprising, because it takes each filename in turn from one directory listing and looks through the whole of the other directory listing, so if I have, say, 100,000 files on each drive it will do 100,000 x 100,000 = 10,000,000,000 comparisons. (Actually that's an oversimplification and in practice it doesn't have to do quite that many, but you get the idea - the running time grows with the square of the number of files on the drives, hence the problem.)
I think I've tweaked the program enough to get the running time down to something reasonable, i.e. a few hours for the number of files I'm likely to want to process, but I was wondering if anybody knows of any tips for improving the speed? Essentially all it does is read everything into an array and do "IF a$(n) = b$(o) THEN......" I found that splitting the directory listing into 2 parts and storing it in 2 separate arrays doubled the speed (because it only has to look at half the number of filenames per test), but I got into diminishing returns when I tried splitting it further, and 4 parts is as far as it seems worth going. I've also found that using separate single-dimension arrays for the parts - w$(n), x$(n), etc. - runs slightly faster than using one 2-dimensional array such as w$(n,1), w$(n,2) etc.
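For illustration, the heart of the slow part is something like this (a much simplified sketch with made-up names and data - the real program also carries dates and sizes along with each name, and splits the lists as described):

' simplified sketch of the brute-force comparison (made-up data)
nA = 3
nB = 3
dim a$(nA)
dim b$(nB)
a$(1) = "\music\song1.mp3"
a$(2) = "\music\song2.mp3"
a$(3) = "\music\song3.mp3"
b$(1) = "\music\song1.mp3"
b$(2) = "\music\song3.mp3"
b$(3) = "\music\song4.mp3"
for i = 1 to nA
    found = 0
    for j = 1 to nB               'worst case: every name checked against every name
        if a$(i) = b$(j) then
            found = 1
            exit for
        end if
    next j
    if found = 0 then print "Unique to drive A: "; a$(i)
next i
end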
So my question is, does anybody have any ideas how I might speed things up further?
Thanks in advance :-)
|
|
|
Post by xxgeek on Apr 15, 2021 1:04:39 GMT
You may get better performance depending on your "write caching" setting in Windows. A better explanation here - www.tenforums.com/tutorials/21904-enable-disable-disk-write-caching-windows-10-a.html
You "may" get better performance if you create the tkn file, run it instead of using the JB editor to run it, and give it a higher priority in Task Manager. Depends on your system settings and what else is running alongside your Just Basic program. If your system is a busy one you might get better performance, with fewer interruptions as it does its work, by using the higher priority.
If you have lots of RAM, another possible way is to create 2 ramdrives, copy files to each drive, then run your app on the files in the ramdrives, then copy the files back where they belong. Worth testing if you want speed. Using a batch file might help with the file copy to and from the ramdrives, or do that copy using JB code. I'll let the ace programmers help with adjusting any code that can speed the process up.
Edit - You should post your code. Others can see better what might be adjusted to help you out.
Sorry, I misunderstood you. Create a ramdrive, and put your dump files in it, then run your app on the dump files in the ramdrive. Should cut the time down considerably.
|
|
|
Post by B+ on Apr 15, 2021 1:42:10 GMT
If the files are in alpha order you can use a binary search for duplicates; that would cut the time considerably.
Warning: JB has an odd way of ordering strings! You might be better off setting up a hash table.
While on the subject of alpha order, I just thought of another time-consuming issue: case.
If case does not matter, it might benefit you to list the files in all caps, to save time converting or testing in upper (or lower) case - and that might save you from JB's goofiness too. 2 birds, one stone. A rough sketch of the binary search idea is below.
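Something along these lines - an untested sketch that assumes the names are already sorted ascending in a global array name$() with n entries, upper-cased as suggested:

' tiny demo of a binary search on a sorted, all-caps list (untested sketch)
n = 5
dim name$(n)
name$(1) = "ALPHA.TXT"
name$(2) = "BETA.TXT"
name$(3) = "DELTA.TXT"
name$(4) = "GAMMA.TXT"
name$(5) = "OMEGA.TXT"
print binFind("GAMMA.TXT", n)    'prints 4
print binFind("ZETA.TXT", n)     'prints 0 (not found)
end

' returns the index of target$ in name$(1..n), or 0 if it is not there
function binFind(target$, n)
    low = 1
    high = n
    binFind = 0
    while low <= high
        m = int((low + high) / 2)
        if name$(m) = target$ then
            binFind = m
            low = high + 1          'found it, stop the loop
        else
            if name$(m) < target$ then
                low = m + 1
            else
                high = m - 1
            end if
        end if
    wend
end function

With a sorted list you only do about log2(n) comparisons per lookup - around 17 for 100,000 names - instead of up to 100,000.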
|
|
|
Post by toughdiamond on Apr 15, 2021 7:57:18 GMT
Thanks - I'm using Windows 7 but the tutorial will likely give me an idea what "write caching" is, and hopefully it's possible with Win7 by similar methods. Certainly it's worth trying the conversion to .tkn. As for increasing the priority, there's one thing that might complicate matters: I noticed quite early on that the program uses a lot of CPU, and the fan speed increases very noticeably. Not being too sure of the wisdom of running the CPU hot for a number of hours (with regard to ageing my computer), I've taken to reducing the amount of CPU available to JB with a utility called BES, which keeps the core reasonably cool, but naturally also slows it down, by a factor of about 1.8 at the current settings. I figure it might be worth the speed loss, since in spite of it I seem to have got the likely processing time down to something approaching reasonable, which is why I'm now looking for ways of making the code as efficient as possible, to get it down further still.
. . . .Given that I'm limiting the CPU load, would giving the program a higher priority just conflict with my efforts to keep the core temperature down? There aren't a lot of resources used by other processes on this computer unless I run other programs, which naturally would be nice, but I don't know how they'd behave with my program running in the background. It's a 2.1 GHz AMD A4 3310MX APU, dual-core. I've got 8 GB of RAM but sadly I'm using 32-bit Win7, which can only use 2.48 GB of it.
. . . .On the ramdrive idea, I don't quite understand - the program loads the filenames into string arrays, and that's not what takes the bulk of the processing time. AFAIK that means the filenames are already in RAM before the hard work of searching them begins.
. . . .Yes, I should post the code, and I will. But I'd best tidy it up first, as currently it's not going to be easy to figure out what it's up to. I'm almost entirely self-taught so my programming style is sometimes unconventional, naive, and downright scruffy - I only discovered JB recently, after many years of (mis-)using GW-BASIC.
. . . .I probably didn't explain it very clearly before. See above, about the slow part of the program being the search part, which happens after the data has all been taken up into RAM.
|
|
|
Post by toughdiamond on Apr 15, 2021 8:54:17 GMT
If the files are in alpha order you can use a binary search for duplicates; that would cut the time considerably. Warning: JB has an odd way of ordering strings! You might be better off setting up a hash table.
The two directory dumps aren't in alpha order. So far, I create them with a DOS command, for example: dir /og /s >directorydump.txt
They're probably in alpha order per subfolder, but AFAIK there's no DOS switch to alphabetize the whole shebang as a single entity. I did try running a bubble sort in JB to order them, and it worked, but it took as long to run as the program itself, so overall it wasn't going to save any time - I'd been hoping that once it was alphabetized, the search itself could be made much quicker.
What is a hash table? I can't think of any instances where case would matter. So what could I do to take advantage of that - convert everything using upper$ (or lower$), then what?
|
|
|
Post by Rod on Apr 15, 2021 10:12:20 GMT
Untested, just my initial thoughts. Just BASIC has its own FILES command that fills an array instantly with your directory info. It also has a SORT command that can sort sections of that array (the files) instantly, so no bubble sort needed. Repeat for your other folder and you have two arrays in order. You should be able to jump through those rather than trawl through everything.
As to cache and priority I am less sure. I know you can up the priority, but if nothing else is running you have all the priority anyway. I think modern Windows pretty much caches all disc activity to RAM.
If I get some time I will try and code a test. But the sun is shining and I have model aeroplanes to fly....
This will give you an overview of the FILES command: libertybasiccom.proboards.com/thread/1505/files-command
|
|
|
Post by toughdiamond on Apr 15, 2021 10:15:59 GMT
OK, for what it's worth, here's the listing of the program. Hope my coding style isn't too nebulous to follow. Frankly it even confuses me when I try to fathom it. But it does work, apart from the speed problem.
It's the section starting with the label [LookForDuplication] that's by far the most in need of speeding up, because that's the section that gets called as many times as there are files in the directory dump, and the section itself has to look through a lot of array elements every time it's called. Everything else happens fairly promptly.
Uniques.bas.rtf (8.01 KB)
|
|
|
Post by honky on Apr 15, 2021 10:32:32 GMT
Wouldn't it be more relevant to compare selected directories rather than disks?
|
|
|
Post by toughdiamond on Apr 15, 2021 11:44:00 GMT
Untested, just my initial thoughts. Just BASIC has its own FILES command that fills an array instantly with your directory info. It also has a SORT command that can sort sections of that array (the files) instantly, so no bubble sort needed. Repeat for your other folder and you have two arrays in order. You should be able to jump through those rather than trawl through everything.
Thanks, I didn't know those commands. FILES sounds good, but probably of limited use for the speed of my program, because loading the files as such doesn't take long even with the repeated LINE INPUT# command it uses. The real "long haul" part is comparing each filename in a very long list with each filename in another very long list, which takes ages even when it's all in memory.
. . . .The SORT command, though, sounds like it could work wonders. If I can get the file list alphabetized then I should be able to write a routine to find matches a lot more efficiently. There's a complication - each entry in my array has the date and file size followed by the filename, so using the command directly would presumably only sort them by date. And I don't want to simply trim away the date and size, because the program currently gives me the added option of including those in the search for identical files (in order to identify unique files), which is useful (if a file has a different size but the same name, it's probably not the same file). I guess it will be a matter of putting the full entries into one column of a 2-dimensional array, trimming them, putting the trimmed versions into the second column, and running SORT on those, which hopefully will reorder both columns - there's a rough sketch of what I mean at the end of this post. If that works, then "all" I have left to figure out is exactly how to use it as a basis for a more efficient search.
. . . .Yes, it could be that the higher priority is only useful for making sure the program doesn't slow down if I run other things. When my program is the only thing running (apart from Windows and its associated processes), it typically takes half the available CPU. But I usually limit JB to about 36% to keep the CPU cool, so there's a lot of spare capacity for running other stuff, as long as Windows doesn't do anything silly.
Well, it looks like I'll be opting for the SORT option anyway, unless my reasoning turns out to be faulty (wouldn't be the first time). Enjoy your flight :-)
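Here's the kind of thing I have in mind - an untested sketch with made-up entries; it assumes JB's SORT can take a column number for a 2-dimensional array and keeps each row together when it sorts:

' rough sketch of the 2-column plan (untested; assumes SORT accepts a column number)
' entry$(i, 1) = full line (date, size, path+name), entry$(i, 2) = just the name
n = 3
dim entry$(n, 2)
entry$(1, 1) = "2021-01-02  123456  \music\zebra.mp3"
entry$(2, 1) = "2020-05-06  654321  \video\apple.avi"
entry$(3, 1) = "2019-07-08  111111  \pics\mango.jpg"
for i = 1 to n
    'keep only the part we want to sort on - here, whatever follows the last backslash
    p$ = entry$(i, 1)
    while instr(p$, "\") > 0
        p$ = mid$(p$, instr(p$, "\") + 1)
    wend
    entry$(i, 2) = upper$(p$)
next i
sort entry$(), 1, n, 2    'sort rows 1..n on column 2, hopefully dragging column 1 along
for i = 1 to n
    print entry$(i, 2); "  <-  "; entry$(i, 1)
next i
end

If SORT really does keep the columns together like that, the comparison routine can then walk the two sorted lists instead of scanning everything.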
|
|
|
Post by toughdiamond on Apr 15, 2021 12:28:51 GMT
Wouldn't it be more relevant to compare selected directories rather than disks?
Well, the main purpose of the program is this: I have a large collection of music, videos, pictures and other files, and until recently their final resting place was a 2TB USB drive. I began to realise that my growing and increasingly valuable (to me) collection was in danger of being irrevocably lost if the drive ever broke down, so I got a second drive and proceeded to copy the stuff onto it. For some of the files I was silly enough to use Windows drag and drop, and it came up with error messages about some of the combined filenames and pathnames being "too long to copy," but without telling me the full names and paths of the files it had skipped, and without even giving me the option to save a log. So naturally I needed to find out which files were present on one volume but not the other. There are many directories on the drives and I don't know which ones contained the files that hadn't been copied.
Another option would be to wipe the drive that doesn't contain all the files and then copy everything back onto it using something that works properly, like RoboCopy, but it's a lot of data and it would take a long time, especially as the drive has a rather slow write speed. I could also use RoboCopy with the /MIR switch to "sync" the drive to the original, but I'm not convinced it would be particularly quick and efficient. I've had strange results from RoboCopy where there ended up being more files on the copy than the original - that was on a 128 GB flash drive. I still don't know what the extra files are.
. . . .So I started looking around for a way of finding skipped and superfluous files, and nothing seems to have been written that will do it properly, so I thought I'd try writing something myself. It could prove useful for any situation where I (or the copying utilities) made a mess of duplicating a large volume.
|
|
|
Post by B+ on Apr 15, 2021 16:59:25 GMT
Oh good - a built-in FILES command (I didn't know), and I forgot about the new SORT command, also built in. That should help tremendously with speed (and SORT surely must work normally when comparing strings). Looks like the FILES command will set up a file name field by which you may sort. The question remains that when comparing strings yourself, JB has an odd way of ordering them. Rod wrote up a string-comparing routine some time ago, but the problem may be solved with all caps? You need to compare strings when doing a binary search for a duplicate name. I can dig out some code for that if no one else has it handy.
And using the old method of piping or redirecting the dir listing into a txt file, it could be done sorting by filename first - just look up the correct switches here: docs.microsoft.com/en-us/windows-server/administration/windows-commands/dir
Looks like you can customize the sorting order on multiple fields with the /o switch. I think I would try that first because you are almost there with that method already, so no learning curve for the JB FILES and SORT commands, and the Windows routines are likely to run fast! Get 2 sorted lists upon which to apply binary searches. Heck, Windows probably has apps or commands for comparing folders and drives.
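For example (if I'm reading the switch list right), something like dir /s /o:n > directorydump.txt should at least sort each folder's section by name, and /o:gn keeps the directories-grouped layout of the /og switch already in use - though either way it still sorts per folder, not the whole tree as one list.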
|
|
|
Post by Rod on Apr 15, 2021 17:35:44 GMT
You get 50% usage on the CPU because you are using a single core - 50% of the dual-core capacity. But you can't use both cores with Just BASIC, or with many other programs.
This uses the native FILES command. I did not need SORT since the directory listing is already in alpha order. If the file names differ in upper or lower case you can force a lower-case comparison with lower$().
I have no idea if my search routine catches everything or will work in a more fragmented environment, but it does find the missing files in my backup directory.
s=time$("ms")
dim info1$(10,10)
dim info2$(10,10)
files "c:\basic\a scratchpad", info1$()
files "e:\basic\a scratchpad", info2$()
no1=val(info1$(0,0))    'number of files found in each folder
no2=val(info2$(0,0))
print "files found in 1 :";no1
print "files found in 2 :";no2
'walk both alpha-ordered lists in step, advancing whichever index lags behind
i1=1
i2=1
while i1<no1 or i2<no2
    'names in list 1 that come before the current name in list 2 have no match
    while info1$(i1,0)<info2$(i2,0) and i1<=no1
        print "Unique to 1 ";info1$(i1,0)
        i1=i1+1
    wend
    'matching names - step past them in both lists
    while info1$(i1,0)=info2$(i2,0) and i1<=no1 and i2<=no2
        i1=i1+1
        i2=i2+1
    wend
    'names in list 2 that come before the current name in list 1 have no match
    while info1$(i1,0)>info2$(i2,0) and i2<=no2
        print "Unique to 2 ";info2$(i2,0)
        i2=i2+1
    wend
wend
print "Completed in ms ";time$("ms")-s
end
Output is
|
|
|
Post by tsh73 on Apr 15, 2021 18:01:02 GMT
Please tell, how many files have you got?
|
|
|
Post by toughdiamond on Apr 15, 2021 22:42:22 GMT
Please tell, how many files have you got?
Right now, 58,141 files. So that would require about 3,300,000,000 comparisons if the program checked each filename in one volume against each filename in the other, though I would expect it to be about half that in practice, because once it's found a match for a file, it can move on to the next one. I'm also thinking of future requirements. If I ever fill the drive and need to run the program again, I calculated that it might contain about 80,000 files, which would require almost twice as many comparisons - and it could be even more if they were smaller files. And they're already selling drives bigger than my 2TB drives. It's hard to know what the upper limit of the challenge for my program might be.
|
|
|
Post by B+ on Apr 15, 2021 23:23:42 GMT
Rod's code ought to cut the amount of comparing by a vast amount. I would guess it's around 2 * the number of files.
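To put rough numbers on that, from the figures above: the brute-force way is roughly 58,141 x 58,141, around 3.4 billion comparisons, while walking two sorted lists side by side as Rod's code does is roughly 58,141 + 58,141, about 116,000 comparisons - plus whatever the sorting itself costs, which the dir switches or JB's SORT would handle for you.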
|
|