All site contents ©2024 Casey Connor unless otherwise noted.

Translating/stripping file names while copying from one filesystem to another

Last Updated: Monday, 30 July 2018

Stripping, sanitizing, or otherwise translating file names when copying or moving between file systems with different naming rules can be a pain. I had to copy many GB of files off a friend's dead Mac iBook G4. Macs use the HFS+ file system, and while ext3/4 actually allows a lot of the same odd characters, you don't see them as often in an OS where people use command lines, and anyway I was copying to a FAT32 partition. The other complication (given some of the other multi-step solutions out there) is that on Linux an HFS+ file system mounts read-only if journaling is on, which it generally is. The only way I'm aware of to get around that is to turn journaling off in Mac OS, and I don't have access to that. (There may well be another way...) So, in this situation, barring some two-stage HFS+ -> ext4/ntfs -> rename -> FAT32 process, you can't rename the files in place; they have to be renamed as they are copied.

So, the challenge is to move the files off and rename them in the process. There were files with "'", "\", """, ":", and even a couple with "^M" codes (carriage returns) in the file names. I also find spaces in file names annoying, so I replaced those as well. Alphanumerics, single underscores, hyphen, dot, and $ were to be left alone.

Here is an ugly, and not perfect, way to do it. I'm sure there are special characters that it doesn't cover, and it doesn't handle "collisions": e.g. the file "a'b.txt" will be renamed to "a_b.txt" whether or not there is already a file called "a_b.txt" in the destination (it doesn't alter the source file, of course, since this works with a read-only file system source). Nonetheless, this worked very well for me. It'll get you 99% of the way there, and you can manually deal with the outliers and/or edit it to suit your needs.
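If you want to know about collisions before copying anything, something like the following might help. It is only a sketch: it assumes the replacement rule used throughout this page (anything other than alphanumerics, "_", "-", "/", "." and "$" becomes "_", and runs of "_" collapse to one), it does the substitution with sed rather than the gawk call used in the main command, and list_collisions is a made-up helper name.

```shell
#!/bin/sh
# list_collisions (hypothetical helper): reads one path per line on stdin and
# prints each sanitized path that more than one input path maps to.
list_collisions() {
    # Same two-step rule as the main command, done with sed instead of gawk.
    # LC_ALL=C keeps the A-Z / a-z ranges limited to ASCII letters.
    LC_ALL=C sed -e 's|[^A-Za-z0-9_./$-]|_|g' -e 's|__*|_|g' \
        | LC_ALL=C sort \
        | uniq -d
}

# Example: two of these three names end up with the same destination name.
printf '%s\n' "a'b.txt" "a_b.txt" "c.txt" | list_collisions   # prints a_b.txt
```

On a real tree you would feed it the same file list the main command uses, e.g. find Desktop/ -xtype f | list_collisions. Anything it prints is a destination name that two or more source files would fight over.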

This assumes you're moving whole directories, as opposed to individual files or groups of files within a directory. You can always prune out what you don't want after the move. It will replace all weird characters or series thereof with a single "_" in both directory and file names.

This will recursively copy all subdirectories and files.

First cd to the directory that contains the one you want to move. E.g. if you want to move "Desktop/", cd to "/media/mydisk/Users/myusername/", which contains "Desktop".

You must modify the following command to name both the directory you want to move (in two places) and the destination path (in two places, and you must use a trailing slash). In this example, the directory is "Desktop" and the destination path is "/my/destination/path/". The result will be a directory created at /my/destination/path/Desktop/ with the files (and directory structure) inside.

find Desktop/ -xtype d | awk '{print "mkdir /my/destination/path/" gensub(/_+/, "_", "g", gensub(/[^A-Za-z0-9_\-\/.\$]/, "_", "g", $0)) }' > /tmp/cmd.txt ; find Desktop/ -xtype f | sed "s/\\\\/\\\\\\\\/g;s/'/\\\\'/g;s/,/\\\\,/g;s/;/\\\\;/g;s/ /\\\\ /g;s/\&/\\\\&/g;s/(/\\\\(/g;s/)/\\\\)/g;s/:/\\\\:/g" | awk '{print "cp " $0 " /my/destination/path/" gensub(/_+/, "_", "g", gensub(/[^A-Za-z0-9_\-\/.\$]/, "_", "g", $0)) }' >> /tmp/cmd.txt

This generates a script file called /tmp/cmd.txt (we haven't copied anything yet).

Examine the file to make sure things look like they're working OK. The first part of the file is a bunch of "mkdir" commands, followed by the "cp" commands. If it's all good, then do:

source /tmp/cmd.txt

...and cross your fingers.

After running it I do something like:

find Desktop -type f | wc -l

...and compare to the file count in the destination directory to make sure everything was copied over. A "du -h" is probably a good sanity check as well.
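The count comparison can be wrapped up in a couple of small functions. This is a minimal sketch, not part of the original procedure: count_files and compare_counts are hypothetical names, and it counts regular files only (the same -type f test as the command above).

```shell
#!/bin/sh
# count_files (hypothetical helper): number of regular files under a directory.
count_files() {
    # tr strips the padding some wc implementations put around the number.
    find "$1" -type f | wc -l | tr -d ' '
}

# compare_counts SRC DEST (hypothetical helper): report whether both trees
# hold the same number of regular files; returns non-zero on a mismatch.
compare_counts() {
    src_count=$(count_files "$1")
    dest_count=$(count_files "$2")
    if [ "$src_count" -eq "$dest_count" ]; then
        echo "OK: $src_count files in both"
    else
        echo "MISMATCH: $src_count in source, $dest_count in destination"
        return 1
    fi
}
```

With the example paths from this page you would run compare_counts Desktop /my/destination/path/Desktop. A lower count in the destination can mean a failed cp, but it can also mean two source names were translated to the same destination name.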

Note that for an HFS Mac partition you may need to "sudo su" before running this command, as system permissions will probably lock you out of user directories and so forth, regardless of how you mount the partition.

So, how does it work?

First, we recreate the (stripped/translated) directory structure in the destination:

find Desktop/ -xtype d | awk '{print "mkdir /my/destination/path/" gensub(/_+/, "_", "g", gensub(/[^A-Za-z0-9_\-\/.\$]/, "_", "g", $0)) }' > /tmp/cmd.txt

The nested gensubs (gensub is a GNU awk extension, so this needs gawk) first replace any character that isn't in [A-Za-z0-9_\-\/.\$] (that is, anything other than A through Z, a through z, 0 through 9, underscore, hyphen, forward slash, dot, and $) with "_", and then reduce any run of "_"s to a single "_". The result is output as a series of "mkdir" commands that recreate the directory structure in the destination.
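To see what the two substitutions do to a given name without touching any files, here is a minimal sketch of the same rule as a shell function. It is a stand-in rather than the command itself: it uses sed instead of gawk's gensub, and sanitize_name is a hypothetical helper name.

```shell
#!/bin/sh
# sanitize_name (hypothetical helper): applies the same two-step rule as the
# nested gensubs, using sed instead of gawk.
sanitize_name() {
    # Step 1: replace every character outside A-Z a-z 0-9 _ . / $ - with "_".
    # Step 2: collapse each run of "_" into a single "_".
    # LC_ALL=C keeps the A-Z / a-z ranges limited to ASCII letters.
    printf '%s\n' "$1" \
        | LC_ALL=C sed -e 's|[^A-Za-z0-9_./$-]|_|g' -e 's|__*|_|g'
}

sanitize_name "Desktop/My  File (1).doc"   # prints Desktop/My_File_1_.doc
sanitize_name "Desktop/a'b.txt"            # prints Desktop/a_b.txt
```

Note that step 2 also collapses underscores that were already doubled in the original name, so "a__b" comes out as "a_b".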

Next comes:

find Desktop/ -xtype f | sed "s/\\\\/\\\\\\\\/g;s/'/\\\\'/g;s/,/\\\\,/g;s/;/\\\\;/g;s/ /\\\\ /g;s/\&/\\\\&/g;s/(/\\\\(/g;s/)/\\\\)/g;s/:/\\\\:/g" | ....

This nightmarish sed command puts a backslash in front of each strange character in the source path (the list is "\", "'", ",", ";", " ", "&", "(", ")", and ":"). There is probably a much easier way to do this, but hey. There are also undoubtedly characters that should be in this list but aren't (the double quote, for one).
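One candidate for that easier way, offered as a sketch and not as what the command above does: wrap the whole path in single quotes, and rewrite any single quote inside it as '\''. Nothing else is special inside single quotes, so there is no list of characters to maintain. shell_quote is a hypothetical helper name, and like the find pipeline it assumes no file name contains a newline.

```shell
#!/bin/sh
# shell_quote (hypothetical helper): print its argument as one single-quoted
# shell word. Inside single quotes nothing is special except "'" itself,
# which is rewritten as '\'' (close quote, escaped quote, reopen quote).
shell_quote() {
    printf '%s\n' "$1" | sed -e "s/'/'\\\\''/g" -e "s/^/'/" -e "s/\$/'/"
}

shell_quote "Bob's notes (final): a & b.txt"
# prints 'Bob'\''s notes (final): a & b.txt'
```

A "cp" line whose source path is quoted this way should survive spaces, parentheses, ampersands, double quotes and the like without each one needing its own substitution.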

That's piped into:

.... | awk '{print "cp " $0 " /my/destination/path/" gensub(/_+/, "_", "g", gensub(/[^A-Za-z0-9_\-\/.\$]/, "_", "g", $0)) }' >> /tmp/cmd.txt

Which just generates the "cp" commands (with the escaped source path, and simultaneously cleaning up the destination path with the same nested gensubs as before) and appends to the same /tmp/cmd.txt file.

Comments/corrections welcome.