The problem with Drive is that it’s really easy to end up with loads of files with the same name spread across multiple folders. Here’s how to get a handle on that, without having to write a bunch of code.

I’ll be using the cUseful library, specifically these techniques.

Here’s the key for the cUseful library; it’s also on GitHub, or below.

Mcbr-v4SsYKJP7JMohttAZyz3TLx7pV4j

The objective is to write a report of duplicate files to a spreadsheet. On my private version of this, I also have an algorithm for cleaning them up, but I’m not releasing that for now, as I don’t want you to accidentally delete files you didn’t mean to, so this version concentrates on reporting duplicates.

It’s pretty slow to look through thousands of files organized across many folders, so you may need to do it in chunks by using the startFolder path setting, as well as particular mime types, and perhaps the search terms too.

I’m using caching to avoid reading the folder structure too many times. It’s a big job, so the first time you do it, it will take a while. If your folder structure is not changing much, you can set the cache stay-alive time (drive.cacheSeconds) to a higher number.
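To give a feel for what that caching looks like, here is a minimal sketch, not the cUseful implementation: the cache key and the shape of the map are assumptions, and note that CacheService entries are capped at 6 hours and about 100KB.

// Minimal caching sketch - not the cUseful implementation.
// Caches a map of folder path -> folder id so a repeat run can skip
// walking the whole folder tree again.
function getFolderMap(cacheSeconds) {

  var cache = CacheService.getUserCache();
  var cached = cache.get('folderMap');        // assumed cache key
  if (cached) return JSON.parse(cached);

  // walk the folder tree from the root, recording path -> id
  var map = {};
  (function walk(folder, path) {
    map[path] = folder.getId();
    var children = folder.getFolders();
    while (children.hasNext()) {
      var child = children.next();
      walk(child, path + child.getName() + '/');
    }
  })(DriveApp.getRootFolder(), '/');

  // CacheService allows a maximum of 21600 seconds (6 hours)
  cache.put('folderMap', JSON.stringify(map), Math.min(cacheSeconds, 21600));
  return map;
}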

Settings

It starts with the settings, which look like this
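Here’s a minimal sketch of that settings namespace; the default values and the spreadsheet ID are just placeholders rather than what’s in the full version.

var Settings = (function(ns) {

  ns.drive = {
    dapi: DriveApp,                       // always DriveApp in this version
    startFolder: '/',                     // path to start looking in
    mime: MimeType.GOOGLE_APPS_SCRIPT,    // blank means all mime types
    recurse: true,                        // include subfolders
    acrossFolders: true,                  // same name in another folder counts as a duplicate
    acrossMimes: false,                   // same name with another mime type counts as a duplicate
    useCache: true,                       // take the folder structure from cache if available
    cacheSeconds: 3600,                   // how long the cache lasts
    search: ''                            // extra api search terms, if any
  };

  ns.report = {
    sheetID: 'xxxxx',                     // placeholder - the spreadsheet to write results to
    sheetName: 'duplicates',              // the sheet name to write results to
    min: 2                                // minimum number of duplicates to include
  };

  return ns;

})(Settings || {});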

Here’s an example output. Here I can see duplicate file names for scripts across folders.

Some notes on the Settings

  • drive.dapi – This is always DriveApp in this version. It could also use the advanced Drive service, but strangely that is slower than DriveApp. I may implement different APIs in a future version.
  • drive.startFolder – the folder to start looking in. This is a path like /documents/abc/. The drive root is /. If you find yourself running out of runtime quota, you’ll have to do it in chunks by path specification.
  • drive.mime – use a standard mime type. If you leave it blank, all files will be considered.
  • drive.recurse – If false, only the start folder is looked at. If true, all the subfolders are included.
  • drive.acrossFolders – if true then a file with the same name in a different folder is considered to be a duplicate.
  • drive.acrossMimes – if true then a file with the same name but a different mime type is considered to be a duplicate.
  • drive.useCache – if true, then the folder structure is retrieved from cache (if available) rather than being built from Drive.
  • drive.cacheSeconds – how long the cache should last. If your folder structure doesn’t change much, you can make this big.
  • drive.search – you can use any of the API search terms to further filter the search. If you have specified a value for drive.mime, it is added to this search term.
  • report.sheetID – the sheet to write the results to.
  • report.sheetName – the sheet name to write the results to.
  • report.min – the minimum number of duplicates a name needs to be included in the report. The slow part is finding the parents of a file in order to figure out its path, so the first thing the code does is remove all files that don’t have enough name repetitions to qualify for the report, as sketched just below. Making this 1 will enumerate the whole drive, but it will take a long time since every file will be fully processed, so be careful of blowing your quota.
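Here’s a simplified sketch of that pre-filter; it’s not the actual code on GitHub, and the shape of the file list is an assumption.

// Simplified sketch of the report.min pre-filter - not the actual code on GitHub.
// Names are counted first, and only files whose name repeats at least min times
// go on to the expensive step of resolving parents to build their paths.
function filterByMinCount(files, min) {

  // files is assumed to be an array of objects like {name: 'x', id: 'y'}
  var counts = files.reduce(function(p, c) {
    p[c.name] = (p[c.name] || 0) + 1;
    return p;
  }, {});

  return files.filter(function(file) {
    return counts[file.name] >= min;
  });
}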

Example settings

Find duplicate file names, ignoring mime types and folders, but starting at the given startFolder.
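The settings for that would be along these lines; only the relevant properties are shown, and the start folder path is just an example.

// duplicates by name only - mime type and folder are ignored
Settings.drive.startFolder = '/documents/abc/';   // example path - use your own
Settings.drive.mime = '';                         // all mime types
Settings.drive.recurse = true;                    // include subfolders
Settings.drive.acrossFolders = true;              // same name in another folder is a duplicate
Settings.drive.acrossMimes = true;                // same name with another mime type is a duplicate
Settings.drive.search = '';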

Find all files containing 'test' in the name, of any mime type.
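Again just a sketch; note that the DriveApp search syntax refers to a file’s name as its title.

// all files with 'test' in the name, whatever the mime type
Settings.drive.startFolder = '/';                 // the whole drive
Settings.drive.mime = '';                         // all mime types
Settings.drive.recurse = true;
Settings.drive.acrossFolders = true;
Settings.drive.acrossMimes = true;
Settings.drive.search = "title contains 'test'";  // extra api search term
Settings.report.min = 1;                          // report every match, not just duplicates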

The code

It’s on GitHub, or below, or there’s a copy of the developing version here. You’ll need the Settings namespace from the beginning of this post too.
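As a rough outline of what the reporting stage does, here’s a much simplified sketch. It isn’t the cUseful-based code on GitHub, and it leaves out the caching, the start folder and the path resolution described above.

// Much simplified sketch of the reporting flow - not the code on GitHub.
// It searches Drive, groups files by name (and by mime type too when
// acrossMimes is false), drops groups smaller than report.min, and writes
// one row per file to the report sheet.
function reportDuplicates() {

  var drive = Settings.drive, report = Settings.report;

  // collect candidate files
  var files = [];
  var it = drive.search ? DriveApp.searchFiles(drive.search) : DriveApp.getFiles();
  while (it.hasNext()) {
    var file = it.next();
    files.push({
      name: file.getName(),
      mime: file.getMimeType(),
      url: file.getUrl()
    });
  }

  // group by name (plus mime type if duplicates must share a mime type)
  var groups = files.reduce(function(p, c) {
    var key = drive.acrossMimes ? c.name : c.name + '|' + c.mime;
    (p[key] = p[key] || []).push(c);
    return p;
  }, {});

  // one row per file for every group with enough members
  var rows = [['name', 'mime', 'count', 'url']];
  Object.keys(groups).forEach(function(key) {
    if (groups[key].length >= report.min) {
      groups[key].forEach(function(file) {
        rows.push([file.name, file.mime, groups[key].length, file.url]);
      });
    }
  });

  // write the report
  SpreadsheetApp.openById(report.sheetID)
    .getSheetByName(report.sheetName)
    .getRange(1, 1, rows.length, rows[0].length)
    .setValues(rows);
}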

For more like this, see Google Apps Scripts Snippets.
Why not join our forum, follow the blog or follow me on Twitter to ensure you get updates when they are available.