czkawka

Instruction

Czkawka currently has two independent frontends, a terminal interface and a graphical interface, which share the same core module.

GUI GTK

GUI overview

The GUI is built from different pieces:

Translations

The GUI is fully translatable.
At least 10 languages are currently supported (some were machine-translated).

Opening/Manipulating files

It is possible to open selected files by double-clicking on them.

To open multiple files, select them while holding the CTRL key and, still holding it, double-click the selected items with the left mouse button.

To open the folder containing a selected file, double-click the file with the right mouse button.

To invert a selection of files, click on a file with the middle mouse button, and it will invert the selection of the other files in the same group.

Adding directories

By default, the current path is loaded as the included directory, and the excluded directories are filled with default paths.

It is possible to override this by passing arguments when starting the app, e.g. czkawka_gui /home /usr --/home/rafal --/home/zaba, which means that the /home and /usr directories will be checked and /home/rafal and /home/zaba will be excluded.

When additional command line arguments are used, the save-at-exit option is disabled, so the current directory settings will not be saved unless the user saves them manually.

Both relative and absolute paths are supported, so the user can use both ../home and /home.
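
The argument convention above can be sketched roughly as follows. This is only an illustration of the rule as described (a leading -- marks an excluded directory, and relative paths are resolved against the current directory), not Czkawka's actual argument parser; split_directories and its parameters are hypothetical names.

```python
import posixpath


def split_directories(args, cwd):
    """Split GUI-style arguments into (included, excluded) directories.

    Assumption: an argument starting with "--" names an excluded directory,
    and every other argument names an included one.
    """
    included, excluded = [], []
    for arg in args:
        if arg.startswith("--"):
            target, raw = excluded, arg[2:]
        else:
            target, raw = included, arg
        # Relative paths such as ../home are resolved against cwd;
        # absolute paths are kept as they are.
        target.append(posixpath.normpath(posixpath.join(cwd, raw)))
    return included, excluded
```

With the example from above, /home and /usr end up in the included list, and /home/rafal and /home/zaba in the excluded one.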

CLI

The Czkawka CLI frontend is well suited to automating tasks such as removing empty directories.

To get general usage information, run czkawka_cli in a console.

You should see many examples of how to use the app.

If you want more detailed information about a particular tool, add -h or --help after its name.

By default, all tools only print their results to the console, but with specific arguments it is possible to delete some files or save the results to a file.

Config/Cache files

Currently, Czkawka stores a few config and cache files on disk:

Editing the bin files may cause strange crashes; if any occur, removing these files should help.
Files with the JSON extension can be modified (this may be helpful when moving files to a different disk or trying to use a cache file on a different computer). To do this, enable the option in settings that also generates a JSON cache file; that file can then be edited. By default, cache files with the bin extension are loaded, but if one is missing (it may have been renamed or removed), data is loaded from the JSON file if it exists.
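
The loading order described above (bin first, JSON only as a fallback) could look roughly like this. It is a minimal sketch, not the real loader; pick_cache_file and the stem parameter are hypothetical, and the actual cache file names are not assumed here.

```python
from pathlib import Path


def pick_cache_file(cache_dir, stem):
    """Return the cache file that would be loaded, or None if neither exists.

    The .bin file is preferred; the .json file is used only when the .bin
    file is missing (for example because it was renamed or removed).
    """
    for extension in (".bin", ".json"):
        candidate = Path(cache_dir) / f"{stem}{extension}"
        if candidate.is_file():
            return candidate
    return None
```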

Config files are located at these paths:

Linux - /home/username/.config/czkawka
Mac - /Users/username/Library/Application Support/pl.Qarmin.Czkawka
Windows - C:\Users\Username\AppData\Roaming\Qarmin\Czkawka\config

Cache files are located here:

Linux - /home/username/.cache/czkawka
Mac - /Users/username/Library/Caches/pl.Qarmin.Czkawka
Windows - C:\Users\Username\AppData\Local\Qarmin\Czkawka\cache

Tips, Tricks and Known Bugs

Tools

Duplicate Finder

Duplicate Finder allows you to search for files and group them according to a predefined criterion:

Empty Files

Searching for empty files is easy and fast, because we only need to check the file metadata and its length.
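
As an illustration of why this is cheap, here is a minimal sketch that relies only on file metadata and never opens a file. It is not the tool's own code; find_empty_files is a hypothetical name.

```python
import os
import stat


def find_empty_files(root):
    """Return a sorted list of regular files under root with length 0."""
    empty = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            info = os.lstat(path)  # metadata only, symlinks are not followed
            if stat.S_ISREG(info.st_mode) and info.st_size == 0:
                empty.append(path)
    return sorted(empty)
```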

Empty Directories

At the beginning, a special entry is created for each directory, containing the parent path (only if it is not a folder directly selected by the user) and a flag indicating whether the directory is empty (initially, every directory is marked as potentially empty).

First, user-defined folders are put into the pool of folders to be checked.

Each element is checked to see if it is:

Example: there are 4 checked folders which may be empty: /cow/, /cow/ear/, /cow/ear/stack/, /cow/ear/flag/.

The last folder contains a file, which means that /cow/ear/flag/ is not empty, and neither are any of its parents (/cow/ear/ and /cow/), but /cow/ear/stack/ may still be empty.

Finally, all folders with the flag FolderEmptiness::Maybe are treated as empty.
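
The flag propagation described above can be sketched as follows. This is a simplified illustration, not the actual implementation: the FolderEmptiness name is borrowed from the text, while find_empty_dirs, the entry layout, and the rule that any non-directory item makes a folder non-empty are assumptions.

```python
import os
from enum import Enum


class FolderEmptiness(Enum):
    NO = 0
    MAYBE = 1


def find_empty_dirs(roots):
    """Return a sorted list of directories that contain no files at any depth."""
    entries = {}  # path -> [parent path or None, flag]
    pool = [(os.path.abspath(root), None) for root in roots]
    while pool:
        current, parent = pool.pop()
        # Every directory starts out as potentially empty.
        entries[current] = [parent, FolderEmptiness.MAYBE]
        with os.scandir(current) as items:
            for item in items:
                if item.is_dir(follow_symlinks=False):
                    pool.append((item.path, current))
                    continue
                # A non-directory item: this folder and all of its parents
                # are definitely not empty.
                node = current
                while node is not None and entries[node][1] is FolderEmptiness.MAYBE:
                    entries[node][1] = FolderEmptiness.NO
                    node = entries[node][0]
    return sorted(p for p, (_, flag) in entries.items() if flag is FolderEmptiness.MAYBE)
```

For the /cow/ example, only /cow/ear/stack/ keeps the Maybe flag and is reported as empty.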

Big Files

The size of each file inside the given path is read, and after the list is sorted, the largest files (e.g. the 50 largest) are displayed.
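
A minimal sketch of this idea is shown below. It is an illustration only; biggest_files and its count parameter are hypothetical names.

```python
import heapq
import os


def biggest_files(root, count=50):
    """Return up to `count` (size, path) pairs, largest first."""
    sizes = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                sizes.append((os.lstat(path).st_size, path))
            except OSError:
                continue  # the file vanished or is unreadable, skip it
    return heapq.nlargest(count, sizes)
```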

Temporary Files

Searching for temporary files only involves comparing their extensions with a previously prepared list.

Currently, files with these extensions are considered temporary files:

["#", "thumbs.db", ".bak", "~", ".tmp", ".temp", ".ds_store", ".crdownload", ".part", ".cache", ".dmp", ".download", ".partial"]

This only removes the most basic temporary files; for more, I suggest using BleachBit.
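
The comparison can be sketched like this. The list is copied from above; treating each entry as a case-insensitive suffix of the file name is an assumption, and is_temporary is a hypothetical name.

```python
TEMP_EXTENSIONS = (
    "#", "thumbs.db", ".bak", "~", ".tmp", ".temp", ".ds_store",
    ".crdownload", ".part", ".cache", ".dmp", ".download", ".partial",
)


def is_temporary(file_name):
    """Return True if the file name ends with one of the listed entries."""
    # str.endswith accepts a tuple and checks every suffix in it.
    return file_name.lower().endswith(TEMP_EXTENSIONS)
```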

Invalid Symlinks

To find invalid symlinks, we must first find symlinks.

Once they are found, the target of each symlink is checked; if it does not exist, the symlink is added to the list of invalid symlinks as pointing to a non-existent path.

The second mode detects recursive symlinks. Unfortunately, this mode does not currently work: using it produces a non-existent target error instead. It is implemented by counting the jumps of the symlink; after exceeding a certain number (e.g. 20), the symlink is considered recursive.
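
Both checks can be sketched by following a symlink one jump at a time. This only illustrates the idea described above and is not the tool's code; classify_symlink, the returned labels, and the limit of 20 jumps (taken from the "e.g. 20" above) are assumptions.

```python
import os

MAX_JUMPS = 20  # illustrative limit


def classify_symlink(path):
    """Return None for a valid symlink, else "non_existent" or "recursive"."""
    current = path
    for _ in range(MAX_JUMPS):
        if not os.path.islink(current):
            # The end of the chain was reached: it either exists or it does not.
            return None if os.path.exists(current) else "non_existent"
        target = os.readlink(current)
        # A relative target is resolved against the directory of the link;
        # os.path.join keeps an absolute target unchanged.
        current = os.path.join(os.path.dirname(current), target)
    return "recursive"
```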

Same Music

This is a mode to find identical music files through tags.

The number of tags to choose from is limited by an external library.

First, music files with one of the extensions [".mp3", ".flac", ".m4a"] are collected.

Then for each music file its tags are read.

Then, for each selected tag by which we want to search for duplicates, we perform the following steps:
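
As a rough illustration only (this does not reproduce the exact steps of the tool), grouping by tags can be thought of as collecting files whose selected tags all match. The names group_by_tags, tracks and selected_tags are hypothetical, and skipping files with a missing tag is an assumption.

```python
from collections import defaultdict


def group_by_tags(tracks, selected_tags):
    """Group file paths whose selected tags are all identical.

    tracks maps a file path to a dict of its tags, e.g. {"title": "..."}.
    """
    groups = defaultdict(list)
    for path, tags in tracks.items():
        key = tuple(tags.get(tag, "") for tag in selected_tags)
        if all(key):  # assumption: files missing a selected tag are skipped
            groups[key].append(path)
    # Only groups with at least two files are duplicates.
    return [sorted(paths) for paths in groups.values() if len(paths) > 1]
```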

Similar Images

This tool finds similar images that differ, for example, in watermark or size.

The tool first collects images with specific extensions that can be checked - [".jpg", ".jpeg", ".png", ".bmp", ".tiff", ".tif", ".pnm", ".tga", ".ff", ".gif", ".jif", ".jfi", ".ico", ".webp", ".avif"].

Next, cached data is loaded from file to avoid hashing the same file twice.
By default, cache entries that point to non-existent files are deleted automatically.

Then a perceptual hash is created for each image that is not available in the cache.

A cryptographic hash (used, for example, in ciphers) gives completely different outputs for similar inputs:
11110 ==> AAAAAB
11111 ==> FWNTLW
01110 ==> TWMQLA

A perceptual hash gives similar outputs for similar inputs:
11110 ==> AAAAAB
11111 ==> AABABB
01110 ==> AAAACB

The computed hashes are then placed into a special tree that allows hashes to be compared using Hamming distance.
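
To make the idea concrete, here is a minimal sketch of a mean-style perceptual hash on a tiny grayscale grid, together with the Hamming distance between two hashes. It does not use the hashing library or the tree from the tool; mean_hash and hamming_distance are hypothetical helpers, and the pixel grids are made-up example data.

```python
def mean_hash(pixels):
    """Hash a grayscale grid: one bit per pixel, 1 if brighter than the mean."""
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    return [1 if value > mean else 0 for value in flat]


def hamming_distance(hash_a, hash_b):
    """Count the positions at which two equally long hashes differ."""
    return sum(a != b for a, b in zip(hash_a, hash_b))
```

Two grids that differ only slightly in brightness produce the same bits (distance 0), while a mirrored grid flips every bit.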

Next, these hashes are saved to file so that the images do not need to be hashed again later.

Finally, each hash is compared with the others, and if the distance between them is less than the maximum distance specified by the user, the images are considered similar and removed from the pool of images still to be searched.
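
The comparison step can be sketched as a simple pairwise loop. The real tool uses a tree for this, so the sketch only shows the grouping logic; group_similar is a hypothetical name, hashes are represented as plain integers, and treating the maximum distance as inclusive is an assumption.

```python
def group_similar(hashes, max_distance):
    """Group names whose integer hashes are within max_distance bits.

    hashes maps an image name to its hash stored as an integer.
    """
    pool = list(hashes.items())
    groups = []
    while pool:
        name, value = pool.pop(0)
        # XOR leaves a 1 bit wherever the two hashes differ.
        similar = {n for n, v in pool if bin(value ^ v).count("1") <= max_distance}
        if similar:
            groups.append([name] + sorted(similar))
            # Matched images leave the pool and are not searched again.
            pool = [(n, v) for n, v in pool if n not in similar]
    return groups
```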

One of 5 hash types can be chosen: Gradient, Mean, VertGradient, Blockhash, DoubleGradient.
Before hashes are calculated, images are usually resized with a specific algorithm (Lanczos3, Gaussian, CatmullRom, Triangle, Nearest) to e.g. 8x8 or 16x16 (allowed sizes: 8x8, 16x16, 32x32, 64x64), which simplifies later computations. Both size and filter can be adjusted in the application.

Each configuration saves its results to a different cache file to protect users from invalid results.

Some images break the hash functions and produce hashes full of 0 or 255, so these images are silently excluded from the end results (but are still saved to cache).
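
Such a filter could look roughly like this, assuming the hash is available as a sequence of bytes and that "full of 0 or 255" means every byte has the same one of those two values. is_degenerate_hash is a hypothetical name.

```python
def is_degenerate_hash(hash_bytes):
    """Return True if every byte of a non-empty hash is 0, or every byte is 255."""
    if len(hash_bytes) == 0:
        return False
    return all(b == 0 for b in hash_bytes) or all(b == 255 for b in hash_bytes)
```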

You can test each algorithm with the provided CLI tool: put a test.jpg file into a folder and run this command inside it: czkawka_cli tester -i

The faster compare option compares results only once, so checking should be much faster when a higher similarity value is used.

Some tidbits:

Similar Videos

This tool works similarly to Similar Images.

It requires FFmpeg to work, so it shows an error when FFmpeg is not found in the OS.
It also only checks files that are longer than 30s.
For now, it is limited to comparing video files of almost equal length.

First, it collects video files by extension (mp4, mpv, avi etc.).
Next, each file is hashed. The implementation is hidden in a library, but it appears to generate 10 images from the video and hash them with a perceptual hash.

These hashes are saved to cache so they can be reused later.

Next, using the tolerance provided by the user, they are compared with each other and groups of similar hashes are returned.

Broken Files

This tool finds files which are corrupted or have an invalid extension.

First, the app collects image and archive files (only these two types are supported now; I also plan to support audio, but this is currently blocked by https://github.com/RustAudio/rodio/issues/349), and then these files are simply opened.

If an error occurs when opening such a file, it means that the file is corrupted or unsupported.
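
The "just try to open it" approach can be illustrated for ZIP archives. Note that this sketch uses Python's standard zipfile module in place of the crates the app relies on, so it only shows the principle; is_broken_zip is a hypothetical name.

```python
import zipfile


def is_broken_zip(path):
    """Return True if the archive cannot be opened or a member fails its CRC check."""
    try:
        with zipfile.ZipFile(path) as archive:
            # testzip() returns the name of the first bad member, or None.
            return archive.testzip() is not None
    except (zipfile.BadZipFile, OSError):
        return True
```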

Only some file extensions are handled, because I rely on external crates. Also, some false positives may be shown (e.g. https://github.com/image-rs/jpeg-decoder/issues/130), so always open the file to check whether it is really broken.

Bad Extensions

This mode finds files whose content does not match their extension.

It works in this way:

In the Proper Extension column, the extension guessed by the Infer library is shown inside parentheses, and outside the parentheses are the extensions that have the same MIME type as the guessed one.
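
The general idea of comparing content with the extension can be sketched with a few well-known file signatures. This is not how the Infer library is used in the app; the tiny signature table and the names proper_extensions and has_bad_extension are assumptions made only for illustration.

```python
# A few well-known magic numbers and the extensions that fit them.
SIGNATURES = {
    b"\x89PNG\r\n\x1a\n": ("png",),
    b"\xff\xd8\xff": ("jpg", "jpeg"),
    b"GIF8": ("gif",),
}


def proper_extensions(header):
    """Guess the fitting extensions from the first bytes of a file."""
    for magic, extensions in SIGNATURES.items():
        if header.startswith(magic):
            return extensions
    return ()


def has_bad_extension(file_name, header):
    """Return True if the content type is known and the extension does not fit."""
    current = file_name.rsplit(".", 1)[-1].lower() if "." in file_name else ""
    extensions = proper_extensions(header)
    return bool(extensions) and current not in extensions
```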