Restoring WhatsApp media Exif dates and fixing duplicates

WhatsApp ushered the world into the age of cross-platform multimedia instant messaging back in January 2009.

To be fair, multimedia instant messaging was already available on PCs as early as 1998 with the likes of Yahoo Messenger and MSN Messenger. I recall, with much amusement, having to rush home right after school so that I could continue chatting with my friends on my computer.

On mobile we had the SMS (Short Message Service) and later on, the notoriously expensive MMS (Multimedia Messaging Service). Even today, it costs a whopping S$0.35 (US$0.26) to send a single MMS in Singapore.

With the advent of WhatsApp and other cross-platform multimedia instant messaging services, it became so easy to send photos, videos, and audio to someone else that we've forgotten how annoying it was to do so just slightly over a decade ago.

The Age of Media

As much as I'd like to praise WhatsApp for their pivotal role in changing the way we connect with others, WhatsApp has also been the bane of all smartphones, especially so in the last few years as people generated more and more media per day.

Particularly detestable are "Good Morning" images sent by the same individuals in multiple group chats. They result in media duplication and serve no purpose other than to take up precious storage space on my phone, thereby shortening its effective lifespan.

No amount of positivity will make up for the loss of storage for this

Storage woes

I'm sure you've experienced this at least once: your phone runs out of storage space, you check what's taking up so much of it, and you find something like the following:

In case you're wondering, this list of apps is sorted in descending order of storage utilization

Despite my best efforts to be prudent with my WhatsApp media downloads, WhatsApp still takes up 18GB of storage. The next largest app is Google at 0.9GB, occupying over 19 times less space, and the other apps are even smaller than that! For a budget smartphone with around 32GB of storage available, this is pretty much a death sentence.

Backing up to free storage

As much as I'd like to discard my WhatsApp media, life as we know it has become so intertwined with WhatsApp that discarding it would be akin to discarding a decade of precious moments with my family and friends.

This becomes especially important if group chats contain photos or videos of loved ones who have passed on. In my case, it was my grandfather, who succumbed to COVID-19 in August 2021. I'd like to keep all of my family's moments with him, recorded in WhatsApp, for as long as I live, and perhaps one day peruse them with my children and speak of his strength and bravery.

To ensure the longevity of not just my phone but those of my parents, aunts and uncles, I had to back up the media somewhere safe on my NAS and delete the media from the WhatsApp folder on all phones.

Photoview
Photoview is a simple and user-friendly Photo Gallery for self-hosted personal servers. It is made for photographers and aims to provide an easy and fast way to navigate directories, with thousands of high resolution photos.

At the same time, I had to ensure that the backed-up media could still be viewed without too much inconvenience, so I decided to host an instance of Photoview for my family to browse their own WhatsApp media backups.

Where is my Exif?!

I imported the WhatsApp photos and videos into Photoview and it was at this point I realized that something was amiss.

All the photos were shown to be taken today (October 17, 2021).
There was no order whatsoever differentiating photos taken in 2013 from photos taken in 2021.

Checking one of the WhatsApp photos in Preview on macOS reveals that the Exif data is missing, as shown by the missing Exif tab in the Inspector window.

Where is my Exif??

A quick search on the internet confirmed my fears, that WhatsApp not only compresses the images, but also discards all Exif data in a bid to save storage space.

Well thanks WhatsApp for being so considerate about my phone's storage to the point where you discard such precious metadata to save a couple of bytes.

But all hope is not lost yet.

Restoring Exif data

Looking closely at the file naming scheme of WhatsApp media, I realized that the date is still preserved in the filename. For example, in one of the images, IMG-20191008-WA0009.jpeg, we can see that the photo was taken on 2019-10-08 (20191008), came from WhatsApp (WA), and was the 9th photo of the day (0009). This means that the file naming scheme is as follows:

IMG-{YYYYMMDD}-WA{monotonic-integer}.{jpg/jpeg}

With that knowledge in hand, I wrote a Python script that accepts a path as an argument and, for each file in the path:

  1. Checks if the file is an image (.jpg or .jpeg)
  2. Checks if the image is created by WhatsApp (-WA- in filename)
  3. Parses the filename for the date (-YYYYMMDD- in filename)
  4. Reads the Exif data of the image (if it still exists)
  5. Adds the date into the Exif data of the image (Set to 00:00hrs)
  6. Writes the Exif data into the image
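As a rough illustration of steps 1 to 3 and 5, here is a minimal sketch that uses only the Python standard library. It is not the script from the repository: the function name `exif_date_from_filename` and the regular expression are my own assumptions, based only on the naming scheme above. Reading and writing the Exif block itself (steps 4 and 6) needs a third-party Exif library and is left out.

```python
import re
from datetime import datetime
from typing import Optional

# Assumed pattern for names such as IMG-20191008-WA0009.jpeg.
# The ".*" tolerates suffixes like " (2)" before the extension.
WHATSAPP_IMAGE = re.compile(r"^IMG-(\d{8})-WA\d+.*\.jpe?g$", re.IGNORECASE)


def exif_date_from_filename(filename: str) -> Optional[str]:
    """Return an Exif-style timestamp for a WhatsApp image name, else None."""
    match = WHATSAPP_IMAGE.match(filename)
    if match is None:
        return None  # not a .jpg/.jpeg file that follows the WhatsApp scheme
    try:
        date = datetime.strptime(match.group(1), "%Y%m%d")
    except ValueError:
        return None  # eight digits that do not form a real calendar date
    # Exif stores timestamps as "YYYY:MM:DD HH:MM:SS".
    # The filename carries no time of day, so midnight is used.
    return date.strftime("%Y:%m:%d %H:%M:%S")


print(exif_date_from_filename("IMG-20191008-WA0009.jpeg"))
```

The returned string is in the format that the Exif Date Time Original field expects, so it could be handed to whichever Exif library does the actual writing.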

After running the script, the Exif tab returns in the Inspector window, with the Date Time Original field.

Oh hello Exif! It's nice to have you back, or at least a part of you.

What about video dates?

Video files do not use Exif as their metadata store, so Photoview checks their created/modified dates and uses them as the video date. Unfortunately, backing up and restoring WhatsApp media tends to butcher those fields, so that bit of information is lost as well.

To fix this issue for videos, in the same Python script, I wrote the following logic that handles video files separately:

  1. Checks if the file is a video (.mp4 or .3gp)
  2. Checks if the video is created by WhatsApp (-WA- in filename)
  3. Parses the filename for the date (-YYYYMMDD- in filename)
  4. Sets the file creation and modification dates to the date parsed in the filename
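The video steps can be sketched in a similar way. This is a hedged example rather than the repository's code: `restore_video_mtime` and the pattern are hypothetical names of mine, and the example filename assumes videos follow the same date and `WA` layout as images. Note that `os.utime` only sets the access and modification times. The creation time is platform-specific and is not handled here.

```python
import os
import re
from datetime import datetime

# Assumed pattern: any name containing -YYYYMMDD-WAnnnn and ending in .mp4/.3gp.
WHATSAPP_VIDEO = re.compile(r"-(\d{8})-WA\d+.*\.(mp4|3gp)$", re.IGNORECASE)


def restore_video_mtime(path: str) -> bool:
    """Set access/modification times from the date in the filename.

    Returns True if the file was updated, False if the name did not match.
    """
    match = WHATSAPP_VIDEO.search(os.path.basename(path))
    if match is None:
        return False
    try:
        date = datetime.strptime(match.group(1), "%Y%m%d")
    except ValueError:
        return False  # digits that are not a valid date
    timestamp = date.timestamp()  # midnight in the local timezone
    os.utime(path, (timestamp, timestamp))  # (access time, modification time)
    return True
```

Calling `restore_video_mtime("/backup/VID-20191008-WA0003.mp4")` (a made-up path) would set that file's modification date to 8 October 2019.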

Removing duplicates

Taking it one step further, since I'm already knee-deep in WhatsApp media archival, I decided to scratch an itch that I've had for a long, long time: duplication.

To address the issue of the gazillions of duplicated "Good Morning" images, I wrote another Python script that takes a path as an argument, searches for and deletes duplicated files.

Deletion logic

When choosing which files to keep, the script will keep the file with the:

  1. Shortest filename
  2. Smallest filename value

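Filename value here refers to where the filename falls when filenames are sorted in alphabetical order. Since the date comes early in the filename, an earlier date means a smaller value.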

For example, when considering 3 identical "Good Morning" images, sent over 3 different days, with 1 image somehow duplicated:

  1. IMG-20191008-WA0009.jpeg
  2. IMG-20191010-WA0002.jpeg
  3. IMG-20191013-WA0001.jpeg
  4. IMG-20191013-WA0001 (2).jpeg

The script would preserve IMG-20191008-WA0009.jpeg for the following reasons:

  1. It has the smallest filename value as it has the earliest date in the filename
  2. It has the shortest filename, thereby discarding the image with the (2) suffix
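One way to express this rule, assuming the two criteria are applied in the order listed (length first, alphabetical order as the tie-breaker), is a sort key made of a tuple. The function names below are hypothetical and only illustrate the idea.

```python
def pick_file_to_keep(filenames):
    """Pick the shortest name, breaking ties by alphabetical order."""
    return min(filenames, key=lambda name: (len(name), name))


def pick_files_to_delete(filenames):
    """Everything in a group of identical files except the one to keep."""
    keep = pick_file_to_keep(filenames)
    return [name for name in filenames if name != keep]


duplicates = [
    "IMG-20191008-WA0009.jpeg",
    "IMG-20191010-WA0002.jpeg",
    "IMG-20191013-WA0001.jpeg",
    "IMG-20191013-WA0001 (2).jpeg",
]
print(pick_file_to_keep(duplicates))  # IMG-20191008-WA0009.jpeg
```

The first three names have the same length, so the earliest date wins on alphabetical order, while the `(2)` copy is ruled out for being longer.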

Duplication check logic

In case you're interested, this is the algorithm that I employed for finding duplicates:

  1. Checks and remembers the file sizes of all files (~30k)
  2. Finds files with the same file size (~2k)

Among files with the same file size:

  1. Computes the sha1 hash of the first 1kB
  2. Finds files with the same sha1 hash for the first 1kB (~900)

Among files with the same sha1 hash for the first 1kB:

  1. Computes the sha1 hash for the entire file
  2. Finds files with the same sha1 hash (~800)
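The three stages above can be sketched as follows. This is a simplified illustration under my own assumptions, not the repository's implementation: the helper names are made up, and "1kB" is taken to mean 1024 bytes.

```python
import hashlib
import os
from collections import defaultdict


def sha1_of(path, limit=None):
    """SHA-1 of the first `limit` bytes, or of the whole file if limit is None."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        if limit is not None:
            digest.update(handle.read(limit))
        else:
            # Read in 1 MiB chunks so large videos are not loaded in one go.
            for chunk in iter(lambda: handle.read(1 << 20), b""):
                digest.update(chunk)
    return digest.hexdigest()


def group_by(paths, key):
    """Group paths by key(path) and keep only groups with 2 or more members."""
    groups = defaultdict(list)
    for path in paths:
        groups[key(path)].append(path)
    return [group for group in groups.values() if len(group) > 1]


def find_duplicates(paths):
    """Return lists of paths whose contents hash identically."""
    duplicates = []
    # Stage 1: same file size (cheap, no file contents read).
    for same_size in group_by(paths, os.path.getsize):
        # Stage 2: same SHA-1 of the first 1024 bytes.
        for same_head in group_by(same_size, lambda p: sha1_of(p, 1024)):
            # Stage 3: same SHA-1 of the entire file.
            duplicates.extend(group_by(same_head, sha1_of))
    return duplicates
```

Each stage only looks at the survivors of the previous one, which is why the candidate counts in the lists above shrink from step to step.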

Design notes

  • I chose sha1 as it is the fastest hashing algorithm that I know of that's available in Python.
  • The reason for this staged approach as opposed to straight up computing the hash for all files is to reduce wasteful compute.
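  • It is easy to prove that files are different but much harder to prove that they are identical, so the cheap checks (file size, then a hash of the first 1kB) rule out most files before any full hash is computed.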

Try it yourself!

If you, like me, face issues with archiving your WhatsApp media, here's the source code for the Python scripts that I wrote, as detailed above.

GitHub - ikaruswill/whatsapp-media-tools: Restore WhatsApp Media exif dates from filenames and delete duplicated images for archival

They're written in Python 3, so you'll need Python 3 installed on your system.

How to use

Clone the repository

$ git clone https://github.com/ikaruswill/whatsapp-media-tools.git
$ cd whatsapp-media-tools

Create and activate the virtual environment

$ python3 -m venv ./venv
$ source venv/bin/activate

Install dependencies in the virtual environment

$ pip install -r requirements.txt

Run the scripts as detailed in the README.md and profit!

Hopes for the future

I may be optimistic, but I'm still looking forward to the day when WhatsApp backup management becomes a painless process.

Off the top of my head, I can think of several ways forward:

  1. Cloud storage of WhatsApp media (Facebook won't pay I guess)
  2. Better backup management logic whereby the phone maintains a working set of 2 years of media. Older media are archived permanently in Google Drive, retrievable on demand.
  3. Built-in deduplication of media in WhatsApp
  4. Preservation of Exif data (or at least the date and GPS coordinates please!)

Here's hoping that one day, the headache of WhatsApp media storage will go away with a whimper.