Restoring WhatsApp media exif dates and fixing duplicates
WhatsApp ushered the world into the age of cross-platform multimedia instant messaging back in January 2009.
To be fair, multimedia instant messaging was already available on PCs as early as 1998 with the likes of Yahoo Messenger and MSN Messenger. I recall, with much amusement, having to rush home right after school so that I could continue chatting with my friends on my computer.
On mobile we had the SMS (Short Message Service) and later on, the notoriously expensive MMS (Multimedia Messaging Service). Even today, it costs a whopping S$0.35 (US$0.26) to send a single MMS in Singapore.
With the advent of WhatsApp and other cross-platform multimedia instant messaging services, it became so easy to send photos, videos, and audio to someone else that we've forgotten how annoying it was to do so just slightly over a decade ago.
The Age of Media
As much as I'd like to praise WhatsApp for their pivotal role in changing the way we connect with others, WhatsApp has also been the bane of all smartphones, especially so in the last few years as people generated more and more media per day.
Particularly detestable are "Good Morning" images that are sent by the same individuals in multiple group chats. The resulting media duplication serves no purpose other than to take up precious storage space on my phone, thereby shortening its effective lifespan.
Storage woes
I'm sure that at least once, your phone has run out of storage space and, on checking what was taking up so much of it, you found something like the following:
Despite my best efforts to be prudent with my WhatsApp media downloads, WhatsApp still takes up 18GB of storage. The next largest app is Google at 0.9GB, occupying over 19 times less space, and the other apps are even smaller than that! For a budget smartphone with around 32GB of storage available, this is pretty much a death sentence.
Backing up to free storage
As much as I'd like to discard my WhatsApp media, life as we know it has become so intertwined with WhatsApp that discarding it would be akin to discarding a decade of precious moments with my family and friends.
This becomes especially important if group chats contain photos or videos of loved ones who have passed on. In my case, it was my grandfather, who succumbed to COVID-19 in August 2021. I'd like to keep all of my family's moments with him, recorded in WhatsApp, for as long as I live, and perhaps one day peruse them with my children and speak of his strength and bravery.
To ensure the longevity of not just my phone but those of my parents, aunts and uncles, I had to back up the media somewhere safe on my NAS and delete the media from the WhatsApp folder on all phones.
At the same time, I had to ensure that the media could be viewed without too much inconvenience, so I decided to host an instance of Photoview for my family to browse their own WhatsApp media backup.
Where is my Exif?!
I imported the WhatsApp photos and videos into Photoview and it was at this point I realized that something was amiss.
All the photos were shown to be taken today (October 17, 2021).
There was no order whatsoever differentiating photos taken in 2013 from photos taken in 2021.
Checking one of the WhatsApp photos in Preview on macOS revealed that the Exif data was missing, as shown by the missing Exif tab in the Inspector window.
A quick search on the internet confirmed my fears: WhatsApp not only compresses the images, but also discards all Exif data in a bid to save storage space.
Well thanks WhatsApp for being so considerate about my phone's storage to the point where you discard such precious metadata to save a couple of bytes.
But all hope is not lost yet.
Restoring Exif data
Looking closely at the file naming scheme of WhatsApp media, I realized that the date is still preserved in the filename. For example, in one of the images, IMG-20191008-WA0009.jpeg, we can see that the photo was taken on 2019-10-08 with WhatsApp (WA), and was the 9th photo of the day (0009). This means that the file naming scheme is as follows:
IMG-{YYYYMMDD}-WA{monotonic-integer}.{jpg/jpeg}
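As a rough illustration of that scheme (a sketch, not the script from the repository), the pattern can be expressed as a regular expression. The function name and the decision to reject impossible dates are assumptions made for this example.

```python
import re
from datetime import date, datetime

# Pattern assumed from the naming scheme above: IMG-YYYYMMDD-WAnnnn.jpg/.jpeg
WHATSAPP_IMAGE = re.compile(r"^IMG-(\d{8})-WA(\d+)\.jpe?g$", re.IGNORECASE)


def parse_whatsapp_image_name(filename):
    """Return (date, sequence number), or None if the name does not match."""
    match = WHATSAPP_IMAGE.match(filename)
    if match is None:
        return None
    try:
        taken_on = datetime.strptime(match.group(1), "%Y%m%d").date()
    except ValueError:
        # Eight digits that do not form a real calendar date, e.g. 20191340.
        return None
    return taken_on, int(match.group(2))


print(parse_whatsapp_image_name("IMG-20191008-WA0009.jpeg"))
```

Because the pattern is anchored at both ends, copies with a suffix such as " (2)" do not match; loosening the pattern would be needed to cover those.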
With that knowledge in hand, I wrote a Python script that accepts a path as an argument and for each file in the path:
- Checks if the file is an image (.jpg or .jpeg)
- Checks if the image is created by WhatsApp (-WA in filename)
- Parses the filename for the date (-YYYYMMDD- in filename)
- Reads the Exif data of the image (if it still exists)
- Adds the date into the Exif data of the image (set to 00:00hrs)
- Writes the Exif data into the image
After running the script, the Exif tab returns in the Inspector window, with the Date Time Original field.
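The filtering and date steps from the list above can be sketched with the standard library alone. This is an illustration under assumptions, not the actual script: the function names are made up, and the final step of writing the value into the image is left out because it needs an Exif library. Exif stores Date Time Original as a string in the form YYYY:MM:DD HH:MM:SS, which is what the sketch produces.

```python
import re
from datetime import datetime
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg"}
# Assumed marker of a WhatsApp file: an 8-digit date followed by "-WA".
DATE_IN_NAME = re.compile(r"-(\d{8})-WA")


def exif_datetime_from_name(filename):
    """Return the 'YYYY:MM:DD HH:MM:SS' string for a WhatsApp image, else None."""
    if Path(filename).suffix.lower() not in IMAGE_SUFFIXES:
        return None
    match = DATE_IN_NAME.search(filename)
    if match is None:
        return None
    try:
        day = datetime.strptime(match.group(1), "%Y%m%d")
    except ValueError:
        return None
    # strptime leaves the time at midnight, matching the 00:00hrs convention.
    return day.strftime("%Y:%m:%d %H:%M:%S")


def plan_exif_dates(folder):
    """Map each WhatsApp image in `folder` to the timestamp it should receive."""
    plan = {}
    for path in sorted(Path(folder).iterdir()):
        stamp = exif_datetime_from_name(path.name)
        if path.is_file() and stamp is not None:
            plan[path.name] = stamp
    return plan
```

Producing a plan first, and writing in a second pass, makes it easy to review what would change before any image is touched.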
What about video dates?
Video files do not use Exif as their metadata store, so Photoview checks the created/modified dates and uses them as the video date. Unfortunately, backing up and restoring WhatsApp media tends to butcher those fields, so that bit of information is lost as well.
To fix this issue for videos, in the same Python script, I wrote the following logic that handles video files separately:
- Checks if the file is a video (.mp4 or .3gp)
- Checks if the video is created by WhatsApp (-WA in filename)
- Parses the filename for the date (-YYYYMMDD- in filename)
- Sets the file creation and modification dates to the date parsed in the filename
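A minimal sketch of the video steps, again as an illustration and not the repository's code. It uses os.utime, which sets only the access and modification times; setting the creation time is platform-specific and is not attempted here. The function name is an assumption.

```python
import os
import re
from datetime import datetime

VIDEO_SUFFIXES = (".mp4", ".3gp")
# Assumed marker of a WhatsApp file: an 8-digit date followed by "-WA".
DATE_IN_NAME = re.compile(r"-(\d{8})-WA")


def set_video_date_from_name(path):
    """Set access/modification times from the filename date; return True on change."""
    name = os.path.basename(path)
    match = DATE_IN_NAME.search(name)
    if not name.lower().endswith(VIDEO_SUFFIXES) or match is None:
        return False
    try:
        day = datetime.strptime(match.group(1), "%Y%m%d")
    except ValueError:
        return False
    timestamp = day.timestamp()  # local midnight of that day
    os.utime(path, (timestamp, timestamp))  # (access time, modification time)
    return True
```

Files that are not videos, or whose names carry no date, are left untouched and reported with False.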
Removing duplicates
Taking it one step further, since I'm already knee-deep in WhatsApp media archival, I decided to scratch an itch that I had for a long, long time: duplication.
To address the issue of the gazillions of duplicated "Good Morning" images, I wrote another Python script that takes a path as an argument, then searches for and deletes duplicate files.
Deletion logic
When choosing which files to keep, the script will keep the file with the:
- Shortest filename
- Smallest filename value
Filename value here refers to the sorting of filenames by alphabetical order.
For example, when considering 3 identical "Good Morning" images, sent over 3 different days, with 1 image somehow duplicated:
IMG-20191008-WA0009.jpeg
IMG-20191010-WA0002.jpeg
IMG-20191013-WA0001.jpeg
IMG-20191013-WA0001 (2).jpeg
The script would preserve IMG-20191008-WA0009.jpeg for the following reasons:
- It has the smallest filename value as it has the earliest date in the filename
- It has the shortest filename, thereby discarding the image with the (2) suffix
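The rule above can be written as a single sort key. This is a sketch that assumes length is compared first and alphabetical order breaks ties; the function names are made up for illustration.

```python
def choose_file_to_keep(filenames):
    """Pick the shortest name, breaking ties by alphabetical order."""
    return min(filenames, key=lambda name: (len(name), name))


def files_to_delete(filenames):
    """Return every name in the duplicate group except the one to keep."""
    keep = choose_file_to_keep(filenames)
    return [name for name in filenames if name != keep]


group = [
    "IMG-20191008-WA0009.jpeg",
    "IMG-20191010-WA0002.jpeg",
    "IMG-20191013-WA0001.jpeg",
    "IMG-20191013-WA0001 (2).jpeg",
]
print(choose_file_to_keep(group))  # IMG-20191008-WA0009.jpeg
```

Since the date sits at the front of the filename, alphabetical order among names of equal length is also chronological order.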
Duplication check logic
In case you're interested, this is the algorithm that I employed for finding duplicates:
- Checks and remembers the file sizes of all files (~30k)
- Finds files with the same file size (~2k)
Among files with the same file size:
- Computes the sha1 hash of the first 1kB
- Finds files with the same sha1 hash for the first 1kB (~900)
Among files with the same sha1 hash for the first 1kB:
- Computes the sha1 hash for the entire file
- Finds files with the same sha1 hash (~800)
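The three stages can be sketched as nested grouping passes. This is an illustrative reconstruction from the description above, not the repository's code: the helper names are hypothetical, and 1kB is taken to mean 1024 bytes.

```python
import hashlib
import os
from collections import defaultdict


def sha1_of(path, limit=None, chunk_size=65536):
    """SHA-1 of the first `limit` bytes of a file, or of the whole file if None."""
    digest = hashlib.sha1()
    remaining = limit
    with open(path, "rb") as handle:
        while remaining is None or remaining > 0:
            size = chunk_size if remaining is None else min(chunk_size, remaining)
            chunk = handle.read(size)
            if not chunk:
                break
            digest.update(chunk)
            if remaining is not None:
                remaining -= len(chunk)
    return digest.hexdigest()


def group_by(paths, key):
    """Group paths by key(path) and keep only groups with more than one member."""
    groups = defaultdict(list)
    for path in paths:
        groups[key(path)].append(path)
    return [group for group in groups.values() if len(group) > 1]


def find_duplicates(paths):
    """Return groups of paths that share a size and a full-file SHA-1."""
    duplicates = []
    for same_size in group_by(paths, os.path.getsize):
        # Stage 2: hash only the first 1024 bytes (assumed meaning of "1kB").
        for same_head in group_by(same_size, lambda p: sha1_of(p, limit=1024)):
            # Stage 3: hash the whole file, only for the survivors.
            duplicates.extend(group_by(same_head, sha1_of))
    return duplicates
```

Each stage only sees the survivors of the previous one, so the expensive full-file hash runs on a small fraction of the files.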
Design notes
- I chose sha1 as it is the fastest hashing algorithm that I know of that's available in Python.
- The reason for this staged approach, as opposed to straight up computing the hash for all files, is to reduce wasteful compute.
- It is easy to prove that files are different but much harder to prove that files are identical.
Try it yourself!
If you, like me, face issues with archival of your WhatsApp media, here's the source code for the Python scripts that I wrote, detailed above.
They're written in Python 3, so these instructions assume that you already have Python 3 installed on your system.
How to use
Clone the repository
$ git clone https://github.com/ikaruswill/whatsapp-media-tools.git
$ cd whatsapp-media-tools
Create and activate the virtual environment
$ python3 -m venv ./venv
$ source venv/bin/activate
Install dependencies in the virtual environment
$ pip install -r requirements.txt
Run the script as detailed in the README.md and profit!
Hopes for the future
I may be optimistic, but I'm still looking forward to the day when WhatsApp backup management becomes a painless process.
Off the top of my head, I can think of several ways forward:
- Cloud storage of WhatsApp media (Facebook won't pay I guess)
- Better backup management logic whereby the phone maintains a working set of 2 years of media. Older media are archived permanently in Google Drive, retrievable on demand.
- Built-in deduplication of media in WhatsApp
- Preservation of Exif data (or at least the date and GPS coordinates please!)
Here's hoping that one day, the headache of WhatsApp media storage will go away with a whimper.