Privacy-preserving face recognition

A privacy-first, on-device face recognition system that runs entirely on your smartphone, combining encrypted data, lightweight neural networks and secure device sync to organize your photos without ever sending them to the cloud unencrypted.

written by Simeon Macke on October 14, 2025

screenshot showing zeitkapsl faces in the web app

Recognizing faces and grouping them into people is a common feature of cloud-based photo apps. On platforms like Google Photos, this happens in big data centers with abundant computing resources. Since zeitkapsl encrypts all your photos before uploading them to our cloud storage, we cannot rely on powerful servers. Everything has to happen on your end device, and the results have to remain encrypted and private even when syncing across your devices. And since we made it our mission to reduce electronic waste, the whole face recognition pipeline also needs to run on older mobile devices, not just top-of-the-line flagships.

The resulting challenges delayed the release of zeitkapsl faces. In this post we want to dive into some of those challenges, how we tackled them, and give you some insight into how zeitkapsl faces works under the hood.

Demystifying machine learning: Neural networks & inference

Most of the machine learning algorithms we use for our face recognition pipeline are based on neural networks. Neural networks consist of interconnected nodes that process information. Each node does simple calculations on its input data and passes the result on to the next nodes, until the data reaches the output nodes of the network. The data from the output nodes encodes the desired information, for example the positions of faces in an image.
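As a small illustration (not taken from any of the models we actually use), a single node typically computes a weighted sum of its inputs plus a bias and passes it through a simple non-linear function. The weights below are made-up numbers:

```go
package main

import (
	"fmt"
	"math"
)

// nodeOutput computes what a single node does: a weighted sum of its inputs
// plus a bias, passed through a non-linearity (here a sigmoid).
func nodeOutput(inputs, weights []float64, bias float64) float64 {
	sum := bias
	for i, x := range inputs {
		sum += x * weights[i]
	}
	return 1.0 / (1.0 + math.Exp(-sum)) // sigmoid activation
}

func main() {
	// Three inputs, three learned weights, one bias (all illustrative values).
	fmt.Println(nodeOutput([]float64{0.2, 0.7, 0.1}, []float64{0.4, -1.2, 0.9}, 0.05))
}
```

A real network simply stacks millions of these tiny computations in layers.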

The process of evaluating a piece of data through such a network is called inference. Since those networks consist of millions of nodes and connections (the biggest LLMs even reach trillions of parameters), inference is performance critical. For several years now, even consumer devices like phones and laptops have had specialized hardware that accelerates inference while keeping power consumption minimal. Neither the hardware nor the APIs (programming interfaces) exposed by the different hardware and operating systems are uniform, hence we rely on runtimes for inference.

Thanks to the larger machine learning community, there is already an array of inference runtimes and tools out there. PyTorch and TensorFlow are the two dominant frameworks; most neural-network-based models are published in either the PyTorch or the TensorFlow format. We evaluated LiteRT (TensorFlow's inference engine for mobile devices), ExecuTorch (PyTorch's on-device AI framework) and the onnxruntime. The onnxruntime is a runtime for ONNX, an open format for storing and exchanging neural-network-based machine learning models. The onnxruntime seemed like the most suitable option for us at the time: it appears to be the most mature and consistent option across the platforms we support.

TensorFlow and PyTorch models can be converted to the ONNX format easily. This gives us the most flexibility in using different models and allows us to swap in newer, better ones as they are developed.

High-level overview

Face recognition involves much more than simply running a neural network on an image. It’s a multi-stage pipeline, where each step prepares and refines data for the next one to ensure accuracy and efficiency.

  1. Face Detection: Identify all faces within an image using a trained detection model such as YOLO5Face.

  2. Filtering: Discard low-quality or unrecognizable detections based on confidence, size, sharpness, and orientation.

  3. Resize & Crop: Normalize valid faces to a consistent input size (e.g., 112x112) the neural network can work with.

  4. Face Embedding: Convert each face into a numerical vector (embedding) that represents its unique characteristics.

  5. Clustering: Group similar embeddings together using a density-based algorithm (DBSCAN) to identify which faces likely belong to the same person.

Overview of all face detection steps
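To make this flow more concrete, here is a minimal sketch of how these stages could fit together in Go, the language of our core library. All types and helper functions are hypothetical placeholders that only illustrate the data flow, not our actual implementation:

```go
package pipeline

import "image"

// Detection is one face found by the detector (hypothetical type).
type Detection struct {
	Box        image.Rectangle // bounding box of the face in the image
	Confidence float32         // detection confidence score
}

// Embedding is the numerical fingerprint of a face (MobileFaceNet: 128 values).
type Embedding [128]float32

// Placeholder stages; the real implementations are discussed in the sections below.
func detectFaces(img image.Image) []Detection { return nil }      // 1. YOLO5Face via onnxruntime
func isUsable(d Detection) bool               { return true }     // 2. quality filtering
func cropAndResize(img image.Image, box image.Rectangle, size int) image.Image { return img } // 3. crop & letterbox
func embedFace(face image.Image) Embedding    { return Embedding{} } // 4. MobileFaceNet via onnxruntime

// processImage runs the per-image part of the pipeline; step 5 (clustering)
// happens later, across the embeddings of many images.
func processImage(img image.Image) []Embedding {
	var embeddings []Embedding
	for _, det := range detectFaces(img) {
		if !isUsable(det) {
			continue
		}
		face := cropAndResize(img, det.Box, 112)
		embeddings = append(embeddings, embedFace(face))
	}
	return embeddings
}
```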

First things first: Face detection

The first step is, of course, to detect faces in images. Fortunately for us, face detection is a very well researched topic in computer vision, and there is a wide array of algorithms and approaches. In our very first prototypes, we used the dlib library, a machine learning toolkit with face detection built in. While we were able to build research prototypes quickly (there are bindings to dlib for Go, the programming language our core library is written in), packaging dlib for our mobile apps turned out to be a bit trickier.

We also considered using the face detection models of the machine learning libraries that ship with the respective operating systems (CoreML for the Apple ecosystem, MLKit for Android, etc.). The advantage would be that we keep the size of our app to a minimum, since we don't have to ship our own models and inference engines. However, each of these models works differently, and we want to have the same results on all platforms. In addition, we don't want to depend on more libraries than necessary, especially if they come from Big Tech corporations, which sometimes tend to drop support for their products out of the blue.

Instead, we decided to choose a model that we can run in the onnxruntime, which keeps us flexible if we ever decide to exchange it for another one.

an example of a detected face in an image

After some research and experimentation, we decided to go for YOLO5Face. YOLO5Face is a specialized model for detecting faces, based on YOLOv5, a very commonly used model for object detection. Because it is so widely used, its input and output formats are well documented, which made it a little bit easier to build the pre- and postprocessing pipelines.

Emphasis on a little bit. Doing all of the necessary data preparation still turned out to be tricky, since we were all relatively new to computer vision and machine learning. It took us several attempts to get the image format right, and resizing is just the first step. YOLO5Face expects a 640x640 image in CHW ordering, with pixel values represented as floating-point numbers between 0 and 1. We'll go through that step by step:

  • Resizing is the easiest part. We scale the image so that its long edge fits a 640x640 square exactly, which means we have to fill the remaining part of the square with neutral gray pixels, since we don't want to stretch or crop our image.
  • The next part is the ordering. For each pixel, we have three color channels: red, green and blue. The HWC (height-width-channel) ordering groups the channels per pixel, while CHW (channel-height-width) groups the pixels per channel.
  • The value of each channel of a pixel is represented as a number. Most commonly, this is done by storing one byte per channel (color) per pixel, so each channel gets an integer in the range of 0 to 255. YOLO5Face instead expects a floating-point number between 0 and 1.

Since we are using the onnxruntime, and ONNX supports preprocessing pipelines, we were able to integrate those conversions (HWC to CHW, integer to floating point) into the ONNX file when converting YOLO5Face from PyTorch to ONNX.
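To illustrate what this preprocessing boils down to, here is a simplified sketch of the letterboxing and the HWC-to-CHW/float conversion in Go, using golang.org/x/image/draw for the scaling. This is not our production code; as described above, in our case the format conversions actually live inside the ONNX file itself:

```go
package preprocess

import (
	"image"
	"image/color"

	xdraw "golang.org/x/image/draw"
)

// letterbox scales img so that its longer edge becomes exactly size pixels
// and pads the rest of the size x size square with neutral gray (114 is the
// value commonly used with YOLOv5-style letterboxing).
func letterbox(img image.Image, size int) *image.RGBA {
	dst := image.NewRGBA(image.Rect(0, 0, size, size))
	xdraw.Draw(dst, dst.Bounds(), image.NewUniform(color.RGBA{114, 114, 114, 255}), image.Point{}, xdraw.Src)

	b := img.Bounds()
	w, h := b.Dx(), b.Dy()
	longest := w
	if h > longest {
		longest = h
	}
	scale := float64(size) / float64(longest)
	newW, newH := int(float64(w)*scale), int(float64(h)*scale)

	// Center the scaled image inside the gray square.
	offX, offY := (size-newW)/2, (size-newH)/2
	xdraw.BiLinear.Scale(dst, image.Rect(offX, offY, offX+newW, offY+newH), img, b, xdraw.Over, nil)
	return dst
}

// toCHWFloat converts the RGBA image into a CHW float32 tensor with values in [0, 1].
func toCHWFloat(img *image.RGBA) []float32 {
	b := img.Bounds()
	w, h := b.Dx(), b.Dy()
	out := make([]float32, 3*w*h)
	for y := 0; y < h; y++ {
		for x := 0; x < w; x++ {
			r, g, bl, _ := img.At(b.Min.X+x, b.Min.Y+y).RGBA() // 16-bit channel values
			out[0*w*h+y*w+x] = float32(r) / 65535.0            // red plane
			out[1*w*h+y*w+x] = float32(g) / 65535.0            // green plane
			out[2*w*h+y*w+x] = float32(bl) / 65535.0           // blue plane
		}
	}
	return out
}
```

Because the conversions are baked into the ONNX file, the app itself only has to decode and letterbox the image before handing it to the model.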

When running the raw YOLO5Face model, we don't simply get one result per face; due to the nature of the model, we get many overlapping results for every face. Each result contains a bounding box around the detected face as well as a confidence score. To handle the duplicates, we use an approach called non-maximum suppression, or NMS. NMS eliminates every box that overlaps with another box that has a higher confidence score. Fortunately, we didn't have to implement that from scratch, since ONNX comes with an operator for NMS. Working with ONNX was new territory for us though, so it was still a bit tricky to get to a working ONNX file with all the pre- and postprocessing correctly integrated.
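Even though we rely on the NMS operator that ships with ONNX, the idea is simple enough to sketch in plain Go. This is a greedy variant with hypothetical types, not the exact algorithm the ONNX operator implements:

```go
package nms

import "sort"

// Box is one detection candidate: a bounding box plus a confidence score.
type Box struct {
	X1, Y1, X2, Y2 float32
	Confidence     float32
}

// iou computes the intersection-over-union of two boxes (0 = no overlap, 1 = identical).
func iou(a, b Box) float32 {
	interW := minf(a.X2, b.X2) - maxf(a.X1, b.X1)
	interH := minf(a.Y2, b.Y2) - maxf(a.Y1, b.Y1)
	if interW <= 0 || interH <= 0 {
		return 0
	}
	inter := interW * interH
	areaA := (a.X2 - a.X1) * (a.Y2 - a.Y1)
	areaB := (b.X2 - b.X1) * (b.Y2 - b.Y1)
	return inter / (areaA + areaB - inter)
}

// nonMaxSuppression keeps the highest-confidence box of each group of
// overlapping candidates and drops every box that overlaps a kept box by
// more than the given IoU threshold.
func nonMaxSuppression(boxes []Box, iouThreshold float32) []Box {
	sort.Slice(boxes, func(i, j int) bool { return boxes[i].Confidence > boxes[j].Confidence })
	var kept []Box
	for _, candidate := range boxes {
		suppressed := false
		for _, k := range kept {
			if iou(candidate, k) > iouThreshold {
				suppressed = true
				break
			}
		}
		if !suppressed {
			kept = append(kept, candidate)
		}
	}
	return kept
}

func maxf(a, b float32) float32 {
	if a > b {
		return a
	}
	return b
}

func minf(a, b float32) float32 {
	if a < b {
		return a
	}
	return b
}
```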

YOLO5Face models come in different sizes. The bigger the model, the higher the accuracy, but also the higher the computational cost. The s (small) variant of YOLO5Face turned out to be good enough for our use case. Maybe even too good…

YOLO5Face detecting an unrecognizable, tiny face

Garbage in means garbage out: Filtering faces for face recognition

Since YOLO5Face also detects faces that are unrecognizable, whether due to size, picture quality (blur, lighting, etc.), the angle of the face, or because they are only partly visible, we need to filter the detections before passing them on to our face recognition model. Criteria we currently filter on are the confidence score of the detection result, the size of the detected face, the aspect ratio (which helps us filter out faces that are only partly visible in the image), whether the face is actually facing the camera, and whether the image section is blurry or over-/underexposed.
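A simplified sketch of such a filter, reusing the hypothetical Detection type from the overview sketch above. The thresholds are made-up values for illustration, not the ones we actually ship:

```go
// isUsable applies a few cheap quality checks to a detection before we spend
// time computing an embedding. All thresholds here are illustrative only.
func isUsable(det Detection) bool {
	// Confidence: drop detections the model itself is unsure about.
	if det.Confidence < 0.6 {
		return false
	}
	// Size: tiny faces carry too little information to be recognizable.
	w, h := det.Box.Dx(), det.Box.Dy()
	if w < 32 || h < 32 {
		return false
	}
	// Aspect ratio: faces cut off at the image border tend to produce
	// unusually narrow or wide boxes.
	ratio := float64(w) / float64(h)
	if ratio < 0.5 || ratio > 2.0 {
		return false
	}
	// Pose, sharpness and exposure are checked on the cropped pixels, e.g.
	// via facial landmarks, the variance of a Laplacian filter and a
	// brightness histogram (omitted here).
	return true
}
```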

Getting to the core: Face recognition

Before going into the details, first a short explanation of how face recognition works with models that create embeddings. Embeddings are vectors: pieces of data that consist of a fixed number of values. We call each of those values a dimension. Simplified, one can imagine each dimension representing a feature, for example the relative distance between the eyes or the size of one's nose. Pictures of the same face should produce embeddings that are similar to each other.

As with the face detection models, there are several options for face recognition models. We had a closer look at the models by Insightface, since they seemed promising and deliver good results in Immich. However, the most interesting models are only licensed for non-commercial use, so we had to keep looking.

After researching newer face recognition models, we came to the conclusion that MobileFaceNet, released in 2018, is still the best option for us. The MobileFaceNet inference is ridiculously fast (taking only a few milliseconds per face on a mid-range smartphone) and the quality of the face recognition is adequate.

MobileFaceNet expects input images in a fixed size (112x112 pixels), so we first crop the faces out of the original images using the bounding boxes from the face detection and resize them to 112x112, again filling the unused space of the square with gray pixels. Converting MobileFaceNet from PyTorch to ONNX was a small task after our struggles with YOLO5Face, since MobileFaceNet doesn't need much data pre- or postprocessing. Then we just have to extract the resulting embedding and pass it from the respective platform to our core library. (Extracting the embedding was actually a bit tricky in the iPhone app: the onnxruntime has no native bindings for Swift, the programming language we use to develop our iOS app, only for Objective-C, so we had to write a memory decoder "by hand". Passing the data between the mobile platforms and Go, which we use in our core library, was also somewhat tricky, but interoperability is a topic for a future blog post.)
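The crop itself is just a rectangle taken out of the decoded image; the letterbox helper sketched in the detection section can then be reused with a 112x112 target. Again only a sketch, assuming the image was decoded into an *image.RGBA:

```go
// cropFace cuts the detected bounding box out of the original image and
// letterboxes it to the fixed 112x112 input size MobileFaceNet expects.
func cropFace(img *image.RGBA, box image.Rectangle) *image.RGBA {
	face := img.SubImage(box.Intersect(img.Bounds())) // stay within the image
	return letterbox(face, 112)                       // same helper as for YOLO5Face, different size
}
```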

Bringing together what belongs together: Clustering

Now we have embeddings that we can compare to each other. If our embeddings only had three dimensions, we could imagine an embedding as a point floating in a room, each dimension representing its position along one of the three axes of the room. The distance between two points would then (in the case of the Euclidean distance) be the length of the line that directly connects the two points. In the case of MobileFaceNet, our embeddings don't just have three dimensions, but 128. Impossible to visualize, but fortunately, the math doesn't change with the additional dimensions.
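The distance itself is the same formula regardless of the number of dimensions; for two 128-dimensional embeddings it is simply a sum over the squared differences:

```go
import "math"

// euclideanDistance measures how far apart two 128-dimensional face
// embeddings are; the smaller the distance, the more similar the faces.
func euclideanDistance(a, b [128]float32) float64 {
	var sum float64
	for i := range a {
		d := float64(a[i]) - float64(b[i])
		sum += d * d
	}
	return math.Sqrt(sum)
}
```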

Our next challenge is to automatically group those embeddings into clusters of similar embeddings (embeddings that lie close to each other), with each cluster ideally representing one person in your photo library. This might sound relatively simple at first, but there are some additional challenges:

Since most photo libraries grow continuously, we can't simply cluster all found faces once; we have to grow our clusters along with your photo library. Also, given all our limitations (especially the computational constraints of having to run everything locally on your smartphone), the assignments will never be perfect: there will always be some faces assigned to the wrong person, or multiple face clusters belonging to the same person. Our clustering mechanism must be flexible enough to allow such corrections and ideally "learn" from them to improve the assignment of new photos to the existing person clusters. Finally, we don't know in advance how many distinct persons are in a library, which rules out many common clustering algorithms like k-means.

We decided to use an adapted version of the DBSCAN clustering algorithm. DBSCAN is great for finding clusters of complex shapes, since it uses the density of points to find clusters. If you're interested in how that works, here is a great video explainer which visualizes the concept nicely.

In a nutshell, the original DBSCAN algorithm takes each point (in our case, an embedding) and checks whether it has a certain number of neighbors (the minimum density parameter) within a defined distance (the distance threshold, or epsilon). If it does, it is considered a core point and starts a new cluster. All neighbors of the core point within the distance threshold are added to that cluster as well. Every point that is not close to a core point is considered noise, which in our context usually means that the face doesn't appear often in the photo library and might just belong to some random person in the background of an image.

One limitation of DBSCAN is that it is not incremental: you cannot add new points after the clustering has been done once. We had to extend DBSCAN with the capability to add new points to existing clusters. Here, the great work of the open source community inspired us, and we decided to implement an approach similar to the one Immich uses. Every night, if your phone is plugged in, we take all the faces which haven't yet found a cluster and try to either find an existing cluster for them or build a new one, if that is possible with the distance threshold and minimum density parameters we set. If a face doesn't find a cluster this way, it remains unclustered and we try again once new faces are added to the library.
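A heavily simplified sketch of that nightly pass, reusing the euclideanDistance function from above. All names are hypothetical, there is no spatial index, and none of the bookkeeping a real implementation needs (such as marking newly consumed points):

```go
// Embedding is a 128-dimensional face embedding.
type Embedding [128]float32

// assignOrCluster tries to place every still-unclustered embedding: first
// into an existing cluster with enough close members, otherwise into a new
// cluster seeded from close unclustered neighbors. Everything else stays
// unclustered until the next run.
func assignOrCluster(unclustered []Embedding, clusters map[int][]Embedding, eps float64, minPts int) []Embedding {
	var remaining []Embedding
	nextID := len(clusters)

	for _, e := range unclustered {
		// 1. Try to attach the embedding to an existing cluster.
		attached := false
		for id, members := range clusters {
			close := 0
			for _, m := range members {
				if euclideanDistance(e, m) <= eps {
					close++
				}
			}
			if close >= minPts {
				clusters[id] = append(clusters[id], e)
				attached = true
				break
			}
		}
		if attached {
			continue
		}

		// 2. Try to seed a new cluster from the remaining unclustered points.
		var neighbors []Embedding
		for _, other := range unclustered {
			if other != e && euclideanDistance(e, other) <= eps {
				neighbors = append(neighbors, other)
			}
		}
		if len(neighbors)+1 >= minPts {
			clusters[nextID] = append([]Embedding{e}, neighbors...)
			nextID++
		} else {
			remaining = append(remaining, e)
		}
	}
	return remaining
}
```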

Finding the correct parameters for DBSCAN is very tricky, for a couple of reasons. First, the density of our collection of face embeddings naturally increases with the number of faces, making fixed density parameters suboptimal across libraries of different sizes. We will continue to experiment with different, maybe even dynamic, parameters.

Second, every photo library is different. Due to our commitment to privacy and end-to-end encryption, we of course cannot experiment on the libraries of our users, even if some of them would consent to that. We did numerous experiments on our own libraries, but we cannot cover the great diversity among our users' photo libraries. Hence, we rely on your feedback: do not hesitate to contact us and share how accurately zeitkapsl organizes your library.

Always at your fingertips, but only yours: Encryption & Sync

Remember, the whole face recognition pipeline happens locally on your device. This means that initially, all the face embeddings, as well as the information about which person each face belongs to, are stored on this single device only. Since organizing faces and people is best done in our web app (which does not support running the face recognition pipeline itself for the time being), we need a way of synchronizing the results of the face recognition across your devices that optimally preserves your privacy. Ideally, we also want to avoid leaking high-level metadata, such as how many photos contain the same person or how many people appear in a picture, even though the metadata about the persons themselves (like their names) is encrypted with the same zero-knowledge end-to-end encryption that we use for all of your photos and videos.

To ensure this, we take advantage of the fact that not just your photos and videos are stored encrypted, but also their metadata, like where a photo was taken or which device you used to film a video. When we detect a face in an image, we add the resulting embedding, as well as the information about which person the face belongs to, to the metadata of the image. This information is then encrypted together with the rest of the image's metadata, resulting in a single blob of data that is indistinguishable from random noise. Only with your cryptographic key can this data blob be decrypted, and that happens on your device.
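Conceptually, the face data simply becomes part of the per-image metadata before that metadata is sealed as a whole. Here is a minimal sketch of that idea, assuming a JSON metadata structure and AES-GCM with a key derived from your account keys; both are assumptions for illustration, and our actual format and key management differ in the details:

```go
package metadata

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"encoding/json"
)

// PhotoMetadata bundles everything we know about an image, including face
// embeddings and person assignments, into one structure. The field layout
// here is illustrative, not our actual schema.
type PhotoMetadata struct {
	TakenAt  string      `json:"taken_at,omitempty"`
	Location string      `json:"location,omitempty"`
	Faces    []FaceEntry `json:"faces,omitempty"`
}

type FaceEntry struct {
	Embedding [128]float32 `json:"embedding"`
	PersonID  string       `json:"person_id,omitempty"`
}

// sealMetadata serializes the metadata and encrypts it into a single blob
// that is indistinguishable from random noise without the key.
func sealMetadata(meta PhotoMetadata, key []byte) ([]byte, error) {
	plain, err := json.Marshal(meta)
	if err != nil {
		return nil, err
	}
	block, err := aes.NewCipher(key) // key derived from your account keys (assumption)
	if err != nil {
		return nil, err
	}
	gcm, err := cipher.NewGCM(block)
	if err != nil {
		return nil, err
	}
	nonce := make([]byte, gcm.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		return nil, err
	}
	// Prepend the nonce so the blob can be decrypted later with the same key.
	return gcm.Seal(nonce, nonce, plain, nil), nil
}
```

Because the embedding and person assignment live inside this one encrypted blob, syncing it to our servers reveals nothing about the faces it describes.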

We already use a similar approach for other search labels that are extracted from photos and videos on your device. Combining all the metadata this way has another privacy advantage: even from the size of the encrypted metadata blob, we cannot deduce what an image contains. A large metadata blob can have many causes: maybe the photo it belongs to contains a lot of text, or many people, or the camera you used to take it just has a very long name… we could only guess.

Wrapping it up

We hope this article gave you an insight both into how face recognition works in zeitkapsl and into how we implement new features without compromising on our promises: finding innovative but sustainable solutions that secure your privacy and let you keep control of your data.
