MediaPipe tutorial: Find memes that match your facial expression
A post by Pierre Fenoll, Senior Lead Back-End Engineer at Powder.gg
Introduction
Powder.gg is a startup located in the beautiful Marais district of Paris. We are working hard to help gamers edit their highlights with machine learning on mobile. AI, memes, sound, and visual effects make up the toolkit we aim to provide to people all around the world to help them understand each other better. Our team has been using MediaPipe extensively to create our machine-learning-enabled video editing app for gamers. We created this tutorial to show how you can use MediaPipe to create your next ML-enabled app.
Goals and requirements
Say you're chatting on your phone with some friends. Your friend Astrid wrote something amusing and you want to post a funny image in reply. There's this picture you really want to post, but it's not on your phone, and now you're just trying different words in your search engine of choice without any success. The image is so clear in your mind, yet the words to look it up just don't come out right. If only you could make the same hilarious face at your phone and have it find the picture for you!
Let's build a computer vision pipeline that compares a face captured by a phone's front camera (for instance) with a collection of images each containing at least one face: a collection of Internet memes.
Here's the app we're building.
Machine Learning model for our pipeline
We need a machine learning model that can determine how similar the facial expressions in two images are. The emotion/facial-expression recognition module we use is based on a 2019 paper by Raviteja Vemulapalli and Aseem Agarwala from Google AI, titled A Compact Embedding for Facial Expression Similarity. At Powder, we re-implemented the approach described in that paper. We then improved on it using knowledge distillation, a technique whereby a (usually smaller) student network is trained to mimic the predictions made by a teacher network. We found that knowledge distillation yields an even more accurate model, and that incorporating millions of additional unlabelled images into the distillation process improves performance even further.
Specifically, the original Google paper reported 81.8% "triplet prediction accuracy" for their model on their face-triplet dataset (the fraction of triplets for which the model picks the same most-similar pair of faces as the human annotators). Our re-implementation of this approach yielded closer to 79% accuracy, with the drop likely due to our not-quite-complete reproduction of their dataset (some source images having been removed from Flickr). Using knowledge distillation with additional unlabelled data, we were then able to improve this score to 85%.
Our initial training pipeline was written in PyTorch. Since we wanted to build our inference pipelines with MediaPipe, which has out-of-the-box support for TensorFlow and TFLite models, we quickly migrated our training pipeline from PyTorch to TensorFlow. For this tutorial, we are publicly releasing a distilled MobileNetV2 version of our TFLite model with accuracy closer to 75%. This model should be sufficient to demonstrate the capabilities of MediaPipe and to have a bit of fun matching memes to faces.
For a quick introduction to MediaPipe, see the official MediaPipe documentation.
Prototyping the pipeline on the desktop
To prototype the MediaPipe inference pipeline that finds memes with facial expressions similar to those in a video, we started with a desktop prototype, since iteration is much faster than with a mobile device in the loop. Once the desktop prototype works, we can optimize for our main target platform: Apple iPhones. We start by creating a C++ demonstration program similar to the provided desktop examples and build our graphs from there.
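For reference, such a desktop driver program is only a few dozen lines of C++. The sketch below closely mirrors MediaPipe's hello-world desktop example; a trivial PassThroughCalculator graph stands in for our facial-search graph, which additionally needs camera frames and side packets.

// Minimal desktop driver modeled on MediaPipe's hello-world example; the
// pass-through graph below stands in for the facial-search graph.
#include <string>

#include "mediapipe/framework/calculator_graph.h"
#include "mediapipe/framework/port/logging.h"
#include "mediapipe/framework/port/parse_text_proto.h"
#include "mediapipe/framework/port/status.h"

namespace mediapipe {

absl::Status RunGraph() {
  CalculatorGraphConfig config =
      ParseTextProtoOrDie<CalculatorGraphConfig>(R"pb(
        input_stream: "in"
        output_stream: "out"
        node {
          calculator: "PassThroughCalculator"
          input_stream: "in"
          output_stream: "out"
        }
      )pb");

  CalculatorGraph graph;
  MP_RETURN_IF_ERROR(graph.Initialize(config));
  ASSIGN_OR_RETURN(OutputStreamPoller poller,
                   graph.AddOutputStreamPoller("out"));
  MP_RETURN_IF_ERROR(graph.StartRun({}));

  // Feed a few packets, then close the input stream so the graph can finish.
  for (int i = 0; i < 3; ++i) {
    MP_RETURN_IF_ERROR(graph.AddPacketToInputStream(
        "in", MakePacket<std::string>("hello").At(Timestamp(i))));
  }
  MP_RETURN_IF_ERROR(graph.CloseInputStream("in"));

  Packet packet;
  while (poller.Next(&packet)) LOG(INFO) << packet.Get<std::string>();
  return graph.WaitUntilDone();
}

}  // namespace mediapipe

int main(int argc, char** argv) {
  google::InitGoogleLogging(argv[0]);
  return mediapipe::RunGraph().ok() ? 0 : 1;
}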
Although it is possible to create a repository separate from MediaPipe (as demonstrated here), we prefer to develop our graphs and calculators in our own fork of the project. This way upgrading to the latest version of MediaPipe is just a git-rebase away.
To avoid potential conflicts with MediaPipe code, we replicate the folder architecture under a subdirectory of mediapipe: graphs, calculators, and the required BUILD files.
MediaPipe comes with many calculators that are documented, tested and sometimes optimized for multiple platforms so we try as much as possible to leverage them. When we really need to create new calculators we try to design them so that they can be reused in different graphs. For instance, we designed a calculator that displays its input stream to an OpenCV window and which closes on a key press. This way we can quickly plug into various parts of a pipeline and glance at the images streaming through.
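As an illustration, here is a minimal sketch of such a debug calculator, written against the current CalculatorBase API. The class name and window title are ours for illustration and not necessarily those used in our fork.

// Illustrative debug calculator: shows each incoming ImageFrame in an OpenCV
// window and stops the graph once any key is pressed.
#include "mediapipe/framework/calculator_framework.h"
#include "mediapipe/framework/formats/image_frame.h"
#include "mediapipe/framework/formats/image_frame_opencv.h"
#include "mediapipe/framework/port/opencv_highgui_inc.h"
#include "mediapipe/framework/port/status.h"
#include "mediapipe/framework/tool/status_util.h"

namespace mediapipe {

class DebugDisplayCalculator : public CalculatorBase {
 public:
  static absl::Status GetContract(CalculatorContract* cc) {
    cc->Inputs().Index(0).Set<ImageFrame>();
    return absl::OkStatus();
  }

  absl::Status Process(CalculatorContext* cc) override {
    const auto& frame = cc->Inputs().Index(0).Get<ImageFrame>();
    // MatView wraps the ImageFrame data without copying. ImageFrames are
    // usually RGB while OpenCV expects BGR, so colors may look swapped.
    cv::Mat view = formats::MatView(&frame);
    cv::imshow("debug", view);
    // Returning StatusStop() on a key press tells the framework to wind the
    // graph down.
    if (cv::waitKey(1) != -1) return tool::StatusStop();
    return absl::OkStatus();
  }
};
REGISTER_CALCULATOR(DebugDisplayCalculator);

}  // namespace mediapipe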
MediaPipe Graph: Face Detection followed by Face Embedding
We construct a graph that finds faces in a video and takes the first detection, then extracts a 64-dimensional vector describing that face. Finally, this embedding goes through a custom-made calculator that compares it against a large set of vector-image pairs and returns the top 3 matches by Euclidean distance.
Face detection is performed with MediaPipe's own SSD-based selfie face detector graph. We turned the gist of it into a subgraph so that our graph, as well as some of our other graphs, can easily link against it. Subgraphs are great for reusability and can be thought of as modules. We tend to create one per model, just for the benefit of encapsulation.
Embeddings are extracted within the FaceEmbeddingsSubgraph using our TensorFlow Lite model and streamed out as a vector of 64 floats. This subgraph takes care of resizing the input image and converting to and from TensorFlow Lite tensors.
Our ClosestEmbeddingsCalculator then takes this vector, computes its distance to each vector in the embedded database, and streams out the top matches as Classifications, with the distance as the score. The database is loaded as a side packet and can be generated with the help of a shell script. A sample database of around 400 entries is provided.
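Stripped of the calculator plumbing, the matching boils down to a nearest-neighbour search over the embedding database. Here is a minimal, self-contained sketch of that logic, with illustrative types; the real calculator wraps this in MediaPipe inputs, outputs, and side packets.

// Simplified sketch of the matching logic: rank database entries by Euclidean
// distance to the query embedding and keep the k closest ones.
#include <algorithm>
#include <cmath>
#include <string>
#include <utility>
#include <vector>

struct Entry {
  std::string image_path;        // Meme the embedding was computed from.
  std::vector<float> embedding;  // 64 floats, from the same model as the query.
};

std::vector<std::pair<float, std::string>> TopMatches(
    const std::vector<float>& query, const std::vector<Entry>& database,
    size_t k = 3) {
  std::vector<std::pair<float, std::string>> scored;
  scored.reserve(database.size());
  for (const Entry& entry : database) {
    float sum = 0.f;
    for (size_t i = 0; i < query.size(); ++i) {
      const float d = query[i] - entry.embedding[i];
      sum += d * d;
    }
    scored.emplace_back(std::sqrt(sum), entry.image_path);
  }
  // A partial sort is enough: we only need the k smallest distances.
  k = std::min(k, scored.size());
  std::partial_sort(scored.begin(), scored.begin() + k, scored.end());
  scored.resize(k);
  return scored;  // In the calculator, each pair becomes a Classification.
}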
Issues we faced
You may have noticed in the above GPU FaceEmbeddingsSubgraph the use of the GpuBufferToImageFrame calculator, which moves an image from the GPU to the host CPU. It turns out our model uses operations that are not supported by the current version of the TensorFlow Lite GPU delegate. Running it on GPU would always return the same values and output the following warnings when initializing the graph:
ERROR: Next operations are not supported by GPU delegate:
MAXIMUM: Operation is not supported.
MEAN: Operation is not supported.
First 0 operations will run on the GPU, and the remaining 81 on the CPU.
There are multiple ways to fix this:
- You can re-train your model such that the generated tflite file uses only supported operations.
- You can implement the operations in C++ and use MediaPipe's TfLiteCustomOpResolverCalculator to provide them to the interpreter.
- If your model runs fine on CPU, you can simply ensure that inference always runs on CPU by moving its inputs there. Moving data from the GPU to the host CPU is not free, however, so inference may be a bit slower.
We opted for the simplest option for this tutorial, as the runtime cost appeared minimal. We may provide a model that runs on GPU in the future.
Running on the desktop
First, make sure the images of our database are on your system. They come from https://imgflip.com/ and were selected on the criterion that each contains at least one human face. Most are photographs, but there are a few drawings as well. Download them with:
python mediapipe/examples/facial_search/images/download_images.py
Then you are free to re-generate the embeddings data. This process runs our graph on the images we downloaded to extract one float vector per image. These vectors are then written to a C++ header file that constitutes the database. Keep in mind that floating-point precision can differ slightly from one platform to another, so you might want to generate the embeddings on the target platform. Run:
./mediapipe/examples/facial_search/generate_header.sh mediapipe/examples/facial_search/images
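To give a rough idea of what the script emits, the generated header is essentially a constant table mapping each meme image to its 64-float embedding. The excerpt below is purely hypothetical (names, paths, and values are invented for illustration; the actual layout is defined by generate_header.sh):

// Hypothetical excerpt of the generated header: one entry per downloaded meme.
#include <string>
#include <vector>

struct EmbeddedImage {
  std::string path;              // Meme image the embedding was computed from.
  std::vector<float> embedding;  // 64-dimensional face embedding.
};

inline const std::vector<EmbeddedImage>& GetEmbeddings() {
  static const std::vector<EmbeddedImage> kEmbeddings = {
      {"images/meme_0001.jpg", {0.0132f, -0.0871f /* ... 62 more floats */}},
      {"images/meme_0002.jpg", {-0.0415f, 0.0290f /* ... 62 more floats */}},
      // ... roughly 400 entries in the sample database.
  };
  return kEmbeddings;
}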
Now run the demo on CPU with:
And run the demo on GPU with:
Exit the program by pressing any key.
Going from desktop to app
Tulsi can be used to generate Xcode application projects, and Bazel can be used to compile iOS apps from the command line. As iOS app developers, we prefer to be able to import our graphs into our existing Xcode projects. We also plan to develop Android apps in the near future, so we are betting on MediaPipe's multi-platform support to reduce code duplication.
We designed an automated way to package our graphs as iOS frameworks. It runs as part of our Continuous Integration pipeline on macOS GitHub Actions and relies on Bazel and some scripting. Since framework compilation and import are two separate steps, our mobile developers needn't worry about the C++ and Objective-C parts of the graphs and can build their apps in Swift.
General steps
1. Check out our example code:
git clone --single-branch --branch facial-search https://github.com/fenollp/mediapipe.git && cd mediapipe
2. Create a new āSingle View Appā Swift app in Xcode
3. Delete these files from the new project: (Move to Trash)
- AppDelegate.swift
- ViewController.swift
4. Copy these files to your app from mediapipe/examples/facial_search/ios/: (if asked, do not create a bridging header)
- AppDelegate.swift
- Cameras.swift
- DotDot.swift
- FacialSearchViewController.swift
5. Edit your app's Info.plist:
- Create the key "NSCameraUsageDescription" with value: "This app uses the camera to demonstrate live video processing."
6. Edit your Main.storyboard's custom class, setting it to FacialSearchViewController (in the Identity inspector)
7. Build the iOS framework with:
bazel build --copt=-fembed-bitcode --apple_bitcode=embedded --config=ios_arm64 mediapipe/examples/facial_search/ios:FacialSearch
Some linker warnings about global C++ symbols may appear. The flags --copt=-fembed-bitcode and --apple_bitcode=embedded enable bitcode generation.
8. Patch the Bazel product so it can be imported properly:
./mediapipe/examples/facial_search/ios/patch_ios_framework.sh bazel-bin/mediapipe/examples/facial_search/ios/FacialSearch.zip ObjcppLib.h
Note: append the contents of FRAMEWORK_HEADERS separated by spaces (here: ObjcppLib.h).
9. Open bazel-bin/mediapipe/examples/facial_search/ios and drag and drop the FacialSearch.framework folder into your app files (check: Copy items if needed > Finish)
10. Make sure the framework gets embedded into the app: in General > Frameworks, Libraries, and Embedded Content, set FacialSearch.framework to "Embed & Sign".
11. Connect your device and run.
Note the preprocessor statements at the top of FacialSearchViewController.swift.
There are two ways you can import our framework.
A. If it was imported using the technique above, just use: import FacialSearch
B. If, however, you used Tulsi or Bazel to build the app, you will have to use the longer form that reflects the Bazel target of the library the framework provides: import mediapipe_examples_facial_search_ios_ObjcppLib
Issues we faced
Why the need for patch_ios_framework.sh? It turns out Bazel's Apple rules do not yet generate importable iOS frameworks; that was simply not the intended usage of ios_framework. There is, however, an open issue tracking the addition of this feature.
Our script is a temporary workaround that adds the umbrella header and modulemap that Bazel does not create. These list the Objective-C headers of your framework, as well as the iOS system libraries that your code or MediaPipe's requires to run.
Conclusion
We built a machine learning inference graph that uses two TensorFlow Lite models: one provided by the MediaPipe team and one we developed at Powder.gg. We then discussed using subgraphs and creating our own MediaPipe calculators. Finally, we described how to bundle our graph into an iOS framework that can readily be imported into other Xcode projects.
Here's a link to a GitHub repository with the code and assets mentioned in this post.
We sincerely hope you enjoyed this tutorial and that it helps you on your way to creating applications with MediaPipe.