What Are Face Datasets?

An image dataset contains specially selected digital images intended to help train, test, and evaluate an artificial intelligence (AI) or machine learning (ML) algorithm, usually a computer vision algorithm. A face dataset is a type of image dataset that contains curated images of human faces, typically for an ML project. There are several publicly available face datasets that you can leverage instead of collecting your own training data. Managing and optimizing datasets for machine learning is one of the crucial stages in a machine learning operations (MLOps) pipeline.

Face datasets usually include faces in varying positions and lighting conditions, showing a full range of human emotions, ethnicities, ages, and additional characteristics. Face datasets are a major component of producing face recognition technologies. This field of computer vision has many use cases, including video surveillance, device security, and augmented reality (AR).

Top 3 Face Datasets

Here are three of the most widely used face datasets.

CelebFaces Attributes (CelebA) Dataset

CelebA is a large face attribute dataset containing over 200,000 images of celebrities, each with 40 annotations for various attributes. The images in the CelebA dataset include many variations of background and pose.

This dataset is useful for training or testing models for several computer vision tasks, such as face detection, face attribute recognition, facial landmark localization, face synthesis, and face image editing. The dataset is especially large, covering 10,177 celebrity identities and 202,599 face images, each annotated with five landmark locations and 40 binary attributes.

Flickr-Faces-HQ (FFHQ) Dataset

The FFHQ dataset contains human face images, offering even more variation than the CelebA dataset. It covers a wide variety of ages, ethnicities, and backgrounds, and includes significantly more variety in accessories like hats, eyeglasses, and sunglasses.
The images are taken from Flickr and have been automatically cropped and aligned. Originally intended as a benchmark for Generative Adversarial Networks (GANs), this dataset includes approximately 70,000 PNG images. The images are high quality, with a resolution of 1024×1024.

Labeled Faces in the Wild (LFW)

The LFW image dataset contains curated face photographs intended for research on unconstrained face recognition. It consists of four separate image sets: the original set and three aligned sets used for testing algorithms under different conditions. The aligned sets are LFW-a, funneled images (ICCV 2007), and deep-funneled images (NIPS 2012). LFW-a and deep-funneled images produce better results than the original or funneled images for most face recognition algorithms. This dataset has more than 13,000 face images collected from different online sources.

Working with the Leading Face Datasets

CelebA

Accessing the Dataset

The CelebA dataset is available from its official webpage. The webpage offers multiple download links for different variations of the dataset. There are ZIP files that contain both images and annotations, and the annotation files are also provided separately as text files. The webpage also links to the original dataset in a Baidu Drive folder.

Using CelebA in PyTorch

PyTorch provides the dataset directly through its torchvision.datasets module. Users can import the dataset directly and control the variation through parameters.
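As a quick illustration of that parameter-driven import, here is a minimal sketch. It assumes torchvision is installed; the helper name `load_celeba`, the `VALID_SPLITS` constant, and the crop and resize values are illustrative choices, not part of the library.

```python
from typing import Any

# Split names accepted by torchvision.datasets.CelebA.
VALID_SPLITS = ("train", "valid", "test", "all")


def load_celeba(root: str, split: str = "train", download: bool = False) -> Any:
    """Return a CelebA dataset whose items are (image tensor, attribute tensor).

    `load_celeba` is a hypothetical convenience wrapper, not a torchvision API.
    """
    if split not in VALID_SPLITS:
        raise ValueError(f"split must be one of {VALID_SPLITS}, got {split!r}")

    # Imported lazily so the argument check above works without torchvision.
    from torchvision import datasets, transforms

    # Illustrative preprocessing: square center crop, downscale, convert to tensor.
    preprocess = transforms.Compose([
        transforms.CenterCrop(178),
        transforms.Resize(64),
        transforms.ToTensor(),
    ])
    return datasets.CelebA(
        root=root,
        split=split,
        target_type="attr",  # 40 binary attribute labels per image
        transform=preprocess,
        download=download,
    )
```

Calling `load_celeba("./data", split="valid", download=True)` and indexing the result should give an image tensor together with a 40-value attribute tensor. Note that the automatic download depends on external hosting and may fail, in which case the files can be placed under `root` manually.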
The CelebA class has the following definition:

    torchvision.datasets.CelebA(root, split='train', target_type='attr', transform=None, target_transform=None, download=False)

Here is how each parameter is used:

root: specifies where the dataset will be downloaded to
split: specifies which part of the dataset is used; can be 'train', 'valid', 'test', or 'all'
transform: a function that transforms an image
target_type: specifies the type of target; can be one of the following values:
    attr: labels the attributes with binary values
    identity: labels each image with the person's identity
    bbox: specifies the dimensions of each image's bounding box
    landmarks: specifies each image's landmark features
download: if True, downloads the dataset and places it in root; nothing is downloaded again if the dataset is already there

Using CelebA in TensorFlow

TensorFlow lets users access the dataset directly through the tfds (TensorFlow Datasets) module. Users can download the dataset with the following command:

    tfds.load('celeb_a', split='train', download=True)

Since the dataset is pre-split into three categories ('train', 'test', and 'validation'), the split parameter controls which part of the dataset is loaded. The dataset comes with a feature dictionary in which each attribute is a boolean, so users can select images by the attributes they should have.

Flickr-Faces-HQ Dataset (FFHQ)

The FFHQ dataset came into use when researchers trained an architecture on it using an alternative generative modeling technique called MvM. The technique differs from a traditional GAN in that it models geometric quantities like p-diameters and centroids.

Accessing the Dataset

The dataset comes with JSON metadata, a download script, and documentation. There are two main ways to access the dataset:

Google Drive: The dataset is available for direct download from the official Google Drive link.
Download Script: The script comes with different options to download the images.
It can also verify checksums, retry when a download fails, and use multiple connections for downloading the dataset.

The script accepts the following arguments to customize the download process:

--json: Downloads the dataset's metadata as a JSON file
--stats: Displays the dataset's statistics
--images: Downloads the images in PNG format at a resolution of 1024×1024 pixels (total download size: 89.1 GB)
--thumbs: Downloads the images in PNG format at a resolution of 128×128 pixels (total download size: 1.95 GB)
--wilds: Downloads the original in-the-wild images in PNG format (total download size: 955 GB)
--tfrecords: Downloads the multi-resolution TFRecords (total download size: 273 GB)
--align: Recreates the 1024×1024 images from the in-the-wild images
--num_threads: Sets the number of concurrent threads used to download the dataset
--num_attempts: Sets the number of times the script should try to download each image file in the dataset
--no-rotation: Keeps the original orientation of the images instead of aligning them
--no-padding: Skips blur-padding around and near the image borders
--source-dir: Specifies the local directory with existing FFHQ source data

Labeled Faces in the Wild (LFW)

scikit-learn provides two loaders for the LFW dataset: fetch_lfw_people for face identification and fetch_lfw_pairs for face verification. This tutorial uses the memmapped version that the loaders cache in the ~/scikit_learn_data/lfw_home/ folder through the joblib utility.

Using the fetch_lfw_people Loader

This loader is intended for face identification, a supervised task that classifies faces into multiple classes. This tutorial shows how to import the LFW dataset and show the name of the person in each image.
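Showing the name that belongs to an image comes down to using the image's integer person id as an index into the list of names. The sketch below illustrates that lookup with made-up stand-in values (the names and ids are hypothetical, not read from LFW), so it runs without downloading anything. The same indexing works on the arrays the loader returns.

```python
from typing import List, Sequence

# Hypothetical stand-ins for the loader's name list and per-image person ids.
target_names = ["Person A", "Person B", "Person C"]
target = [2, 0, 0, 1]  # one integer person id per image


def names_for(labels: Sequence[int], names: Sequence[str]) -> List[str]:
    """Map each integer person id to the matching name by indexing."""
    return [str(names[label]) for label in labels]


image_names = names_for(target, target_names)
print(image_names)
```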
To use the fetch_lfw_people loader:

Use the following code to import the loader and fetch the dataset:

    from sklearn.datasets import fetch_lfw_people
    lfw_people = fetch_lfw_people(min_faces_per_person=70, resize=0.4)

Use the following code to print the names of the people in the dataset:

    for name in lfw_people.target_names:
        print(name)

Each face in the dataset is assigned a single person id in the target array. Use the following code to inspect the ground truth data in the target array:

    lfw_people.target.shape
    list(lfw_people.target[:10])

Using the fetch_lfw_pairs Loader

This loader comes in handy for checking whether two pictures belong to the same person. When calling the loader, it is important to specify which subset of the dataset to fetch.

To use the fetch_lfw_pairs loader:

Use the following code to import the loader, fetch the training subset, and list the classes assigned to the face image pairs:

    from sklearn.datasets import fetch_lfw_pairs
    lfw_pairs_train_subset = fetch_lfw_pairs(subset='train')
    list(lfw_pairs_train_subset.target_names)

The last command returns a list of two items:

    ['Different persons', 'Same person']

Conclusion

In this article, I covered three of the most popular face datasets you can use to build your own face recognition and face detection models: CelebA, FFHQ, and LFW. I showed technical details that can help you retrieve the datasets and use them in your model code. I hope this will give you a head start on your next computer vision project.