Authors:
(1) Joshua P. Ebenezer, Student Member, IEEE, Laboratory for Image and Video Engineering, The University of Texas at Austin, Austin, TX, 78712, USA, contributed equally to this work (e-mail: joshuaebenezer@utexas.edu);
(2) Zaixi Shang, Student Member, IEEE, Laboratory for Image and Video Engineering, The University of Texas at Austin, Austin, TX, 78712, USA, contributed equally to this work;
(3) Yixu Chen, Amazon Prime Video;
(4) Yongjun Wu, Amazon Prime Video;
(5) Hai Wei, Amazon Prime Video;
(6) Sriram Sethuraman, Amazon Prime Video;
(7) Alan C. Bovik, Fellow, IEEE, Laboratory for Image and Video Engineering, The University of Texas at Austin, Austin, TX, 78712, USA.
To the best of our knowledge, there do not exist any studies that compare the subjective qualities of videos as the dynamic range, resolution, and compression levels are all varied. Existing databases such as LIVE Livestream [2], LIVE ETRI [3], LIVE YTHFR [4], AVT UHD [5], and APV LBMFR [6] study the subjective quality of professionally-generated SDR videos under conditions of downsampling, compression, and source distortions. Other datasets, including Konvid-1k [7], YouTube UGC [8], and LSVQ [9], study the quality of SDR user-generated content (UGC). UGC databases are typically much larger than those that study the quality of professional-grade content, because UGC studies can be conducted online via crowdsourcing owing to looser requirements on the display devices, resolution, and bitrate. LIVE HDR [10], LIVE AQ HDR [11], and APV HDR Sports [12] are recent databases that study the quality of professionally-created HDR videos that have been downsampled and compressed at various resolutions and bitrates.
Each of the above-mentioned databases studies the quality of either SDR or HDR videos that have been subjected to distortions, but not both. Here we present the first subjective study that compares the quality of HDR and SDR videos of the same content that have been processed by downscaling and compression. We conducted the study on a variety of display devices using different technologies and having differing capabilities.
While subjective human scores from studies like the one we conducted are considered the gold standard of video quality, conducting such studies is expensive and not scalable. Objective video quality assessment (VQA) metrics, by contrast, are designed and trained to predict video quality automatically and can be quite economical and scalable. These fall into two categories: Full-Reference (FR) and No-Reference (NR) models. FR VQA models take as input both pristine and distorted videos to measure the quality of the distorted videos. NR metrics only have access to distorted videos when predicting quality, hence designing them is a more challenging problem. NR VQA models are relevant for video source inspection as well as for measuring quality when no source video is available.
PSNR measures the peak signal-to-noise ratio between a reference frame and a distorted version of the same frame. SSIM [13] incorporates luminance, contrast, and structure features to predict the quality of distorted images. VMAF [14] models the statistics of the wavelet coefficients of video frames, as well as the detail losses from distortions. SpEED [15] measures the difference in entropy of bandpass coefficients of reference and distorted videos. STRRED [16] models the statistics of space-time video wavelet coefficients altered by distortions. STGREED [17] measures differences in temporal and spatial entropy arising from distortions to model the quality of videos having varying frame rates and bitrates.
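As a concrete illustration of the simplest FR measure listed above, the following is a minimal sketch of per-frame PSNR. It is not taken from the paper: the function name `psnr`, the representation of frames as 2-D lists of pixel values, and the default `peak` of 255 (8-bit content) are assumptions made for this example. For 10-bit content, the peak would instead be 1023.

```python
import math


def psnr(reference, distorted, peak=255.0):
    """Return PSNR in dB between two equally sized frames (2-D lists of pixel values).

    `peak` is the maximum possible pixel value; 255 assumes 8-bit frames.
    """
    same_rows = len(reference) == len(distorted)
    if not same_rows or any(len(r) != len(d) for r, d in zip(reference, distorted)):
        raise ValueError("frames must have the same shape")

    n_pixels = sum(len(row) for row in reference)
    if n_pixels == 0:
        raise ValueError("frames must not be empty")

    # Mean squared error over all pixels of the frame.
    squared_error = sum(
        (r - d) ** 2
        for ref_row, dist_row in zip(reference, distorted)
        for r, d in zip(ref_row, dist_row)
    )
    mse = squared_error / n_pixels

    # Identical frames have zero error, so PSNR is unbounded.
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(peak ** 2 / mse)
```

A video-level score is commonly obtained by averaging per-frame values, although the pooling strategy is a design choice that this sketch leaves open.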
BRISQUE [18], VBLIINDS [19], VIDEVAL [20], RAPIQUE [21], ChipQA [22], HDR ChipQA [23] and NIQE [24] are NR video quality metrics that rely on neurostatistical models of visual perception. Pristine videos are known to follow certain regular statistics when processed using visual neural models. Distortions predictably alter the statistics of perceptually processed videos, allowing for the design of accurate VQA models. RAPIQUE combines features developed under these models with (semantic) video features provided by a pre-trained deep network. TLVQM [25] explicitly models common distortions such as compression, blur, and flicker, using a variety of spatial and temporal filters and heuristics.
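One common form of the perceptual processing mentioned above is the mean-subtracted, contrast-normalized (MSCN) transform used by BRISQUE and NIQE, which applies local divisive normalization to a luminance frame. The sketch below is illustrative only and is not drawn from the paper: the function name `mscn`, the `radius` parameter, and the use of a uniform box window are assumptions of this example. The published models use a Gaussian-weighted window instead.

```python
import math


def mscn(frame, radius=1, c=1.0):
    """Return MSCN coefficients of a frame given as a 2-D list of luminance values.

    Uses a uniform (box) window of half-width `radius`, clipped at the borders.
    This is a simplification; BRISQUE and NIQE use a Gaussian-weighted window.
    `c` is a small constant that stabilizes the division in flat regions.
    """
    height, width = len(frame), len(frame[0])
    out = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            # Collect the local neighborhood, clipped to the frame boundaries.
            window = [
                frame[j][i]
                for j in range(max(0, y - radius), min(height, y + radius + 1))
                for i in range(max(0, x - radius), min(width, x + radius + 1))
            ]
            local_mean = sum(window) / len(window)
            local_var = sum((v - local_mean) ** 2 for v in window) / len(window)
            # Subtract the local mean, then divide by the local standard deviation.
            out[y][x] = (frame[y][x] - local_mean) / (math.sqrt(local_var) + c)
    return out
```

NR models of this family then fit parametric distributions to such coefficients and use the fitted parameters as quality-aware features; that fitting step is omitted here.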
This paper is available on arXiv under a CC 4.0 license.