A Perceptual Video Coding Based on JND Model

1. Introduction

High-definition video is becoming increasingly popular. However, the growth of storage capacity and network bandwidth cannot keep up with the demands of storing and transmitting high-resolution content. ITU-T and ISO/IEC therefore jointly released a new-generation video coding standard, HEVC [1]. HEVC still follows the traditional hybrid coding framework and exploits statistical correlation to remove spatial and temporal redundancy, aiming for the highest possible compression. However, the Human Visual System (HVS) [2], as the ultimate receiver of video, also leaves perceptual redundancy because of its own characteristics. Researchers have done a great deal of work to exploit this perceptual redundancy, and the most widely accepted tool is the just noticeable distortion (JND) model. JND-based video coding mainly relies on the visual masking mechanisms of the human eye: when the distortion is below the sensitivity threshold of the human eye, it cannot be perceived [3]. In recent years, the JND model has received wide attention in image and video coding [4] [5], digital watermarking [6], image quality assessment [7] and other areas. The JND models proposed so far fall into two categories: models based on the pixel domain and models based on a transform domain.

Pixel-domain JND models usually consider two main factors: luminance adaptation and contrast masking. C. H. Chou and Y. C. Li [8] proposed the first pixel-domain JND model, in which the larger of the computed luminance adaptation value and contrast masking value is used as the final JND threshold. Yang et al. [9] proposed the classical nonlinear additivity model for masking (NAMM), in which the two masking effects are added together to obtain the JND value, so that the interaction between them is taken into account to some extent. To address the limited precision of the contrast masking value in these methods, Liu [10] built on the NAMM model and used texture decomposition to assign different weights to texture regions and edge regions of the image, which improved the accuracy of the JND model. Wu [11] proposed a JND model based on luminance adaptation and structural similarity, which further considers the different sensitivity of the human eye to regular and irregular regions when computing texture masking.

Transform-domain JND models can easily incorporate the contrast sensitivity function (CSF), which gives them high accuracy. Since most image coding standards adopt the DCT, DCT-domain JND models have attracted much attention from researchers. Ahumada et al. [12] obtained a JND model for grayscale images by calculating the spatial CSF. Building on this, Watson [13] proposed the DCTune method, which further considers luminance adaptation and contrast masking. Zhang [14] made the JND model more accurate by adding a luminance adaptation factor and a contrast masking factor. Wei et al. [15] introduced gamma correction into the JND model and proposed a more accurate JND model for images and video.

2. Nonlinear Additivity Model for Masking

The NAMM model works in the pixel domain and combines luminance adaptation and texture masking to obtain the pixel-domain JND threshold. The pixel-domain JND estimate is written as a nonlinear additive combination of luminance adaptation and contrast masking, as shown in Equation (1):

${\text{JND}}_{\text{pixel}}\left(x,y\right)={T}^{l}\left(x,y\right)+{T}^{t}\left(x,y\right)-{C}^{lt}\cdot \mathrm{min}\left\{{T}^{l}\left(x,y\right),{T}^{t}\left(x,y\right)\right\}$ (1)

where ${T}^{l}\left(x,y\right)$ and ${T}^{t}\left(x,y\right)$ denote the basic thresholds due to background luminance adaptation and texture masking, respectively; ${C}^{lt}$ represents the overlap between the two effects and is used to adjust how they combine. The larger ${C}^{lt}$ is, the stronger the overlap between background luminance adaptation and texture masking. When ${C}^{lt}$ is 1, the overlap between the two factors is greatest; when ${C}^{lt}$ is 0, there is no overlap between them. In practice the overlap lies between these two extremes, and ${C}^{lt}$ is set to 0.3.
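As a small illustration of Equation (1), the combination can be written as a one-line function. This is only a sketch: the function and argument names are hypothetical, and the default of 0.3 is the value of $C^{lt}$ quoted above.

```python
def namm_jnd(t_lum: float, t_tex: float, c_lt: float = 0.3) -> float:
    """Pixel-domain JND of Equation (1) for a single pixel.

    t_lum -- luminance adaptation threshold T^l(x, y)
    t_tex -- texture masking threshold T^t(x, y)
    c_lt  -- overlap factor C^lt (0.3 in the text)
    """
    return t_lum + t_tex - c_lt * min(t_lum, t_tex)


if __name__ == "__main__":
    # Illustrative inputs only, not measured thresholds.
    print(namm_jnd(10.0, 4.0))
```

With `c_lt=0` the two thresholds simply add; with `c_lt=1` the result collapses to the larger of the two.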

Figure 1 shows the curve of the background luminance and the visual threshold obtained from the experimental results. It simulates the background luminance model and shows the distortion threshold that the human eye can tolerate under a certain background luminance.

${T}^{l}\left(x,y\right)$ can be determined according to the visual threshold curve in Figure 1.

${T}^{l}\left(x,y\right)=\begin{cases}17\left(1-\sqrt{\frac{{\bar{I}}_{Y}\left(x,y\right)}{127}}\right)+3, & {\bar{I}}_{Y}\left(x,y\right)\le 127\\ \frac{3}{128}\left({\bar{I}}_{Y}\left(x,y\right)-127\right)+3, & \text{others}\end{cases}$ (2)

where ${\bar{I}}_{Y}\left(x,y\right)$ is the average background luminance value.
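The piecewise threshold of Equation (2) can be sketched as follows. The sketch assumes the commonly used form in which the constant 3 is added outside the bracket, so that both branches meet at a background luminance of 127; the function name is hypothetical.

```python
import math


def luminance_threshold(bg: float) -> float:
    """Luminance adaptation threshold T^l for an average background
    luminance `bg` in the 8-bit range [0, 255].

    Assumes the lower branch is 17 * (1 - sqrt(bg / 127)) + 3.
    """
    if bg <= 127:
        return 17.0 * (1.0 - math.sqrt(bg / 127.0)) + 3.0
    return 3.0 / 128.0 * (bg - 127.0) + 3.0


if __name__ == "__main__":
    for bg in (0, 64, 127, 200, 255):
        print(bg, round(luminance_threshold(bg), 3))
```

The threshold is largest for very dark backgrounds, falls to its minimum around mid-gray, and rises slowly again for bright backgrounds, which is the shape described for Figure 1.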

Due to the characteristics of the HVS, distortion in plain and edge areas is more noticeable than distortion in texture areas. To estimate the JND threshold more accurately, edge and non-edge regions must therefore be distinguished. Taking the edge information into account, the texture masking threshold ${T}^{t}\left(x,y\right)$ is calculated as:

${T}^{t}\left(x,y\right)=\beta {G}_{\theta}\left(x,y\right){W}_{\theta}\left(x,y\right)$ (3)

where β is a control parameter set to 0.117; ${G}_{\theta}\left(x,y\right)$ denotes the maximal weighted average of gradients around the pixel at (x, y); ${W}_{\theta}\left(x,y\right)$ is the edge-related weight of the pixel at (x, y), and its corresponding matrix ${W}_{\theta}$ is obtained with a Gaussian low-pass filter.

Figure 1. Background luminance and visual threshold.

${G}_{\theta}\left(x,y\right)$ is defined as:

${G}_{\theta}\left(x,y\right)=\underset{k=1,2,3,4}{\mathrm{max}}\left\{{\mathrm{grad}}_{\theta ,k}\left(x,y\right)\right\}$ (4)

with

${\mathrm{grad}}_{\theta ,k}\left(x,y\right)=\frac{1}{16}\sum_{i=1}^{5}\sum_{j=1}^{5}{I}_{\theta}\left(x-3+i,y-3+j\right)\times {g}_{k}\left(i,j\right)$ (5)

where, ${g}_{k}\left(i,j\right)$ are four directional high-pass filters for texture detection, as shown in Figure 2.
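Equations (4) and (5) amount to applying each 5 × 5 kernel at the pixel and keeping the strongest response. The sketch below makes several assumptions: the image is a plain list of rows indexed as `img[x][y]`, the gradient magnitude (absolute value) is used when taking the maximum, and the two demo kernels are placeholders chosen for illustration, not necessarily the filters of Figure 2.

```python
from typing import Sequence

Matrix = Sequence[Sequence[float]]


def directional_gradient(img: Matrix, x: int, y: int, kernel: Matrix) -> float:
    """Equation (5): weighted sum over the 5x5 window centred on (x, y)."""
    total = 0.0
    for i in range(1, 6):
        for j in range(1, 6):
            total += img[x - 3 + i][y - 3 + j] * kernel[i - 1][j - 1]
    return total / 16.0


def max_weighted_gradient(img: Matrix, x: int, y: int,
                          kernels: Sequence[Matrix]) -> float:
    """Equation (4), taking the magnitude of each response (assumption)."""
    return max(abs(directional_gradient(img, x, y, k)) for k in kernels)


# Placeholder high-pass kernel and its transpose (hypothetical values).
DEMO_KERNEL = [
    [0, 0, 0, 0, 0],
    [1, 3, 8, 3, 1],
    [0, 0, 0, 0, 0],
    [-1, -3, -8, -3, -1],
    [0, 0, 0, 0, 0],
]
DEMO_KERNEL_T = [list(col) for col in zip(*DEMO_KERNEL)]

if __name__ == "__main__":
    # Step edge: the first two rows are bright, the rest dark.
    step = [[10.0] * 5 if r <= 1 else [0.0] * 5 for r in range(5)]
    print(max_weighted_gradient(step, 2, 2, [DEMO_KERNEL, DEMO_KERNEL_T]))
```

The caller is responsible for keeping (x, y) at least two pixels away from the image border; border handling is not specified in the text.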

3. Improved JND Model Based on DCT Domain

A typical DCT-domain JND model is expressed as the product of a base threshold and several modulation factors. Let t denote the frame index in the video sequence, n the block index in the t-th frame, and (i, j) the DCT coefficient index. The corresponding JND threshold can then be expressed as:

${\text{JND}}_{\text{DCT}}\left(n,i,j,t\right)=T\left(n,i,j,t\right)\times {a}_{\text{Lum}}\left(n,t\right)\times {a}_{\text{Contrast}}\left(n,i,j,t\right)$ (6)

where $T\left(n,i,j,t\right)$ is the spatial-temporal base distortion threshold, which is calculated from the spatial-temporal contrast sensitivity function; ${a}_{\text{Lum}}\left(n,t\right)$ is the luminance adaptation factor; ${a}_{\text{Contrast}}\left(n,i,j,t\right)$ is the contrast masking factor.

3.1. Spatial-Temporal Contrast Sensitivity Function

Psychophysical experiments show that the visual sensitivity of the human eye depends on the spatial frequency and temporal frequency of the input signal. The contrast sensitivity function (CSF) is usually used to quantify this relationship. It is defined as the inverse of the smallest contrast change that the human eye can perceive. The spatial-temporal CSF curve is shown in Figure 3. For the (i, j)-th DCT coefficient of the n-th block in the t-th frame, the corresponding CSF can be written as:

$\begin{array}{c}G\left(n,i,j,t\right)={c}_{0}\left({k}_{1}+{k}_{2}{\left|\mathrm{log}\left(\epsilon \cdot \nu \left(n,t\right)/3\right)\right|}^{3}\right)\cdot \nu \left(n,t\right)\cdot {\left(2\pi {\rho}_{i,j}\right)}^{2}\\ \cdot \mathrm{exp}\left(-2\pi {\rho}_{i,j}\cdot {c}_{1}\cdot \left(\epsilon \cdot \nu \left(n,t\right)+2\right)/{k}_{3}\right)\end{array}$ (7)

where $\nu \left(n,t\right)$ is the associated retinal image velocity; the empirical constants ${k}_{1}$, ${k}_{2}$ and ${k}_{3}$ are set to 6.1, 7.3 and 23, respectively; ${c}_{0}$ and ${c}_{1}$ control the magnitude and the bandwidth of the CSF curve; ${\rho}_{i,j}$ is the spatial subband frequency:

${\rho}_{i,j}=\frac{1}{2N}\sqrt{{\left(i/{\varpi}_{x}\right)}^{2}+{\left(j/{\varpi}_{y}\right)}^{2}}$ (8)

where N is the dimension of the DCT block, and ${\varpi}_{x}$ and ${\varpi}_{y}$ are the horizontal and vertical sizes of a pixel in degrees of visual angle, respectively. They are related to the viewing distance l and the displayed size Λ of a pixel on the monitor as follows:

Figure 2. Directional high-pass filters for texture detection.

Figure 3. Spatial CSF at different retinal velocities.

${\varpi}_{h}=2\cdot \mathrm{arctan}\left(\frac{{\Lambda}_{h}}{2\cdot l}\right),\quad h=x,y$ (9)
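Equations (7) to (9) can be chained as in the sketch below. Several choices are assumptions rather than facts from the text: the logarithm is taken as base 10, the vertical term of Equation (8) is read as j divided by the vertical visual angle, and ε, c0 and c1 are left as required arguments because their values are not given here. All function names are hypothetical.

```python
import math

K1, K2, K3 = 6.1, 7.3, 23.0  # empirical constants quoted in the text


def pixel_visual_angle(pixel_size: float, viewing_distance: float) -> float:
    """Equation (9): visual angle of one pixel, in degrees.

    Both arguments must use the same length unit.
    """
    return math.degrees(2.0 * math.atan(pixel_size / (2.0 * viewing_distance)))


def subband_frequency(i: int, j: int, n: int, w_x: float, w_y: float) -> float:
    """Equation (8): spatial frequency of DCT subband (i, j) of an n x n block."""
    return math.sqrt((i / w_x) ** 2 + (j / w_y) ** 2) / (2.0 * n)


def csf(v: float, rho: float, c0: float, c1: float, eps: float) -> float:
    """Equation (7): spatio-temporal contrast sensitivity.

    v must be positive, since the logarithm is undefined at zero velocity.
    """
    a = 2.0 * math.pi * rho
    gain = K1 + K2 * abs(math.log10(eps * v / 3.0)) ** 3  # base 10 assumed
    return c0 * gain * v * a ** 2 * math.exp(-a * c1 * (eps * v + 2.0) / K3)


if __name__ == "__main__":
    # Hypothetical viewing setup: 0.25 mm pixels viewed from 500 mm.
    w = pixel_visual_angle(0.25, 500.0)
    rho = subband_frequency(1, 1, 8, w, w)
    print(w, rho, csf(1.0, rho, c0=1.0, c1=1.0, eps=1.0))
```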

When Equation (7) is used to predict the distortion threshold due to the spatial-temporal CSF, several factors need to be considered: 1) the sensitivity modeled by Equation (7) is the inverse of the distortion threshold; 2) the CSF threshold, which is expressed in luminance, needs to be scaled to the gray levels of a digital image; 3) since Equation (7) comes from experimental data for one-dimensional spatial frequency, the actual threshold for any subband is higher than the one given by Equation (7), so a compensation term needs to be introduced for each DCT subband. Taking all of these considerations into account, the base threshold for a DCT subband is determined as:

$T\left(n,i,j,t\right)=\frac{1}{G\left(n,i,j,t\right)}\times \frac{M}{{\Phi}_{i}{\Phi}_{j}\left({L}_{\mathrm{max}}-{L}_{\mathrm{min}}\right)}\times \frac{1}{r+\left(1-r\right){\mathrm{cos}}^{2}{\theta}_{i,j}}$ (10)

where ${L}_{\mathrm{max}}$ and ${L}_{\mathrm{min}}$ are the display luminance values corresponding to the maximum and minimum gray levels, respectively; M is the number of gray levels, generally 256; ${\Phi}_{i}$ and ${\Phi}_{j}$ are the DCT normalization factors; ${\theta}_{i,j}$ accounts for the orientation of an arbitrary subband; r is set to 0.6.
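Equation (10) can be sketched as below, taking the CSF value G and the angle θ as inputs. The normalization factors are assumed to be those of the orthonormal DCT (sqrt(1/N) for index 0, sqrt(2/N) otherwise), which the text does not spell out; the function names and the luminance range in the demo are hypothetical.

```python
import math


def dct_normalization(k: int, n: int) -> float:
    """Orthonormal DCT scaling factor for index k of an n-point DCT (assumed)."""
    return math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)


def base_threshold(g: float, i: int, j: int, theta: float,
                   l_max: float, l_min: float,
                   n: int = 8, m: int = 256, r: float = 0.6) -> float:
    """Equation (10): base threshold of DCT subband (i, j).

    g     -- CSF value G(n, i, j, t) from Equation (7)
    theta -- subband angle in radians
    """
    # Factor 1: sensitivity inverted into a threshold.
    inverse_sensitivity = 1.0 / g
    # Factor 2: luminance scaled to gray levels, with DCT normalization.
    phi = dct_normalization(i, n) * dct_normalization(j, n)
    gray_scale = m / (phi * (l_max - l_min))
    # Factor 3: compensation for the subband orientation.
    oblique = 1.0 / (r + (1.0 - r) * math.cos(theta) ** 2)
    return inverse_sensitivity * gray_scale * oblique


if __name__ == "__main__":
    # Illustrative numbers only.
    print(base_threshold(g=50.0, i=1, j=2, theta=0.3, l_max=100.0, l_min=0.5))
```

The three factors correspond one to one to the three considerations listed before Equation (10).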

3.2. Luminance Adaptive Factor and Contrast Masking Factor

The luminance masking mechanism is related to brightness changes in the image. According to the Weber-Fechner law, the minimum luminance difference perceptible to the human eye is higher in areas with a brighter or darker background, which is called the luminance adaptation effect. The luminance adaptation factor is calculated as:

${a}_{\text{Lum}}\left(n,t\right)=\begin{cases}\left(60-\bar{I}\right)/150+1, & \bar{I}\le 60\\ 1, & 60<\bar{I}<170\\ \left(\bar{I}-170\right)/425+1, & \bar{I}\ge 170\end{cases}$ (11)

where $\bar{I}$ represents the average brightness.
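Equation (11) translates directly into a three-branch function. The sketch below uses a hypothetical function name and assumes the average brightness is given on the 8-bit scale.

```python
def luminance_adaptation(mean_intensity: float) -> float:
    """Luminance adaptation factor a_Lum of Equation (11).

    mean_intensity -- average brightness of the block, in [0, 255]
    """
    if mean_intensity <= 60:
        return (60 - mean_intensity) / 150 + 1
    if mean_intensity >= 170:
        return (mean_intensity - 170) / 425 + 1
    return 1.0


if __name__ == "__main__":
    for value in (0, 60, 128, 170, 255):
        print(value, luminance_adaptation(value))
```

The factor equals 1 in the mid-gray range and grows linearly toward both the dark and the bright end, so dark and bright blocks receive a larger JND threshold.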

The contrast masking effect is an important perceptual property of the HVS and refers to the reduced visibility of one signal in the presence of another. To calculate the contrast masking factor, the Canny edge detector is first applied to the image, and the image blocks are classified into three types: plain, edge and texture. Since the human eye is more sensitive to distortion in plain and edge areas, different weights need to be assigned to different block types. Based on these considerations, the weighting factor for each block type is determined by the following equation:

$\psi =\begin{cases}1, & \text{in plain and edge regions}\\ 2.25, & \text{in texture regions and } \left({i}^{2}+{j}^{2}\right)\le 16\\ 1.25, & \text{in texture regions and } \left({i}^{2}+{j}^{2}\right)>16\end{cases}$ (12)

where i and j are the DCT coefficient indices.

Taking the masking effect in the intra frame into account, the final contrast masking factor is:

${a}_{\text{Contrast}}\left(n,i,j,t\right)=\begin{cases}\psi , & \text{in plain and edge regions and } \left({i}^{2}+{j}^{2}\right)\le 16\\ \psi \cdot \mathrm{min}\left(4,\mathrm{max}\left(1,{\left(\frac{C\left(n,i,j,t\right)}{T\left(n,i,j,t\right)\cdot {a}_{\text{Lum}}\left(n,t\right)}\right)}^{0.36}\right)\right), & \text{others}\end{cases}$ (13)
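Equations (12) and (13) can be sketched together as below. The block type is assumed to be available as one of the strings "plain", "edge" or "texture" (a hypothetical encoding of the Canny-based classification), and the magnitude of the DCT coefficient is used so that the fractional power stays real; both are assumptions of this sketch.

```python
def weight_factor(region: str, i: int, j: int) -> float:
    """Equation (12): weighting factor psi for a block type and subband."""
    if region in ("plain", "edge"):
        return 1.0
    return 2.25 if i * i + j * j <= 16 else 1.25


def contrast_masking(region: str, i: int, j: int,
                     coeff: float, base_t: float, a_lum: float) -> float:
    """Equation (13): contrast masking factor.

    coeff  -- DCT coefficient C(n, i, j, t)
    base_t -- base threshold T(n, i, j, t)
    a_lum  -- luminance adaptation factor a_Lum(n, t)
    """
    psi = weight_factor(region, i, j)
    if region in ("plain", "edge") and i * i + j * j <= 16:
        return psi
    ratio = abs(coeff) / (base_t * a_lum)  # magnitude assumed
    return psi * min(4.0, max(1.0, ratio ** 0.36))


if __name__ == "__main__":
    print(contrast_masking("texture", 1, 1, coeff=40.0, base_t=2.0, a_lum=1.0))
```

The `min`/`max` pair clamps the masking term to the range [1, 4], so the factor never drops below ψ and never exceeds 4ψ.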

4. Simulation Results

4.1. Evaluation of the Improved JND Model Based on Transform Domain

To verify the effectiveness of the proposed DCT-domain JND model, we selected eight test images with different content and complexity, shown in Figure 4, for simulation experiments. In theory, at a given visual quality, the larger the thresholds of a JND model are, the more visual redundancy it exploits; and for the same injected noise energy, a more accurate JND model yields better perceived quality. To verify the validity of the model, the thresholds computed by each JND model are injected as noise into the DCT coefficients:

Figure 4. Eight test images. (a) Bikes; (b) Buildings; (c) Caps; (d) House; (e) Monarch; (f) Painted house; (g) Sailing 1; (h) Sailing 4.

${C}_{\text{noise}}\left(n,i,j,t\right)=C\left(n,i,j,t\right)+{M}_{n,i,j}^{\text{random}}\cdot \text{JND}\left(n,i,j,t\right)$ (14)

where $C\left(n,i,j,t\right)$ and ${C}_{\text{noise}}\left(n,i,j,t\right)$ are the DCT coefficients before and after noise injection, and ${M}_{n,i,j}^{\text{random}}$ randomly takes the value +1 or −1.
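The noise injection of Equation (14) can be sketched on flat lists of coefficients and thresholds. The function name is hypothetical, and a seeded `random.Random` instance is passed in only to make the demo repeatable.

```python
import random
from typing import List, Sequence


def inject_jnd_noise(coeffs: Sequence[float], jnd: Sequence[float],
                     rng: random.Random) -> List[float]:
    """Equation (14): add +JND or -JND to each DCT coefficient at random."""
    return [c + rng.choice((-1, 1)) * t for c, t in zip(coeffs, jnd)]


if __name__ == "__main__":
    rng = random.Random(0)
    coeffs = [120.0, -35.5, 8.0, 0.0]     # illustrative coefficients
    thresholds = [4.0, 2.5, 1.0, 0.5]     # illustrative JND values
    print(inject_jnd_noise(coeffs, thresholds, rng))
```

Every coefficient moves by exactly its own JND threshold, so a model with larger thresholds injects more noise energy.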

The JND model presented in this paper is compared with three other models, as shown in Table 1. As the table shows, the PSNR obtained with the proposed model is the smallest. At the same visual quality, a smaller PSNR means that more noise energy has been injected and that the corresponding JND threshold is larger. The larger JND threshold obtained by the proposed model therefore tolerates more distortion, which indicates that the accuracy of the model has been further improved.
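For reference, the PSNR used in this comparison can be computed as in the sketch below. It assumes 8-bit images (peak value 255) given as flat sequences of pixel values; the function name is hypothetical.

```python
import math
from typing import Sequence


def psnr(original: Sequence[float], distorted: Sequence[float],
         peak: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between two equally sized images."""
    mse = sum((a - b) ** 2 for a, b in zip(original, distorted)) / len(original)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(peak ** 2 / mse)


if __name__ == "__main__":
    reference = [100.0, 120.0, 140.0, 160.0]   # illustrative pixel values
    noisy = [103.0, 118.0, 141.0, 156.0]
    print(psnr(reference, noisy))
```

A lower PSNR at the same perceived quality means that more noise was hidden below the visibility threshold, which is how Table 1 is read.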

4.2. The Overall Performance of the Perceptual Video Coding Scheme

To make full use of the JND characteristics of the human visual system and reduce the perceptual redundancy of the input video, we integrated the designed JND model into the HEVC coding framework. For the transform skip mode, we use the existing pixel-domain JND model; for the transform non-skip mode, we use the proposed DCT-domain JND model. Figure 5 shows the overall framework of the perceptual video coding scheme.

To verify the effectiveness of the proposed algorithm, it is implemented on HM11.0 using the all-intra (I-frame only) encoding configuration. The quantization parameters are set to 22, 27, 32 and 37. The test sequences include Kimono and Cactus at 1920 × 1080, BQMall and PartyScene at 832 × 480, and BasketballDrillText and ChinaSpeed for screen content coding. We evaluate the algorithm in terms of bit rate reduction and encoding time. Relative to HM11.0, the bit rate change and the encoding time change of the perceptual video coding scheme are calculated as follows:

Table 1. PSNR between different models.

Figure 5. Overall coding framework.

$\Delta \text{Bitrate}=\frac{{\text{Bitrate}}_{\text{Pro}}-{\text{Bitrate}}_{\text{ref}}}{{\text{Bitrate}}_{\text{Pro}}}\times 100\%$ (15)

$\Delta \text{Time}=\frac{{\text{Time}}_{\text{Pro}}-{\text{Time}}_{\text{ref}}}{{\text{Time}}_{\text{Pro}}}\times 100\%$ (16)
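Equations (15) and (16) share the same form, so a single helper covers both. The sketch follows the equations as printed, with the proposed scheme's value in the denominator; the function name and the sample numbers are hypothetical.

```python
def relative_change(proposed: float, reference: float) -> float:
    """Equations (15)/(16): percentage change of the proposed scheme
    relative to the HM reference, normalized by the proposed value."""
    return (proposed - reference) / proposed * 100.0


if __name__ == "__main__":
    # Illustrative values, not results from the paper.
    print(relative_change(proposed=80.0, reference=100.0))    # bit rate
    print(relative_change(proposed=105.0, reference=100.0))   # encoding time
```

A negative value means the proposed scheme uses fewer bits (or less time) than the reference.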

Table 2 compares the performance of the proposed algorithm with the schemes of Chen [17] and Bae [4] under different quantization parameters. The experimental results show that the proposed algorithm reduces the encoding bit rate by 4.3% compared with the algorithm of [4], and by up to 7.58% compared with Chen's algorithm.

To show the bit rate reduction of each algorithm more intuitively, Figure 6 compares the bit rates at different QP values. Our method saves more bits than the other methods in most cases. It can also be seen that the smaller the QP value is, the more bits are saved. This is because finer quantization results in a larger JND threshold.

Table 2. Comparison of the performance of each scheme.

Figure 6. Comparisons of the bitrates.

5. Conclusions

In this paper, we introduce a pixel-domain JND model and a DCT-domain JND model into the HEVC framework. Each model has its own advantages: the pixel-domain model gives the JND threshold directly in the pixel domain and is easier to compute, while the DCT-domain model integrates the CSF and produces a more accurate estimate. Combining the advantages of the two, we use the pixel-domain JND model for the transform skip mode and the more accurate DCT-domain JND model for the transform non-skip mode. Simulation results show that, compared with other models, the proposed algorithm saves up to 7.58% of the coding bit rate.

References

[1] Sullivan, G.J., Ohm, J., Han, W.J. and Wiegand, T. (2012) Overview of the High Efficiency Video Coding (HEVC) Standard. IEEE Transactions on Circuits & Systems for Video Technology, 22, 1649-1668.

https://doi.org/10.1109/TCSVT.2012.2221191

[2] Wu, H.R. and Rao, K.P. (2005) Digital Video Image Quality and Perceptual Coding. CRC Press, Boca Raton, FL, USA.

https://doi.org/10.1201/9781420027822

[3] Jayant, N., Johnston, J. and Safranek, R. (1993) Signal Compression Based on Models of Human Perception. Proceedings of the IEEE, 81, 1385-1422.

https://doi.org/10.1109/5.241504

[4] Kim, J., Bae, S.H. and Kim, M. (2015) An HEVC-Compliant Perceptual Video Coding Scheme Based on JND Models for Variable Block-Sized Transform Kernels. IEEE Transactions on Circuits and Systems for Video Technology, 25, 1786-1800.

https://doi.org/10.1109/TCSVT.2015.2389491

[5] Ki, S., Bae, S.H., Kim, M. and Ko, H. (2018) Learning-Based Just Noticeable Quantization Distortion Modeling for Perceptual Video Coding. IEEE Transactions on Image Processing, 27, 3178-3193.

https://doi.org/10.1109/TIP.2018.2818439

[6] Wan, W., Liu, J., Sun, J., Ge, C. and Nie, X. (2015) Logarithmic STDM Watermarking Using Visual Saliency-Based JND Model. Electronics Letters, 51, 758-760.

https://doi.org/10.1049/el.2014.4329

[7] Wang, H. (2016) MCL-JCV: A JND-Based H.264/AVC Video Quality Assessment Dataset. 2016 IEEE International Conference on Image Processing, 25-28 September 2016, Phoenix, AZ, 1509-1513.

https://doi.org/10.1109/ICIP.2016.7532610

[8] Chou, C.H. and Li, Y.C. (1995) A Perceptual Tuned Sub-Band Image Coder Based on the Measure of Just-Noticeable-Distortion Profile. IEEE Transactions on Circuits & Systems for Video Technology, 5, 467-476.

https://doi.org/10.1109/76.475889

[9] Yang, X.K., Ling, W.S., Lu, Z.K., Ong, E.P. and Yao, S.S. (2005) Just Noticeable Distortion Model and Its Applications in Video Coding. Signal Processing Image Communication, 20, 662-680.

https://doi.org/10.1016/j.image.2005.04.001

[10] Liu, A., Lin, W., Paul, M., Deng, C. and Zhang, F. (2010) Just Noticeable Difference for Images with Decomposition Model for Separating Edge and Textured Regions. IEEE Transactions on Circuits & Systems for Video Technology, 20, 1648-1652.

https://doi.org/10.1109/TCSVT.2010.2087432

[11] Wu, J.J., Qi, F. and Shi, M. (2012) Self-Similarity Based Structural Regularity for Just Noticeable Difference Estimation. Journal of Visual Communication and Image Representation, 23, 845-852.

https://doi.org/10.1016/j.jvcir.2012.04.010

[12] Ahumada Jr., A.J. and Peterson, H.A. (1992) Luminance-Model-Based DCT Quantization for Color Image Compression. Human Vision, Visual Processing, & Digital Display III, 1666.

https://doi.org/10.1117/12.135982

[13] Watson, A.B. (1993) A Technique for Visual Optimization of DCT Quantization Matrices for Individual Images. Society for Display Digest of Technical Papers XXIV, 9, 946-949.

https://doi.org/10.2514/6.1993-4512

[14] Zhang, X.H., Lin, W.S. and Xue, P. (2005) Improved Estimation for Just-Noticeable Visual Distortion. Signal Processing, 85, 795-808.

https://doi.org/10.1016/j.sigpro.2004.12.002

[15] Wei, Z. and Ngan, K.N. (2009) Spatio-Temporal Just Noticeable Distortion Profile for Grey Scale Image/Video in DCT Domain. IEEE Transactions on Circuits and Systems for Video Technology, 19, 337-346.

https://doi.org/10.1109/TCSVT.2009.2013518

[16] Wu, J., Shi, G., Lin, W., Liu, A. and Qi, F. (2013) Just Noticeable Difference Estimation for Images with Free-Energy Principle. IEEE Transactions on Multimedia, 15, 1705-1710.

https://doi.org/10.1109/TMM.2013.2268053

[17] Chen, Z. and Guillemot, C. (2010) Perceptually-Friendly H.264/AVC Video Coding Based on Just-Noticeable-Distortion Model. IEEE Transactions on Circuits and Systems for Video Technology, 20, 806-819.

https://doi.org/10.1109/TCSVT.2010.2045912