I'll only comment on the quality evaluation aspects of this study, as it is a 
field I know quite well (I did my PhD on the topic, albeit on video specifically). 

> I think the most important kind of comparison to do is a subjective blind 
> test with real people. This of course produces less accurate results, but 
> more meaningful ones.

I don't see how results can be both more meaningful and less accurate... Running 
subjective quality tests is not as trivial as it sounds, at least if you want 
meaningful results, as you say. Of course, you can throw a bunch of images at 
some naive observers with a nice web interface, but what about the differences 
between their screens? Between their lighting conditions? How do you screen 
people for the test (visual acuity, color blindness)? I've run more than 600 
test sessions with around 200 different observers. Each one of them was tested 
before the session, and a normalized (ITU-R BT.500) room was dedicated to the 
process. I don't want to brag, I just mean it's a complicated matter, and not 
as sexy as it sounds :-)
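
To be fair, the post-processing side is the easy part. A minimal sketch (raw 
opinion scores as a hypothetical NumPy array, observer screening omitted) of 
the usual MOS and 95% confidence interval computation after a BT.500-style 
session looks like this:

    import numpy as np

    # Hypothetical raw opinion scores on a 1-5 scale:
    # rows = observers, columns = test stimuli.
    scores = np.array([
        [4, 3, 5, 2],
        [5, 3, 4, 2],
        [4, 4, 4, 1],
        [3, 3, 5, 2],
    ])

    n_observers = scores.shape[0]

    # Mean Opinion Score per stimulus.
    mos = scores.mean(axis=0)

    # 95% confidence interval half-width (normal approximation,
    # as in the ITU-R BT.500 post-processing).
    ci95 = 1.96 * scores.std(axis=0, ddof=1) / np.sqrt(n_observers)

    for i, (m, ci) in enumerate(zip(mos, ci95)):
        print(f"stimulus {i}: MOS = {m:.2f} +/- {ci:.2f}")

The hard (and expensive) part is everything upstream of that array: the room, 
the screening, the observers.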

In this study, you used several objective quality criteria (Y-SSIM, RGB-SSIM, 
IW-SSIM, PSNR-HVS-M). You say yourself: "It's unclear which algorithm is best 
in terms of human visual perception, so we tested with four of the most 
respected algorithms." Still, the ultimate goal of your test is to compare 
different degrading systems (photography coders here) at equivalent *perceived* 
quality. As your graphs show, they don't produce very consistent results 
(especially RGB-SSIM). SSIM-based metrics are structural, which means they 
evaluate how the structure of the image differs from one version to the other, 
so they are very dependent on the content of the picture. Y-SSIM and IW-SSIM 
are applied to the luma channel only, which is not optimal in your case, as image 
coders tend to blend colors. Still, IW-SSIM is the best performer in [1] (but 
it was the subject of that study), so why not. Your results with RGB-SSIM are 
very different from the others, which disqualifies it for me. Besides, averaging 
SSIM over the R, G and B channels makes no sense for the human visual system 
(a small sketch below illustrates the difference).

PSNR-HVS-M has the advantage of weighting PSNR with a contrast sensitivity 
function (CSF), but it was designed on artificial artefacts, so you don't know 
how it performs on compression artefacts. None of these metrics has the human 
visual system at its heart; at best, they apply some HVS filter to PSNR or SSIM. 
For a more HVS-related metric, which tends to perform well (over 0.92 in 
correlation with subjective scores), look at [2] (from the lab I worked in). 
The code is a bit old now, but an R package seems to be available.
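
To make the luma-vs-RGB point concrete, here is a minimal sketch (Python with 
scikit-image; the file names and the exact luma weights are my assumptions, not 
those of the study) of SSIM computed on the luma channel versus SSIM averaged 
over the R, G and B channels:

    import numpy as np
    from skimage import io
    from skimage.metrics import structural_similarity as ssim

    # Hypothetical file names: a reference image and a compressed version,
    # both assumed to be 8-bit RGB of the same size.
    ref = io.imread("reference.png")[..., :3].astype(np.float64)
    deg = io.imread("compressed.png")[..., :3].astype(np.float64)

    def luma(img):
        # ITU-R BT.601 luma weights.
        return 0.299 * img[..., 0] + 0.587 * img[..., 1] + 0.114 * img[..., 2]

    # Y-SSIM: SSIM computed on the luma channel only.
    y_ssim = ssim(luma(ref), luma(deg), data_range=255.0)

    # "RGB-SSIM": SSIM per channel, then a plain average -- the scheme
    # criticized above, since equal R/G/B weighting has no perceptual basis.
    rgb_ssim = np.mean([
        ssim(ref[..., c], deg[..., c], data_range=255.0) for c in range(3)
    ])

    print(f"Y-SSIM   = {y_ssim:.4f}")
    print(f"RGB-SSIM = {rgb_ssim:.4f}")

On strongly colored content the two can rank coders differently, which may be 
part of why RGB-SSIM stands apart in your graphs.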

You cite [1], in which they compare five algorithms (PSNR, IW-PSNR, SSIM, MS-SSIM, 
and IW-SSIM) over six independent subject-rated image databases (the LIVE, 
Cornell A57, IVC, Toyama, TID2008, and CSIQ databases). These databases contain 
images and subjective quality evaluations obtained under normalized (i.e. 
repeatable) conditions. Most of them use JPEG and JPEG2000 compression, but not 
the other coders you want to test. The LIVE database is known for not being 
varied enough, which results in high correlations in most studies (and is one 
reason why other databases emerged). If you want to take this study further, 
consider using some of these data as a starting point.
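
For what it's worth, validating a metric against such a database usually boils 
down to correlating its scores with the published subjective scores. A minimal 
sketch (with made-up numbers standing in for real metric outputs and MOS/DMOS 
values) would be:

    import numpy as np
    from scipy.stats import pearsonr, spearmanr

    # Hypothetical values: objective metric scores and the corresponding
    # subjective scores (MOS or DMOS) taken from a database such as TID2008.
    metric_scores = np.array([0.91, 0.85, 0.78, 0.96, 0.70, 0.88])
    subjective_mos = np.array([4.2, 3.6, 3.1, 4.7, 2.5, 3.9])

    # Pearson measures linear agreement (often after fitting a logistic
    # mapping, omitted here); Spearman measures rank agreement.
    plcc, _ = pearsonr(metric_scores, subjective_mos)
    srocc, _ = spearmanr(metric_scores, subjective_mos)

    print(f"Pearson  (PLCC):  {plcc:.3f}")
    print(f"Spearman (SROCC): {srocc:.3f}")

That would at least tell you which of your four metrics tracks human judgement 
best on the distortions those databases do cover.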

Finally, be careful when you compute averages of values: did you check their 
distribution first?
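
For instance, something as simple as the following sketch (with hypothetical 
per-image scores) shows how a single outlier can pull the arithmetic mean away 
from what most images actually got; a histogram or a normality test is a cheap 
sanity check before averaging:

    import numpy as np
    from scipy import stats

    # Hypothetical per-image quality scores for one coder at one bitrate.
    values = np.array([0.97, 0.96, 0.95, 0.98, 0.62, 0.97, 0.96])

    print(f"mean   = {values.mean():.3f}")
    print(f"median = {np.median(values):.3f}")
    print(f"stddev = {values.std(ddof=1):.3f}")

    # Shapiro-Wilk test: a small p-value suggests the scores are not
    # normally distributed, so the mean alone can be misleading.
    stat, p = stats.shapiro(values)
    print(f"Shapiro-Wilk p-value = {p:.3f}")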

Stéphane Péchard

[1] https://ece.uwaterloo.ca/~z70wang/research/iwssim/
[2] http://www.irccyn.ec-nantes.fr/~autrusse/Komparator/index.html