Image Forensics

Motivation

A picture is worth a thousand words. Unfortunately, this truism has sometimes motivated people to create pictures artificially in cases where existing images do not make the desired point. The history of image forgery goes back millennia. However, in the digital age the tools for doing so have become more widely available and easier to use. Thus the need to be vigilant for image forgeries is greater than ever.

Image Forensics refers to the practice of studying an image to investigate the conditions of its creation, particularly to discover whether any modification has been made to the image since it was captured photographically. Some of the techniques used in image forensics overlap with computer vision research, while others do not. For example, one way to check an image's authenticity is to examine the metadata stored along with the image, including the details of its compression. This may provide a wealth of information, but does not rely on computer vision. Other methods examine statistics of the image pixel values, attempting to discover deviations from what would normally be expected in an unmodified image.

JPEG Compression

Most photographic images are stored in the JPEG format, a lossy compression technique first standardized in 1992. The JPEG algorithm divides an image into 8x8 blocks, converts them into the frequency domain using a discrete cosine transform, and quantizes the result. Many components are zeroed out under this approach, so the resulting bit string can be efficiently compressed via Huffman coding.
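The per-block steps described above can be sketched in a few lines of Matlab. This is an illustration only, not the full codec: the 8x8 block is synthetic, the single step size q is an assumed stand-in for a real quantization table (which assigns a different step to each frequency), and the DCT matrix is built by hand so that no toolbox is required.

```matlab
% Sketch of JPEG's per-block transform and quantization (illustrative values only).
N = 8;
[k, n] = ndgrid(0:N-1, 0:N-1);                     % k = frequency index (rows), n = sample index (columns)
D = sqrt(2/N) * cos(pi * (2*n + 1) .* k / (2*N));  % DCT-II basis vectors, one per row
D(1, :) = D(1, :) / sqrt(2);                       % rescale the DC row so that D is orthonormal

block = repmat(100 + 4*(0:N-1), N, 1);             % hypothetical smooth block: a horizontal ramp
coef  = D * (block - 128) * D';                    % level shift, then 2-D DCT
q     = 16;                                        % assumed uniform step; real tables vary by frequency
quant = round(coef / q);                           % quantized coefficients, as stored in the file

fprintf('%d of %d coefficients are nonzero after quantization\n', nnz(quant), N*N);

recon = D' * (quant * q) * D + 128;                % decoder side: dequantize and invert the DCT
fprintf('largest pixel error: %.2f\n', max(abs(recon(:) - block(:))));
```

Because the ramp varies only horizontally, every coefficient outside the first row of quant is zero, which is the kind of sparsity that the entropy coder exploits.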

The quantization step size depends both on the frequency component and on the level of compression desired. Thus recompression of an image that has already been compressed once will likely use a different set of quantization bins than the first time through. This results in a detectable effect in the final image encoding -- either some quantization levels will be entirely empty (if the second quantization used finer bins) or some will be doubly filled (if the second quantization used coarser bins). Either of these effects will be evident on inspection of the quantization histogram, even though the differences in the final image will be undetectable to the eye. The techniques described below were developed in a research paper by Babak Mahdian and Stanislav Saic.
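The empty-bin effect can be reproduced on synthetic data before looking at real files. The following is a minimal sketch under stated assumptions: the coefficients are random Gaussian values standing in for real DCT coefficients at one frequency, and the step sizes q1 and q2 are made up for the demonstration.

```matlab
% Simulate double quantization of DCT coefficients (synthetic data, assumed step sizes).
rng(0);                                   % fixed seed so the run is repeatable
c  = 40 * randn(1e5, 1);                  % stand-in for unquantized coefficients at one frequency

q1 = 5;  q2 = 3;                          % first pass coarse, second pass finer (hypothetical values)
once  = round(c / q2);                    % compressed once with step q2
twice = round(round(c / q1) * q1 / q2);   % compressed with q1, decoded, then recompressed with q2

bins   = -20:20;                          % integer quantization levels to inspect
edges  = [bins - 0.5, bins(end) + 0.5];   % one histogram bin per integer level
hOnce  = histcounts(once,  edges);
hTwice = histcounts(twice, edges);

fprintf('empty bins, quantized once:  %d\n', nnz(hOnce  == 0));
fprintf('empty bins, quantized twice: %d\n', nnz(hTwice == 0));

subplot(2, 1, 1); bar(bins, hOnce);  title('quantized once (step q2)');
subplot(2, 1, 2); bar(bins, hTwice); title('quantized twice (q1, then q2)');
```

Swapping the two step sizes (q1 = 3, q2 = 5) shows the other case: no bins are empty, but some levels receive two of the original levels and stand out as periodic peaks.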

Investigation

Consider these two images: one, two (courtesy of http://froknowsphoto.com/d4s_raw_edit/). Download both and display them. Can you see any difference? Check their metadata using imfinfo. Does that reveal any differences?

Matlab normally discards the quantization information as it decodes a JPEG file. However, the jpeg_read library function written by Phil Sallee returns the information we need.

>> im = jpeg_read('dog1.jpg')
im = 
          image_width: 4928
         image_height: 3280
     image_components: 3
    image_color_space: 2
      jpeg_components: 3
     jpeg_color_space: 3
             comments: {}
          coef_arrays: {[3280x4928 double]  [1640x2464 double]  [1640x2464 double]}
         quant_tables: {[8x8 double]  [8x8 double]}
       ac_huff_tables: [1x2 struct]
       dc_huff_tables: [1x2 struct]
      optimize_coding: 0
            comp_info: [1x3 struct]
     progressive_mode: 0
>> lum = im.coef_arrays{im.comp_info(1).component_id};

The quantized DCT coefficients for the luminance (grayscale) channel are now stored in lum. To see what they look like, we can plot a histogram of the values. (The components at zero are omitted below because they swamp the rest of the data.)

hist(nonzeros(lum),min(lum(:)):max(lum(:)));

Compare the histograms for the two images. Can you tell which one has been compressed twice?
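The histogram above pools every frequency together, although each frequency has its own entry in the 8x8 quantization table. Looking at one frequency at a time can make gaps or peaks easier to spot. The helper below is a sketch: the name freq_hist is hypothetical, and it assumes that the coefficient array is tiled in 8x8 blocks, so that coefficient (u,v) of every block sits at rows u:8:end and columns v:8:end.

```matlab
function [counts, values] = freq_hist(coefs, u, v)
% FREQ_HIST  Histogram of the quantized DCT coefficients at one frequency.
%   coefs : coefficient array such as im.coef_arrays{1} (assumed to be tiled in 8x8 blocks)
%   u, v  : row and column of the frequency inside each block, 1..8 ((1,1) is the DC term)
    sel    = coefs(u:8:end, v:8:end);            % one coefficient per block
    values = min(sel(:)):max(sel(:));            % one bin per integer level
    edges  = [values - 0.5, values(end) + 0.5];  % bin edges centered on the integers
    counts = histcounts(sel(:), edges);
end
```

For example, [n, x] = freq_hist(lum, 1, 2); bar(x, n); plots the lowest horizontal AC frequency of the luminance channel.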

In the Wild

These images were deliberately prepared from high-resolution raw camera files and compressed either once or twice. With images of unknown origin it may be more difficult to detect double compression -- or maybe not. Can you find an image on the internet that appears to have been JPEG compressed more than once? Present it along with your evidence.

To Turn In

Show your results for the two images shown above, and your conclusions. Also try it out on these pairs of images: one, two. Finally, show the downloaded image that you discovered, along with its documentation.

As a challenge, see if you can write a Matlab function that will take an image file name as its argument and automatically return a value of 1 or 0 depending on whether the file has been doubly compressed or not. Mahdian and Saic compare each quantization bin to its local neighbors, and determine whether the difference exceeds a threshold. You should experiment with this criterion to find a reasonable threshold, and you may need to ignore the bins close to zero since there is a lot of fluctuation there. Once you have something that seems to work, test it on a few images downloaded from news sites and other sources. Can you find any images that seem to have been compressed twice?
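One possible starting point for the neighbor comparison is sketched below. It is not the method from the paper, only a rough version of the criterion described above: the function name looks_recompressed, the relative-deviation measure, and the minimum-count guard are all assumptions to experiment with, and the function works on a histogram rather than on a file name.

```matlab
function flag = looks_recompressed(counts, values, thresh, skip)
% LOOKS_RECOMPRESSED  Rough neighbor test on a coefficient histogram (a sketch, not the paper's method).
%   counts, values : histogram counts and the integer level of each bin
%   thresh         : assumed relative deviation that counts as suspicious, e.g. 0.5
%   skip           : ignore bins with |value| <= skip, where counts fluctuate most
    minCount = 20;                                    % assumed floor: sparse tail bins are too noisy to judge
    counts = counts(:)';
    values = values(:)';
    mid  = counts(2:end-1);
    nbr  = (counts(1:end-2) + counts(3:end)) / 2;     % mean of the two adjacent bins
    dev  = abs(mid - nbr) ./ max(nbr, 1);             % relative deviation, guarded against division by zero
    keep = (abs(values(2:end-1)) > skip) & (nbr >= minCount);
    flag = double(any(dev(keep) > thresh));           % 1 if any retained bin deviates too much
end
```

The inputs can come from the histogram used in the Investigation section, for example [counts, values] = hist(nonzeros(lum), min(lum(:)):max(lum(:))); with skip set to at least 1, because nonzeros leaves the zero bin empty. Wrapping this in a function that reads a file name and calls jpeg_read is left as part of the challenge.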