Compression Techniques

Making files smaller than they would normally be is an important factor when transporting files across for example the Internet.  As you have seen from looking at sound and picture files these can make very large files indeed.  - A compression technique seeks to reduce the file size while still making the file usable. 

  

   

Loss-Less Compression 

Run Length Encoding

These methods seek to make a file smaller while keeping all the information so that it can be reproduced exactly.  An example of Loss-Less compression is a technique called run-length encoding.  This is often applied to a black and white image where instead of storing each pixel individually they are stored as 6 x Black, 12 X White, 3 x Black, 20 x White etc.  This usually results in a smaller file size but the original image can be recreated exactly. 

Of course we can encode binary bits using run length encoding - in fact that is what we are really doing when we say "black" or "white" above consider the following string of binary digits.

1111 1111 0011 1111 0000 0000 0011 0000 

The code for the above string would be.. 

8 x 1 : 2 x 0 : 6 x 1 : 10 x 0 : 2 x 1 : 4 x 0 

We can use this to compress a colour picture as well but it does not work well with full colour photographs - can you think why?

Dictionary Compression

Another method used for text files is called Dictionary Compression where words or part words are added to a list or dictionary and just the index number of that word / part word is stored instead of the whole word.  Because many words are repeated in a document this results (most of the time)  in a smaller file size. However, the file can be re-made by using the index sequence and retrieving the words / part words in order from the dictionary.  This can be done with whole words but also with word parts, consider how many times the three letter combination "ing" is used in the English language - if we code this as a number in a dictionary then every time we use "ing" we can replace it with a single number to look up.  The same applies to "the" - it is used as a word but also for the words  there,  they, then,  etc.   Like all lossless compression techniques Dictionary Compression identifies redundant data (i.e. data that repeats) and replaces it with a single instance of the data and a code to represent it. 

E.g.  Compress the text  "The future is something in the future not in the past" 

This is only one way of creating a dictionary compression of this text. 

The dictionary needs to be sent with the file in order to re-make the text exactly as it was previously.  

How much is the sentence above compressed? 

Ignoring spaces I have sent 21 characters in this compressed text - there were 43 characters in the original text - so you could say I saved about half the characters ! .... but I forgot to include the dictionary - that is 24 characters so in total we sent 45 characters - ah! I made the file bigger! 

What conclusions could you draw from this devastating news that this sentence became larger! 

Lossy Compression 

In these techniques, often used with images, sound or video files some of the information is removed from the file.  For example in sound any frequency above about 20,000 Hz or below about 20 Hz cannot be heard by humans so removing this information from the file will not make any difference to the perception of it.  The same applies to images where the human eye cannot see blue as well as other colours so removing some of this will not have much effect, we can also remove some detail without the average person noticing at all.  In video we can lower the frame rate and still have a watchable film. We could also only record data changes between frames rather than whole frames of video saving space but without losing perceived quality.

If you are interested in Lossy Compression you could look at the Khan Academy Page Here

Take a look at headphones adverts - see the frequency response of them.

JPEG, MP3, MPEG files are lossy compression formats.  -  They reduce the file size a lot but the original data cannot be totally re-constructed. 

For GCSE you do not need to know the details of how lossy compression works. It is enough to know that data is removed and cannot be re-instated exactly as the original file was before compression.

Loss-less -  Compression technique where the original file can be wholly re-created - EG Run Length Encoding or Dictionary Compression or Huffman Coding.

Lossy Compression - compression technique where some information is removed so the original file cannot be exactly recreated. E.G. JPEG pictures -  MPEG Video - MP3 Sound. 

Redundancy : ( Data) The number of data items in a file that are repeated.  In a Huffman Code the frequency of the letters, in Dictionary Compression words or part words that repeat.

Run Length Encoding - A process where "runs" of the same data are grouped together to make a number plus the data recorded once.  i.e. 8 x 1 followed by 10 x 0 etc. 

Dictionary Compression - A method for compressing text documents / files using a dictionary of words or part words.  Words in the sentences are replaced with index(key) values where the word or part word is in the dictionary.