String Compression with Run-Length Encoding

mediumPython

Lesson

Run-Length Encoding for Data Compression

Run-length encoding (RLE) is one of the simplest forms of data compression. It works by replacing sequences of repeated data with a count and the data value. This technique is particularly effective when data contains many consecutive repeated elements.

The basic idea is to scan through data sequentially and count how many times each element appears consecutively. Instead of storing 'aaaa', we store 'a4' - the character followed by its count. This can dramatically reduce storage space when there are long runs of repeated data.

RLE is commonly used in image compression (especially for simple graphics with large areas of solid color), fax transmission, and as a preprocessing step in more complex compression algorithms. It's also used in some database storage engines to compress columns with many repeated values.

The algorithm follows a simple pattern: track the current character and its count, then when you encounter a different character, output the previous character and its count before starting to track the new character. The tricky part is remembering to handle the final group after your loop ends.

One important consideration is that compression isn't always beneficial. If your data has mostly unique elements, RLE can actually make it larger! For example, 'abcd' would become 'a1b1c1d1e1' - twice as long. Good compression algorithms check if the compressed result is actually smaller before using it.

RLE teaches us about the trade-offs in algorithm design: simplicity versus efficiency, and the importance of measuring results rather than assuming an optimization actually optimizes.

Example
1def count_consecutive_chars(text): 2 """Count consecutive characters in a string""" 3 if not text: 4 return [] 5 6 result = [] 7 current_char = text[0] 8 count = 1 9 10 for char in text[1:]: 11 if char == current_char: 12 count += 1 13 else: 14 result.append((current_char, count)) 15 current_char = char 16 count = 1 17 18 result.append((current_char, count)) # Don't forget the last group! 19 return result 20 21# Example usage 22print(count_consecutive_chars("aaabbcc")) # [('a', 3), ('b', 2), ('c', 2)]
L7Start tracking from the first character
L12When character changes, save the previous group and start counting the new one
L17Critical: handle the final group after the loop ends

Key Takeaways

  • •Run-length encoding compresses by replacing consecutive repeated elements with count + element pairs
  • •Always verify that compression actually reduces size - sometimes the 'compressed' version is larger
  • •Remember to handle the final group of elements after your main processing loop
Loading...