Understanding Python Generators

Learn how generators work in Python, when to use them, and how they can make your code more memory efficient.

What Is a Generator?

A generator function is a function that returns a generator object, which is a kind of iterator. Instead of computing all values at once and storing them in memory, generators produce values one at a time, on demand.

def count_up_to(n):
    i = 1
    while i <= n:
        yield i
        i += 1

for num in count_up_to(5):
    print(num)
# 1, 2, 3, 4, 5

The yield keyword is what makes this a generator function. Each time yield is encountered, the function pauses and produces a value. When the next value is requested, execution resumes right where it left off.
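One consequence worth noting: calling a generator function does not run any of its body. Execution only starts when you request the first value. A minimal sketch (the function name here is just for illustration):

```python
def greet():
    # This print only runs once iteration begins.
    print("running")
    yield 1

g = greet()      # No output yet: the body has not executed.
value = next(g)  # Prints "running", then yields 1.
print(value)     # 1
```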

Generators vs Lists

Consider generating the first million square numbers:

# List approach - stores all values in memory
squares_list = [x**2 for x in range(1_000_000)]

# Generator approach - produces values on demand
squares_gen = (x**2 for x in range(1_000_000))

The list's pointer array alone is roughly 8 MB (sys.getsizeof reports about 8.4 MB, not counting the int objects it references, which take additional space). The generator object stays at a few hundred bytes, regardless of how many values it can produce.
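You can check the difference yourself with sys.getsizeof. A rough sketch (exact byte counts vary between Python versions, and getsizeof measures only the container, not the int objects inside):

```python
import sys

# Eager: the full list is materialized in memory.
squares_list = [x**2 for x in range(1_000_000)]

# Lazy: only a small generator object exists until you iterate.
squares_gen = (x**2 for x in range(1_000_000))

print(sys.getsizeof(squares_list))  # several megabytes
print(sys.getsizeof(squares_gen))   # a few hundred bytes
```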

The yield Keyword

When Python encounters yield, it:

  1. Returns the yielded value to the caller
  2. Suspends the function’s state (local variables, instruction pointer)
  3. Resumes from exactly that point on the next next() call

def simple_generator():
    print("First")
    yield 1
    print("Second")
    yield 2
    print("Third")
    yield 3

gen = simple_generator()
print(next(gen))  # Prints "First", returns 1
print(next(gen))  # Prints "Second", returns 2
print(next(gen))  # Prints "Third", returns 3
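Once the body runs out of yield statements, the next call to next() raises StopIteration; for loops catch this exception automatically, which is how they know when to stop. You can also pass next() a default value to avoid the exception. A small self-contained sketch:

```python
def pair():
    yield 'a'
    yield 'b'

gen = pair()
print(next(gen))          # 'a'
print(next(gen))          # 'b'
print(next(gen, 'done'))  # 'done' -- the default, instead of StopIteration

try:
    next(gen)             # No default this time, so the exception surfaces.
except StopIteration:
    print("exhausted")
```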

Generator Expressions

Just like list comprehensions, Python has generator expressions:

# List comprehension (eager)
evens = [x for x in range(100) if x % 2 == 0]

# Generator expression (lazy)
evens = (x for x in range(100) if x % 2 == 0)

Use generator expressions when you only need to iterate once, especially over large datasets.
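When a generator expression is the sole argument to a function, you can even drop the extra parentheses, which makes lazy aggregation very concise:

```python
# Sum of squares of 0..9, computed lazily -- no intermediate list is built.
total = sum(x**2 for x in range(10))
print(total)  # 285

# any() short-circuits: it stops consuming the generator at the first match,
# so the huge range is never fully traversed.
has_big_even = any(x % 2 == 0 for x in range(1, 10**9) if x > 5)
print(has_big_even)  # True
```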

Practical Example: Reading Large Files

Generators shine when processing data that does not fit in memory:

def read_large_file(file_path):
    with open(file_path, 'r') as f:
        for line in f:
            yield line.strip()

# Process a multi-gigabyte log file line by line
for line in read_large_file('huge_log.txt'):
    if 'ERROR' in line:
        print(line)

This processes the file one line at a time, never loading the entire file into memory.

Chaining Generators

You can compose generators to build data processing pipelines:

def read_lines(path):
    with open(path) as f:
        for line in f:
            yield line.strip()

def filter_errors(lines):
    for line in lines:
        if 'ERROR' in line:
            yield line

def extract_timestamps(lines):
    for line in lines:
        yield line.split(' ')[0]

# Pipeline
lines = read_lines('app.log')
errors = filter_errors(lines)
timestamps = extract_timestamps(errors)

for ts in timestamps:
    print(ts)

Each generator processes one item at a time. The entire pipeline uses constant memory regardless of file size.
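The same pattern works on any iterable, not just files. Here is a self-contained sketch of the pipeline above run over in-memory sample lines (the log lines are made up for illustration):

```python
def filter_errors(lines):
    for line in lines:
        if 'ERROR' in line:
            yield line

def extract_timestamps(lines):
    for line in lines:
        yield line.split(' ')[0]

sample = [
    "2024-01-01T10:00:00 INFO service started",
    "2024-01-01T10:05:12 ERROR disk full",
    "2024-01-01T10:06:30 ERROR retry failed",
]

timestamps = extract_timestamps(filter_errors(sample))
print(list(timestamps))  # ['2024-01-01T10:05:12', '2024-01-01T10:06:30']
```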

Key Takeaways

  1. Generators produce values lazily using yield
  2. They are memory-efficient for large datasets
  3. Generator expressions use () instead of []
  4. Generators can be chained into processing pipelines
  5. Use generators when you iterate once over large or infinite sequences
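The last point deserves a sketch: because generators are lazy, they can represent infinite sequences, and itertools.islice lets you take a finite slice without ever exhausting them:

```python
import itertools

def naturals():
    """Yield 1, 2, 3, ... forever."""
    n = 1
    while True:
        yield n
        n += 1

# Take just the first five values of an infinite sequence.
first_five = list(itertools.islice(naturals(), 5))
print(first_five)  # [1, 2, 3, 4, 5]
```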