Python Strings Internals Practice Problems & Exercises

Practice: Strings Internals and Immutability

12 problems4 Easy5 Medium3 Hard⏱ 45–60 min

Easy

#1Immutability WallEasy

stringsimmutabilityTypeError

Demonstrate string immutability. The code below tries to change the first character of a string. It will fail. Catch the error, then fix it by creating a new string with the desired change.

Python

s = "Python"

# Step 1: Try to assign to index 0 — catch the TypeError
try:
    s[0] = "H"
except TypeError as e:
    print(f"TypeError caught: {e}")

# Step 2: Create a new string with 'H' replacing the first character
fixed = "H" + s[1:]
print(f"Fixed: {fixed}")

Solution

s = "Python"

try:
    s[0] = "H"
except TypeError as e:
    print(f"TypeError caught: {e}")

fixed = "H" + s[1:]
print(f"Fixed: {fixed}")

Why it works: Strings are immutable sequences in Python. Every "modification" actually creates a brand new string object. The original s still points to "Python" — it was never changed. This is a fundamental design choice: immutability enables string interning, hash caching, and thread safety.

Expected Output

TypeError caught: 'str' object does not support item assignment\nFixed: Hython

Hints

Hint 1: Strings in Python are immutable — you cannot assign to an index.

Hint 2: To "change" a string, you must create a new one using slicing or concatenation.

#2Interning InspectorEasy

stringsinterningis-operator

Predict the output of each is comparison. Think about which strings CPython interns automatically.

Python

a = "hello"
b = "hello"
print(f"a is b: {a is b}")

c = "hello_world"
d = "hello_world"
print(f"c is d: {c is d}")

e = "hello world"
f = "hello world"
print(f"e is f: {e is f}")

Solution

a is b: True
c is d: True
e is f: False

a is b is True: "hello" is a short string that looks like a valid identifier — CPython interns it at compile time, so both names point to the same object.
c is d is True: "hello_world" also looks like a valid identifier (only letters, digits, underscores). CPython interns these too.
e is f is False: "hello world" contains a space — it does not look like an identifier, so CPython creates two separate string objects.

Key insight: String interning is a CPython implementation detail, not a language guarantee. Never use is to compare string values — always use ==.

Expected Output

a is b: True\nc is d: True\ne is f: False

Hints

Hint 1: CPython interns string literals that look like valid Python identifiers.

Hint 2: Strings with spaces or special characters are generally NOT interned automatically.

#3Multiply and ConcatenateEasy

stringsmultiplicationconcatenation

Predict the output of each expression. Pay attention to what * and + do with strings.

Python

print("-" * 10)
print("***" + "ALERT" + "***")
print(("ab " * 3))
print(len("hi" * 4))

Solution

----------
***ALERT***
ab ab ab
8

"-" * 10 repeats the dash character 10 times.
"***" + "ALERT" + "***" concatenates three strings into one new string.
"ab " * 3 repeats "ab " (including the trailing space) three times.
len("hi" * 4) first creates "hihihihi" (8 characters), then returns its length.

Key insight: Every * and + operation allocates a new string object. The original strings are never modified. This matters for performance when you do it in a loop (see Problem 5).

Expected Output

----------\n***ALERT***\nab ab ab \n8

Hints

Hint 1: String `*` repeats the string N times. String `+` concatenates.

Hint 2: Every operation creates a new string object — the originals are unchanged.

#4F-String Formatting BasicsEasy

stringsf-stringsformatting

Write a function format_report(name, score, max_score) that returns a formatted string using f-strings. The percentage must always display exactly 2 decimal places.

print(format_report("Alice", 87, 100))
print(format_report("Bob", 45, 100))
print(format_report("Carol", 100, 100))

Solution

def format_report(name, score, max_score):
    pct = (score / max_score) * 100
    return f"Student: {name} | Score: {score}/{max_score} ({pct:.2f}%)"

Why it works: F-strings (formatted string literals, PEP 498) evaluate expressions inside {} at runtime and embed the result. The :.2f format specification means "floating-point with exactly 2 decimal places." F-strings are the fastest string formatting method in CPython because they compile to efficient bytecode — faster than % formatting or .format().

Starter Code

def format_report(name, score, max_score):
    """Return a formatted string using f-strings.
    Format: 'Student: Alice | Score: 87/100 (87.00%)'
    The percentage should always show exactly 2 decimal places.
    """
    pass

Expected Output

Student: Alice | Score: 87/100 (87.00%)\nStudent: Bob | Score: 45/100 (45.00%)\nStudent: Carol | Score: 100/100 (100.00%)

Hints

Hint 1: Use f-strings with expressions inside curly braces: f"{variable}".

Hint 2: For fixed decimal places, use format spec: f"{value:.2f}" gives 2 decimal places.

Medium

#5Concatenation vs Join — The O(n^2) TrapMedium

stringsperformancejoinbig-o

Implement build_with_join(n) to produce the same output as build_with_concat(n), but using str.join(). Then observe how the performance gap widens as n grows — this is the classic O(n^2) vs O(n) string building trap.

Solution

import time

def build_with_concat(n):
    result = ""
    for i in range(n):
        result += str(i) + ","
    return result

def build_with_join(n):
    return ",".join(str(i) for i in range(n)) + ","

for size in [1000, 10000, 50000]:
    start = time.perf_counter()
    build_with_concat(size)
    concat_time = time.perf_counter() - start

    start = time.perf_counter()
    build_with_join(size)
    join_time = time.perf_counter() - start

    ratio = concat_time / join_time if join_time > 0 else float('inf')
    print(f"n={size:>6}: concat={concat_time:.4f}s  join={join_time:.4f}s  ratio={ratio:.1f}x")

Why += is O(n^2): Each result += str(i) + "," creates a new string and copies the entire previous content. After n iterations, you have copied approximately 1 + 2 + 3 + ... + n = n(n+1)/2 characters total — that is O(n^2).

Why join is O(n): str.join() first iterates the generator to collect all pieces, pre-calculates the total length needed, allocates one buffer, and copies each piece exactly once. Total work: O(n).

Note: CPython has an optimization that can sometimes make += appear faster for small strings (it tries to resize in-place if the refcount is 1), but this is fragile and should never be relied upon. Always use join for building strings from many pieces.

Starter Code

import time

def build_with_concat(n):
    """Build a string of n numbers using += concatenation."""
    result = ""
    for i in range(n):
        result += str(i) + ","
    return result

def build_with_join(n):
    """Build the same string using str.join()."""
    pass

# Compare performance
for size in [1000, 10000, 50000]:
    start = time.perf_counter()
    build_with_concat(size)
    concat_time = time.perf_counter() - start

    start = time.perf_counter()
    build_with_join(size)
    join_time = time.perf_counter() - start

    ratio = concat_time / join_time if join_time > 0 else float('inf')
    print(f"n={size:>6}: concat={concat_time:.4f}s  join={join_time:.4f}s  ratio={ratio:.1f}x")

Expected Output

n=  1000: concat=0.0005s  join=0.0002s  ratio=2.5x\nn= 10000: concat=0.0080s  join=0.0015s  ratio=5.3x\nn= 50000: concat=0.1200s  join=0.0070s  ratio=17.1x\n(Exact times vary — the ratio should grow with n)

Hints

Hint 1: str.join() takes an iterable of strings and joins them with the separator.

Hint 2: The concat approach is O(n^2) because each += creates a new string and copies all previous characters.

Hint 3: join pre-calculates total length and copies once — O(n).

#6Encode/Decode RoundtripMedium

stringsencodingutf-8bytes

Demonstrate a full encode/decode roundtrip. Encode a string containing ASCII, CJK characters, and emoji to UTF-8 bytes, then decode it back. Show that byte length differs from character length.

Python

text = "Hello, 世界! 🌍"

# Show original string and its type
print(f"Original:  {text}")
print(f"Type:      {type(text)}")

# Encode to UTF-8 bytes
encoded = text.encode("utf-8")
print(f"Encoded:   {encoded}")
print(f"Type:      {type(encoded)}")

# Compare byte length vs character length
print(f"Byte len:  {len(encoded)}")
print(f"Char len:  {len(text)}")

# Decode back to str
decoded = encoded.decode("utf-8")
print(f"Decoded:   {decoded}")
print(f"Roundtrip: {text == decoded}")

Solution

Original:  Hello, 世界! 🌍
Type:      <class 'str'>
Encoded:   b'Hello, \xe4\xb8\x96\xe7\x95\x8c! \xf0\x9f\x8c\x8d'
Type:      <class 'bytes'>
Byte len:  18
Char len:  12
Decoded:   Hello, 世界! 🌍
Roundtrip: True

Byte count breakdown:

Hello, — 7 ASCII characters = 7 bytes (1 byte each in UTF-8)
世界 — 2 CJK characters = 6 bytes (3 bytes each in UTF-8)
! — 2 ASCII characters = 2 bytes
🌍 — 1 emoji = 4 bytes (UTF-8 needs 4 bytes for codepoints above U+FFFF)
Total: 7 + 6 + 2 + 4 - 1 (the space is counted in "! ") = 18 bytes, but only 12 characters

Key insight: In Python 3, str is always Unicode text. bytes is raw binary data. You must encode/decode explicitly — Python 3 eliminated the implicit conversions that caused countless bugs in Python 2.

Expected Output

Original:  Hello, 世界! 🌍\nType:      <class 'str'>\nEncoded:   b'Hello, \\xe4\\xb8\\x96\\xe7\\x95\\x8c! \\xf0\\x9f\\x8c\\x8d'\nType:      <class 'bytes'>\nByte len:  18\nChar len:  12\nDecoded:   Hello, 世界! 🌍\nRoundtrip: True

Hints

Hint 1: Use .encode("utf-8") to go from str to bytes, and .decode("utf-8") to go back.

Hint 2: UTF-8 uses variable-length encoding: ASCII = 1 byte, CJK = 3 bytes, emoji = 4 bytes.

#7String Builder PatternMedium

stringsperformanceioStringIO

Implement build_html_table using io.StringIO — the Python equivalent of Java's StringBuilder. This avoids the O(n^2) concatenation trap by writing pieces to a buffer and extracting the final string at the end.

headers = ["Name", "Age", "City"]
rows = [["Alice", "30", "NYC"], ["Bob", "25", "LA"]]
print(build_html_table(headers, rows))

Solution

import io

def build_html_table(headers, rows):
    buf = io.StringIO()
    buf.write("<table>\n")

    # Header row
    buf.write("<tr>")
    for h in headers:
        buf.write(f"<th>{h}</th>")
    buf.write("</tr>\n")

    # Data rows
    for row in rows:
        buf.write("<tr>")
        for cell in row:
            buf.write(f"<td>{cell}</td>")
        buf.write("</tr>\n")

    buf.write("</table>")
    return buf.getvalue()

Why StringIO over +=: io.StringIO maintains an internal buffer that grows efficiently (similar to how list.append works with over-allocation). You write pieces to it without creating intermediate string copies, then call .getvalue() once at the end.

When to use which approach:

Few pieces → f-string or + (clearest, fast enough)
Many pieces in a loop → str.join() with a list (simplest efficient pattern)
Complex multi-line building → io.StringIO (most flexible, allows write() calls anywhere)

Starter Code

import io

def build_html_table(headers, rows):
    """Build an HTML table string efficiently using io.StringIO.
    
    Args:
        headers: list of column header strings
        rows: list of lists (each inner list is a row of cell values)
    Returns:
        Complete HTML table as a string
    """
    pass

Expected Output

<table>\n<tr><th>Name</th><th>Age</th><th>City</th></tr>\n<tr><td>Alice</td><td>30</td><td>NYC</td></tr>\n<tr><td>Bob</td><td>25</td><td>LA</td></tr>\n</table>

Hints

Hint 1: io.StringIO acts like a file you write strings to, then call .getvalue() for the result.

Hint 2: This avoids creating intermediate string objects on every concatenation.

Hint 3: The pattern: create buffer, write pieces, get final string.

#8Memory Footprint ExplorerMedium

stringsmemorysys.getsizeofinternals

Explore how Python strings use memory. Use sys.getsizeof to observe the base overhead and per-character cost for different string types.

Python

import sys

# Empty string — just the object overhead
empty = ""
print(f"Empty string:     {sys.getsizeof(empty)} bytes")

# ASCII strings of increasing length
for n in [1, 10, 100, 1000]:
    s = "a" * n
    size = sys.getsizeof(s)
    per_char = (size - sys.getsizeof(empty)) / n
    print(f"ASCII  len={n:>4}:   {size:>5} bytes  ({per_char:.1f} bytes/char)")

# Strings with characters from different Unicode planes
ascii_str = "hello"           # Latin-1 range (U+0000–U+00FF)
euro_str  = "h\u00e9llo"      # Latin-1 (accented e)
cjk_str   = "h\u4e16llo"     # BMP range (U+0100–U+FFFF)
emoji_str = "h\U0001f600llo"  # Supplementary plane (U+10000+)

print(f"\nASCII  '{ascii_str}': {sys.getsizeof(ascii_str)} bytes")
print(f"Latin1 '{euro_str}': {sys.getsizeof(euro_str)} bytes")
print(f"BMP    '{cjk_str}': {sys.getsizeof(cjk_str)} bytes")
print(f"SMP    '{emoji_str}': {sys.getsizeof(emoji_str)} bytes")

Solution

Typical output on CPython 3.11+ (64-bit):

Empty string:     49 bytes
ASCII  len=   1:      50 bytes  (1.0 bytes/char)
ASCII  len=  10:      59 bytes  (1.0 bytes/char)
ASCII  len= 100:     149 bytes  (1.0 bytes/char)
ASCII  len=1000:    1049 bytes  (1.0 bytes/char)

ASCII  'hello': 54 bytes
Latin1 'hello': 54 bytes
BMP    'h世llo': 78 bytes
SMP    'h😀llo': 92 bytes

What is happening internally: CPython 3.3+ uses PEP 393 "flexible string representation." Every string is stored using the narrowest encoding that fits all its characters:

Latin-1 (1 byte/char): all codepoints at or below U+00FF
UCS-2 (2 bytes/char): any codepoint above U+00FF but below U+FFFF
UCS-4 (4 bytes/char): any codepoint above U+FFFF (emoji, rare CJK, etc.)

The entire string is stored at the widest level needed. One emoji in a 1000-character ASCII string forces all 1000 characters to use 4 bytes each. This is why mixing ASCII and emoji can unexpectedly quadruple memory usage.

Expected Output

(Output varies by platform — see solution for typical values)

Hints

Hint 1: sys.getsizeof returns the size of the object in bytes, including the overhead.

Hint 2: CPython uses different internal representations depending on the max codepoint: Latin-1 (1 byte/char), UCS-2 (2 bytes/char), UCS-4 (4 bytes/char).

#9Slicing Creates New ObjectsMedium

stringsslicingidentityid

Prove that string slicing creates new objects by comparing id() values. Investigate whether a full slice s[:] behaves differently from a partial slice.

Python

s = "Engineering the Text Layer"

# Full slice
full_slice = s[:]
print(f"Original id:       {id(s)}")
print(f"Full slice id:     {id(full_slice)}")
print(f"Same object?       {s is full_slice}")

# Partial slice
sub = s[0:11]  # "Engineering"
print(f"Substring id:      {id(sub)}")
print(f"Same object?       {s is sub}")
print(f"Content equal?     {s[0:11] == sub}")

Solution

Original id:       (some integer, e.g., 140234567890)
Full slice id:     (same integer)
Same object?       True
Substring id:      (different integer)
Same object?       False
Content equal?     True

What happens internally:

Full slice s[:]: CPython optimizes this — since strings are immutable, there is no reason to copy. It returns the same object, so s[:] is s is True.
Partial slice s[0:11]: This must create a new string object because it represents different content. A new str object is allocated, the characters are copied, and a new id() is assigned.

Why this matters: Unlike lists (where lst[:] creates a shallow copy), strings skip the copy when the full slice is taken. This is safe precisely because strings are immutable — no one can mutate the "copy" and accidentally affect the "original."

Warning: This is a CPython implementation detail. Other Python implementations (PyPy, Jython) may behave differently. Never write code that depends on s[:] is s being True.

Expected Output

Original id:       (some integer)\nFull slice id:     (same or different — see solution)\nSame object?       True\nSubstring id:      (different integer)\nSame object?       False\nContent equal?     True

Hints

Hint 1: Use id() to get the memory address of an object.

Hint 2: A full slice s[:] of a string may return the same object (CPython optimization).

Hint 3: A substring slice always creates a new string object.

Hard

#10String Intern Cache SimulatorHard

stringsinterningsimulationdesign

Build an InternPool class that simulates CPython's string interning. It should deduplicate strings so that equal content always returns the same object, and track hit/miss statistics.

import sys

pool = InternPool()

# Without interning — two separate objects
a = "".join(["h", "e", "l", "l", "o"])
b = "".join(["h", "e", "l", "l", "o"])
print(f"a is b: {a is b}  (before interning)")

# With interning — same object returned
a2 = pool.intern(a)
b2 = pool.intern(b)
print(f"a2 is b2: {a2 is b2}  (after interning, same object)")

# Intern more strings
pool.intern("world")
pool.intern("hello")  # hit — already in pool
pool.intern("world")  # hit — already in pool

print(f"Pool size: {pool.pool_size()}")
print(f"Stats: {pool.stats()}")
print(f"Memory saved: {pool.memory_saved()}")

Solution

import sys

class InternPool:
    def __init__(self):
        self._pool = {}
        self._stats = {"hits": 0, "misses": 0}

    def intern(self, s: str) -> str:
        if s in self._pool:
            self._stats["hits"] += 1
            return self._pool[s]
        else:
            self._stats["misses"] += 1
            self._pool[s] = s
            return s

    def is_interned(self, s: str) -> bool:
        return s in self._pool

    def pool_size(self) -> int:
        return len(self._pool)

    def stats(self) -> dict:
        return dict(self._stats)

    def memory_saved(self) -> str:
        if not self._pool:
            return "~0 bytes"
        # Each hit avoids allocating a duplicate string object.
        # Estimate average string size from pool contents.
        total_size = sum(sys.getsizeof(s) for s in self._pool)
        avg_size = total_size / len(self._pool)
        saved = int(self._stats["hits"] * avg_size)
        return f"~{saved} bytes"

pool = InternPool()

a = "".join(["h", "e", "l", "l", "o"])
b = "".join(["h", "e", "l", "l", "o"])
print(f"a is b: {a is b}  (before interning)")

a2 = pool.intern(a)
b2 = pool.intern(b)
print(f"a2 is b2: {a2 is b2}  (after interning, same object)")

pool.intern("world")
pool.intern("hello")  # hit
pool.intern("world")  # hit

print(f"Pool size: {pool.pool_size()}")
print(f"Stats: {pool.stats()}")
print(f"Memory saved: {pool.memory_saved()}")

How real CPython interning works:

sys.intern(s) adds a string to the global intern table. Future lookups of the same content return the same object.
CPython automatically interns string literals that look like identifiers (used for attribute names, variable names, dict keys).
The intern table is a regular dict — O(1) lookup. The trade-off: the table itself uses memory, but you save memory when many references point to the same content.

When to use sys.intern() in real code: When you have thousands of identical strings (e.g., parsing CSV column names, XML tag names, JSON keys) and want to reduce memory. Interning trades lookup time for memory savings.

Starter Code

class InternPool:
    """Simulate Python's string interning mechanism.
    
    When a string is interned, future requests for the same
    content return the SAME object (same id), saving memory.
    """

    def __init__(self):
        self._pool = {}  # content -> interned string object
        self._stats = {"hits": 0, "misses": 0}

    def intern(self, s: str) -> str:
        """Intern a string. Return the canonical version.
        If already interned, return the existing object (cache hit).
        If new, store it and return it (cache miss).
        """
        pass

    def is_interned(self, s: str) -> bool:
        """Check if a string with this content is in the pool."""
        pass

    def pool_size(self) -> int:
        """Return number of unique strings in the pool."""
        pass

    def stats(self) -> dict:
        """Return hit/miss statistics."""
        pass

    def memory_saved(self) -> str:
        """Estimate memory saved by interning (based on hit count
        and average string size in the pool)."""
        pass

Expected Output

a is b: False  (before interning)\na2 is b2: True  (after interning, same object)\nPool size: 3\nStats: {'hits': 3, 'misses': 3}\nMemory saved: ~174 bytes

Hints

Hint 1: The pool is a dict mapping string content to the canonical string object.

Hint 2: On intern(): if the content exists in the pool, return the pooled object (hit). Otherwise, store it (miss).

Hint 3: For memory_saved, each hit avoids allocating a new string — estimate using sys.getsizeof.

#11Unicode-Aware Character CounterHard

stringsunicodecombining-charactersgrapheme

Build a Unicode-aware character counter. Python's len() counts codepoints, not visible characters. When combining characters are used (like accent marks), len() overcounts. Your function should count grapheme clusters — what the user actually sees on screen.

tests = [
    "caf\u0065\u0301",    # cafe + combining acute accent = café
    "\u006e\u0303a",      # n + combining tilde + a = ña
    "\u0065\u0301\u0065\u0308",  # e+acute, e+diaeresis = éë
]

for text in tests:
    result = analyze_string(text)
    print(f"Text: {result['raw']!r} displays as: {result['display']}")
    print(f"Codepoints: {result['codepoints']}")
    print(f"Graphemes:  {result['graphemes']}")
    print(f"Bytes (UTF-8): {result['utf8_bytes']}")
    print("---")

Solution

import unicodedata

def count_graphemes(text: str) -> int:
    if not text:
        return 0

    count = 0
    for i, char in enumerate(text):
        # A character starts a new grapheme cluster if:
        # - it's the first character, OR
        # - it's NOT a combining character
        if i == 0 or unicodedata.combining(char) == 0:
            count += 1
    return count

def analyze_string(text: str) -> dict:
    return {
        "raw": text,
        "display": unicodedata.normalize("NFC", text),
        "codepoints": len(text),
        "graphemes": count_graphemes(text),
        "utf8_bytes": len(text.encode("utf-8")),
    }

tests = [
    "caf\u0065\u0301",
    "\u006e\u0303a",
    "\u0065\u0301\u0065\u0308",
]

for text in tests:
    result = analyze_string(text)
    print(f"Text: {result['raw']!r} displays as: {result['display']}")
    print(f"Codepoints: {result['codepoints']}")
    print(f"Graphemes:  {result['graphemes']}")
    print(f"Bytes (UTF-8): {result['utf8_bytes']}")
    print("---")

Three levels of "length" in Unicode:

Bytes (len(s.encode("utf-8"))) — how much space on disk/network
Codepoints (len(s)) — number of Unicode codepoints
Grapheme clusters (count_graphemes(s)) — what the user sees

Why this matters in production:

Database VARCHAR(N) limits may count bytes or codepoints — they differ for non-ASCII text.
Text truncation must happen at grapheme boundaries. Cutting in the middle of a combining sequence produces garbled output.
unicodedata.normalize("NFC", text) can collapse base + combining character into a single precomposed codepoint (e.g., e + \u0301 becomes \u00e9), which is why normalization should happen at system boundaries.

Limitation: This simplified approach does not handle all grapheme cluster edge cases (emoji with zero-width joiners, regional indicator sequences, etc.). For production use, the grapheme or regex library with \X provides full UAX #29 grapheme segmentation.

Starter Code

import unicodedata

def count_graphemes(text: str) -> int:
    """Count user-perceived characters (grapheme clusters), not codepoints.
    
    For example, 'e\u0301' (e + combining acute accent) is ONE visible
    character but TWO Unicode codepoints.
    
    Approach: iterate through codepoints and check if each is a
    combining character (unicodedata.combining() > 0). A combining
    character merges with the preceding base character.
    """
    pass

def analyze_string(text: str) -> dict:
    """Return a detailed breakdown of a Unicode string."""
    pass

Expected Output

Text: 'cafe\\u0301' displays as: café\nCodepoints: 5\nGraphemes:  4\nBytes (UTF-8): 6\n---\nText: 'n\\u0303a' displays as: ña\nCodepoints: 3\nGraphemes:  2\nBytes (UTF-8): 4\n---\nText: 'e\\u0301e\\u0308' displays as: éë\nCodepoints: 4\nGraphemes:  2\nBytes (UTF-8): 6

Hints

Hint 1: unicodedata.combining(char) returns 0 for base characters and nonzero for combining marks.

Hint 2: A grapheme cluster = one base character + zero or more combining characters.

Hint 3: Be careful: the first character is always a grapheme start, even if combining() returns nonzero.

#12Efficient Substring Search EngineHard

stringssearchalgorithmsperformance

Build a substring search engine using only built-in str methods (no re module). It must handle case-insensitive search, overlapping matches, multi-line text, and provide context around each match.

# Test 1: Multi-line case-insensitive search
text1 = """The fox jumped over the lazy dog.
The quick brown fox.
The dog jumps the fence."""

search_summary(text1, "the", case_sensitive=False)

print("---")

# Test 2: Overlapping matches
text2 = "banana banana"
search_summary(text2, "ana", case_sensitive=True)
print("(Overlapping matches detected)")

Solution

def search_all(text: str, pattern: str, case_sensitive: bool = True) -> list:
    if not pattern:
        return []

    results = []

    # For case-insensitive search, work on lowered copies
    search_text = text if case_sensitive else text.lower()
    search_pattern = pattern if case_sensitive else pattern.lower()

    # Pre-compute line start positions for fast line number lookup
    line_starts = [0]
    for i, ch in enumerate(text):
        if ch == "\n":
            line_starts.append(i + 1)

    def get_line_and_col(pos):
        # Binary search for the line containing position `pos`
        lo, hi = 0, len(line_starts) - 1
        while lo < hi:
            mid = (lo + hi + 1) // 2
            if line_starts[mid] <= pos:
                lo = mid
            else:
                hi = mid - 1
        line_num = lo + 1  # 1-indexed
        col = pos - line_starts[lo] + 1  # 1-indexed
        return line_num, col

    # Find all occurrences (overlapping allowed)
    start = 0
    plen = len(search_pattern)
    ctx_window = 10

    while start <= len(search_text) - plen:
        pos = search_text.find(search_pattern, start)
        if pos == -1:
            break

        line_num, col = get_line_and_col(pos)

        # Extract context (from original text, not lowered)
        ctx_start = max(0, pos - ctx_window)
        ctx_end = min(len(text), pos + plen + ctx_window)
        context = text[ctx_start:ctx_end].replace("\n", " ")

        results.append({
            "position": pos,
            "line": line_num,
            "col": col,
            "context": context,
        })

        # Advance by 1 to allow overlapping matches
        start = pos + 1

    return results

def search_summary(text: str, pattern: str, case_sensitive: bool = True) -> None:
    results = search_all(text, pattern, case_sensitive)
    print(f"Pattern: '{pattern}' ({'case-sensitive' if case_sensitive else 'case-insensitive'})")
    print(f"Found: {len(results)} match(es)")
    print()
    for r in results:
        ctx = r["context"]
        print(f"  Line {r['line']:>3}, Col {r['col']:>3}: ...{ctx}...")

text1 = """The fox jumped over the lazy dog.
The quick brown fox.
The dog jumps the fence."""

search_summary(text1, "the", case_sensitive=False)

print("---")

text2 = "banana banana"
search_summary(text2, "ana", case_sensitive=True)
print("(Overlapping matches detected)")

Performance analysis:

str.find() uses a variant of the Boyer-Moore-Horspool algorithm internally (CPython uses a mix of Boyer-Moore and Horspool). Average case is O(n/m) where n is text length and m is pattern length.
The overlapping search advances by 1 after each match, so worst case for a pathological pattern like "aaa" in "aaaa...a" is O(n*m).
Pre-computing line starts and using binary search for line number lookup makes each position lookup O(log L) where L is the number of lines, instead of O(n) for counting newlines each time.

Why not regex? For simple literal substring search, str.find() is faster than re.search() because it avoids the overhead of compiling a regex pattern and running the NFA/DFA engine. Use regex when you need pattern matching (wildcards, character classes, quantifiers). Use str.find() when you know the exact substring.

Starter Code

def search_all(text: str, pattern: str, case_sensitive: bool = True) -> list:
    """Find ALL occurrences of pattern in text.
    Return list of dicts with position, line number, and context.
    
    Must handle:
    - Case-insensitive search
    - Overlapping matches
    - Multi-line text with line number tracking
    - Context window (10 chars before/after match)
    
    Use only str methods — no regex.
    """
    pass

def search_summary(text: str, pattern: str, case_sensitive: bool = True) -> None:
    """Pretty-print search results with context."""
    results = search_all(text, pattern, case_sensitive)
    print(f"Pattern: '{pattern}' ({'case-sensitive' if case_sensitive else 'case-insensitive'})")
    print(f"Found: {len(results)} match(es)")
    print()
    for r in results:
        ctx = r["context"]
        print(f"  Line {r['line']:>3}, Col {r['col']:>3}: ...{ctx}...")

Expected Output

Pattern: 'the' (case-insensitive)\nFound: 3 match(es)\n\n  Line   1, Col   5: ...over the laz...\n  Line   2, Col   1: ...The quick ...\n  Line   3, Col  10: ...jumps the f...\n---\nPattern: 'ana' (case-sensitive)\nFound: 2 match(es)\n\n  Line   1, Col   2: ...banana ban...\n  Line   1, Col   4: ...nanana ban...\n(Overlapping matches detected)

Hints

Hint 1: Use str.find() in a loop with a start parameter to find successive occurrences.

Hint 2: For overlapping matches, advance by 1 after each find (not by len(pattern)).

Hint 3: Track line numbers by counting newlines up to each match position.

Hint 4: For case-insensitive search, work on lowercased copies but report positions from the original.

Practice: Strings Internals and Immutability

Easy​

Medium​

Hard​

Easy

Medium

Hard