Python Strings Internals Practice Problems & Exercises
Practice: Strings Internals and Immutability
← Back to lessonEasy
Demonstrate string immutability. The code below tries to change the first character of a string. It will fail. Catch the error, then fix it by creating a new string with the desired change.
s = "Python"
# Step 1: Try to assign to index 0 — catch the TypeError
try:
s[0] = "H"
except TypeError as e:
print(f"TypeError caught: {e}")
# Step 2: Create a new string with 'H' replacing the first character
fixed = "H" + s[1:]
print(f"Fixed: {fixed}")Solution
s = "Python"
try:
s[0] = "H"
except TypeError as e:
print(f"TypeError caught: {e}")
fixed = "H" + s[1:]
print(f"Fixed: {fixed}")
Why it works: Strings are immutable sequences in Python. Every "modification" actually creates a brand new string object. The original s still points to "Python" — it was never changed. This is a fundamental design choice: immutability enables string interning, hash caching, and thread safety.
Expected Output
TypeError caught: 'str' object does not support item assignment\nFixed: HythonHints
Hint 1: Strings in Python are immutable — you cannot assign to an index.
Hint 2: To "change" a string, you must create a new one using slicing or concatenation.
Predict the output of each is comparison. Think about which strings CPython interns automatically.
a = "hello"
b = "hello"
print(f"a is b: {a is b}")
c = "hello_world"
d = "hello_world"
print(f"c is d: {c is d}")
e = "hello world"
f = "hello world"
print(f"e is f: {e is f}")Solution
a is b: True
c is d: True
e is f: False
a is bisTrue:"hello"is a short string that looks like a valid identifier — CPython interns it at compile time, so both names point to the same object.c is disTrue:"hello_world"also looks like a valid identifier (only letters, digits, underscores). CPython interns these too.e is fisFalse:"hello world"contains a space — it does not look like an identifier, so CPython creates two separate string objects.
Key insight: String interning is a CPython implementation detail, not a language guarantee. Never use is to compare string values — always use ==.
Expected Output
a is b: True\nc is d: True\ne is f: FalseHints
Hint 1: CPython interns string literals that look like valid Python identifiers.
Hint 2: Strings with spaces or special characters are generally NOT interned automatically.
Predict the output of each expression. Pay attention to what * and + do with strings.
print("-" * 10)
print("***" + "ALERT" + "***")
print(("ab " * 3))
print(len("hi" * 4))Solution
----------
***ALERT***
ab ab ab
8
"-" * 10repeats the dash character 10 times."***" + "ALERT" + "***"concatenates three strings into one new string."ab " * 3repeats"ab "(including the trailing space) three times.len("hi" * 4)first creates"hihihihi"(8 characters), then returns its length.
Key insight: Every * and + operation allocates a new string object. The original strings are never modified. This matters for performance when you do it in a loop (see Problem 5).
Expected Output
----------\n***ALERT***\nab ab ab \n8Hints
Hint 1: String `*` repeats the string N times. String `+` concatenates.
Hint 2: Every operation creates a new string object — the originals are unchanged.
Write a function format_report(name, score, max_score) that returns a formatted string using f-strings. The percentage must always display exactly 2 decimal places.
print(format_report("Alice", 87, 100))
print(format_report("Bob", 45, 100))
print(format_report("Carol", 100, 100))
Solution
def format_report(name, score, max_score):
pct = (score / max_score) * 100
return f"Student: {name} | Score: {score}/{max_score} ({pct:.2f}%)"
Why it works: F-strings (formatted string literals, PEP 498) evaluate expressions inside {} at runtime and embed the result. The :.2f format specification means "floating-point with exactly 2 decimal places." F-strings are the fastest string formatting method in CPython because they compile to efficient bytecode — faster than % formatting or .format().
def format_report(name, score, max_score):
"""Return a formatted string using f-strings.
Format: 'Student: Alice | Score: 87/100 (87.00%)'
The percentage should always show exactly 2 decimal places.
"""
passExpected Output
Student: Alice | Score: 87/100 (87.00%)\nStudent: Bob | Score: 45/100 (45.00%)\nStudent: Carol | Score: 100/100 (100.00%)Hints
Hint 1: Use f-strings with expressions inside curly braces: f"{variable}".
Hint 2: For fixed decimal places, use format spec: f"{value:.2f}" gives 2 decimal places.
Medium
Implement Why Why Note: CPython has an optimization that can sometimes make build_with_join(n) to produce the same output as build_with_concat(n), but using str.join(). Then observe how the performance gap widens as n grows — this is the classic O(n^2) vs O(n) string building trap.Solution
+= is O(n^2): Each result += str(i) + "," creates a new string and copies the entire previous content. After n iterations, you have copied approximately 1 + 2 + 3 + ... + n = n(n+1)/2 characters total — that is O(n^2).join is O(n): str.join() first iterates the generator to collect all pieces, pre-calculates the total length needed, allocates one buffer, and copies each piece exactly once. Total work: O(n).+= appear faster for small strings (it tries to resize in-place if the refcount is 1), but this is fragile and should never be relied upon. Always use join for building strings from many pieces.
import time
def build_with_concat(n):
"""Build a string of n numbers using += concatenation."""
result = ""
for i in range(n):
result += str(i) + ","
return result
def build_with_join(n):
"""Build the same string using str.join()."""
pass
# Compare performance
for size in [1000, 10000, 50000]:
start = time.perf_counter()
build_with_concat(size)
concat_time = time.perf_counter() - start
start = time.perf_counter()
build_with_join(size)
join_time = time.perf_counter() - start
ratio = concat_time / join_time if join_time > 0 else float('inf')
print(f"n={size:>6}: concat={concat_time:.4f}s join={join_time:.4f}s ratio={ratio:.1f}x")Expected Output
n= 1000: concat=0.0005s join=0.0002s ratio=2.5x\nn= 10000: concat=0.0080s join=0.0015s ratio=5.3x\nn= 50000: concat=0.1200s join=0.0070s ratio=17.1x\n(Exact times vary — the ratio should grow with n)Hints
Hint 1: str.join() takes an iterable of strings and joins them with the separator.
Hint 2: The concat approach is O(n^2) because each += creates a new string and copies all previous characters.
Hint 3: join pre-calculates total length and copies once — O(n).
Demonstrate a full encode/decode roundtrip. Encode a string containing ASCII, CJK characters, and emoji to UTF-8 bytes, then decode it back. Show that byte length differs from character length.
text = "Hello, 世界! 🌍"
# Show original string and its type
print(f"Original: {text}")
print(f"Type: {type(text)}")
# Encode to UTF-8 bytes
encoded = text.encode("utf-8")
print(f"Encoded: {encoded}")
print(f"Type: {type(encoded)}")
# Compare byte length vs character length
print(f"Byte len: {len(encoded)}")
print(f"Char len: {len(text)}")
# Decode back to str
decoded = encoded.decode("utf-8")
print(f"Decoded: {decoded}")
print(f"Roundtrip: {text == decoded}")Solution
Original: Hello, 世界! 🌍
Type: <class 'str'>
Encoded: b'Hello, \xe4\xb8\x96\xe7\x95\x8c! \xf0\x9f\x8c\x8d'
Type: <class 'bytes'>
Byte len: 18
Char len: 12
Decoded: Hello, 世界! 🌍
Roundtrip: True
Byte count breakdown:
Hello,— 7 ASCII characters = 7 bytes (1 byte each in UTF-8)世界— 2 CJK characters = 6 bytes (3 bytes each in UTF-8)!— 2 ASCII characters = 2 bytes🌍— 1 emoji = 4 bytes (UTF-8 needs 4 bytes for codepoints above U+FFFF)- Total: 7 + 6 + 2 + 4 - 1 (the space is counted in "! ") = 18 bytes, but only 12 characters
Key insight: In Python 3, str is always Unicode text. bytes is raw binary data. You must encode/decode explicitly — Python 3 eliminated the implicit conversions that caused countless bugs in Python 2.
Expected Output
Original: Hello, 世界! 🌍\nType: <class 'str'>\nEncoded: b'Hello, \\xe4\\xb8\\x96\\xe7\\x95\\x8c! \\xf0\\x9f\\x8c\\x8d'\nType: <class 'bytes'>\nByte len: 18\nChar len: 12\nDecoded: Hello, 世界! 🌍\nRoundtrip: TrueHints
Hint 1: Use .encode("utf-8") to go from str to bytes, and .decode("utf-8") to go back.
Hint 2: UTF-8 uses variable-length encoding: ASCII = 1 byte, CJK = 3 bytes, emoji = 4 bytes.
Implement build_html_table using io.StringIO — the Python equivalent of Java's StringBuilder. This avoids the O(n^2) concatenation trap by writing pieces to a buffer and extracting the final string at the end.
headers = ["Name", "Age", "City"]
rows = [["Alice", "30", "NYC"], ["Bob", "25", "LA"]]
print(build_html_table(headers, rows))
Solution
import io
def build_html_table(headers, rows):
buf = io.StringIO()
buf.write("<table>\n")
# Header row
buf.write("<tr>")
for h in headers:
buf.write(f"<th>{h}</th>")
buf.write("</tr>\n")
# Data rows
for row in rows:
buf.write("<tr>")
for cell in row:
buf.write(f"<td>{cell}</td>")
buf.write("</tr>\n")
buf.write("</table>")
return buf.getvalue()
Why StringIO over +=: io.StringIO maintains an internal buffer that grows efficiently (similar to how list.append works with over-allocation). You write pieces to it without creating intermediate string copies, then call .getvalue() once at the end.
When to use which approach:
- Few pieces →
f-stringor+(clearest, fast enough) - Many pieces in a loop →
str.join()with a list (simplest efficient pattern) - Complex multi-line building →
io.StringIO(most flexible, allowswrite()calls anywhere)
import io
def build_html_table(headers, rows):
"""Build an HTML table string efficiently using io.StringIO.
Args:
headers: list of column header strings
rows: list of lists (each inner list is a row of cell values)
Returns:
Complete HTML table as a string
"""
passExpected Output
<table>\n<tr><th>Name</th><th>Age</th><th>City</th></tr>\n<tr><td>Alice</td><td>30</td><td>NYC</td></tr>\n<tr><td>Bob</td><td>25</td><td>LA</td></tr>\n</table>Hints
Hint 1: io.StringIO acts like a file you write strings to, then call .getvalue() for the result.
Hint 2: This avoids creating intermediate string objects on every concatenation.
Hint 3: The pattern: create buffer, write pieces, get final string.
Explore how Python strings use memory. Use sys.getsizeof to observe the base overhead and per-character cost for different string types.
import sys
# Empty string — just the object overhead
empty = ""
print(f"Empty string: {sys.getsizeof(empty)} bytes")
# ASCII strings of increasing length
for n in [1, 10, 100, 1000]:
s = "a" * n
size = sys.getsizeof(s)
per_char = (size - sys.getsizeof(empty)) / n
print(f"ASCII len={n:>4}: {size:>5} bytes ({per_char:.1f} bytes/char)")
# Strings with characters from different Unicode planes
ascii_str = "hello" # Latin-1 range (U+0000–U+00FF)
euro_str = "h\u00e9llo" # Latin-1 (accented e)
cjk_str = "h\u4e16llo" # BMP range (U+0100–U+FFFF)
emoji_str = "h\U0001f600llo" # Supplementary plane (U+10000+)
print(f"\nASCII '{ascii_str}': {sys.getsizeof(ascii_str)} bytes")
print(f"Latin1 '{euro_str}': {sys.getsizeof(euro_str)} bytes")
print(f"BMP '{cjk_str}': {sys.getsizeof(cjk_str)} bytes")
print(f"SMP '{emoji_str}': {sys.getsizeof(emoji_str)} bytes")Solution
Typical output on CPython 3.11+ (64-bit):
Empty string: 49 bytes
ASCII len= 1: 50 bytes (1.0 bytes/char)
ASCII len= 10: 59 bytes (1.0 bytes/char)
ASCII len= 100: 149 bytes (1.0 bytes/char)
ASCII len=1000: 1049 bytes (1.0 bytes/char)
ASCII 'hello': 54 bytes
Latin1 'hello': 54 bytes
BMP 'h世llo': 78 bytes
SMP 'h😀llo': 92 bytes
What is happening internally: CPython 3.3+ uses PEP 393 "flexible string representation." Every string is stored using the narrowest encoding that fits all its characters:
- Latin-1 (1 byte/char): all codepoints at or below U+00FF
- UCS-2 (2 bytes/char): any codepoint above U+00FF but below U+FFFF
- UCS-4 (4 bytes/char): any codepoint above U+FFFF (emoji, rare CJK, etc.)
The entire string is stored at the widest level needed. One emoji in a 1000-character ASCII string forces all 1000 characters to use 4 bytes each. This is why mixing ASCII and emoji can unexpectedly quadruple memory usage.
Expected Output
(Output varies by platform — see solution for typical values)Hints
Hint 1: sys.getsizeof returns the size of the object in bytes, including the overhead.
Hint 2: CPython uses different internal representations depending on the max codepoint: Latin-1 (1 byte/char), UCS-2 (2 bytes/char), UCS-4 (4 bytes/char).
Prove that string slicing creates new objects by comparing id() values. Investigate whether a full slice s[:] behaves differently from a partial slice.
s = "Engineering the Text Layer"
# Full slice
full_slice = s[:]
print(f"Original id: {id(s)}")
print(f"Full slice id: {id(full_slice)}")
print(f"Same object? {s is full_slice}")
# Partial slice
sub = s[0:11] # "Engineering"
print(f"Substring id: {id(sub)}")
print(f"Same object? {s is sub}")
print(f"Content equal? {s[0:11] == sub}")Solution
Original id: (some integer, e.g., 140234567890)
Full slice id: (same integer)
Same object? True
Substring id: (different integer)
Same object? False
Content equal? True
What happens internally:
- Full slice
s[:]: CPython optimizes this — since strings are immutable, there is no reason to copy. It returns the same object, sos[:] is sisTrue. - Partial slice
s[0:11]: This must create a new string object because it represents different content. A newstrobject is allocated, the characters are copied, and a newid()is assigned.
Why this matters: Unlike lists (where lst[:] creates a shallow copy), strings skip the copy when the full slice is taken. This is safe precisely because strings are immutable — no one can mutate the "copy" and accidentally affect the "original."
Warning: This is a CPython implementation detail. Other Python implementations (PyPy, Jython) may behave differently. Never write code that depends on s[:] is s being True.
Expected Output
Original id: (some integer)\nFull slice id: (same or different — see solution)\nSame object? True\nSubstring id: (different integer)\nSame object? False\nContent equal? TrueHints
Hint 1: Use id() to get the memory address of an object.
Hint 2: A full slice s[:] of a string may return the same object (CPython optimization).
Hint 3: A substring slice always creates a new string object.
Hard
Build an InternPool class that simulates CPython's string interning. It should deduplicate strings so that equal content always returns the same object, and track hit/miss statistics.
import sys
pool = InternPool()
# Without interning — two separate objects
a = "".join(["h", "e", "l", "l", "o"])
b = "".join(["h", "e", "l", "l", "o"])
print(f"a is b: {a is b} (before interning)")
# With interning — same object returned
a2 = pool.intern(a)
b2 = pool.intern(b)
print(f"a2 is b2: {a2 is b2} (after interning, same object)")
# Intern more strings
pool.intern("world")
pool.intern("hello") # hit — already in pool
pool.intern("world") # hit — already in pool
print(f"Pool size: {pool.pool_size()}")
print(f"Stats: {pool.stats()}")
print(f"Memory saved: {pool.memory_saved()}")
Solution
import sys
class InternPool:
def __init__(self):
self._pool = {}
self._stats = {"hits": 0, "misses": 0}
def intern(self, s: str) -> str:
if s in self._pool:
self._stats["hits"] += 1
return self._pool[s]
else:
self._stats["misses"] += 1
self._pool[s] = s
return s
def is_interned(self, s: str) -> bool:
return s in self._pool
def pool_size(self) -> int:
return len(self._pool)
def stats(self) -> dict:
return dict(self._stats)
def memory_saved(self) -> str:
if not self._pool:
return "~0 bytes"
# Each hit avoids allocating a duplicate string object.
# Estimate average string size from pool contents.
total_size = sum(sys.getsizeof(s) for s in self._pool)
avg_size = total_size / len(self._pool)
saved = int(self._stats["hits"] * avg_size)
return f"~{saved} bytes"
pool = InternPool()
a = "".join(["h", "e", "l", "l", "o"])
b = "".join(["h", "e", "l", "l", "o"])
print(f"a is b: {a is b} (before interning)")
a2 = pool.intern(a)
b2 = pool.intern(b)
print(f"a2 is b2: {a2 is b2} (after interning, same object)")
pool.intern("world")
pool.intern("hello") # hit
pool.intern("world") # hit
print(f"Pool size: {pool.pool_size()}")
print(f"Stats: {pool.stats()}")
print(f"Memory saved: {pool.memory_saved()}")
How real CPython interning works:
sys.intern(s)adds a string to the global intern table. Future lookups of the same content return the same object.- CPython automatically interns string literals that look like identifiers (used for attribute names, variable names, dict keys).
- The intern table is a regular dict — O(1) lookup. The trade-off: the table itself uses memory, but you save memory when many references point to the same content.
When to use sys.intern() in real code: When you have thousands of identical strings (e.g., parsing CSV column names, XML tag names, JSON keys) and want to reduce memory. Interning trades lookup time for memory savings.
class InternPool:
"""Simulate Python's string interning mechanism.
When a string is interned, future requests for the same
content return the SAME object (same id), saving memory.
"""
def __init__(self):
self._pool = {} # content -> interned string object
self._stats = {"hits": 0, "misses": 0}
def intern(self, s: str) -> str:
"""Intern a string. Return the canonical version.
If already interned, return the existing object (cache hit).
If new, store it and return it (cache miss).
"""
pass
def is_interned(self, s: str) -> bool:
"""Check if a string with this content is in the pool."""
pass
def pool_size(self) -> int:
"""Return number of unique strings in the pool."""
pass
def stats(self) -> dict:
"""Return hit/miss statistics."""
pass
def memory_saved(self) -> str:
"""Estimate memory saved by interning (based on hit count
and average string size in the pool)."""
passExpected Output
a is b: False (before interning)\na2 is b2: True (after interning, same object)\nPool size: 3\nStats: {'hits': 3, 'misses': 3}\nMemory saved: ~174 bytesHints
Hint 1: The pool is a dict mapping string content to the canonical string object.
Hint 2: On intern(): if the content exists in the pool, return the pooled object (hit). Otherwise, store it (miss).
Hint 3: For memory_saved, each hit avoids allocating a new string — estimate using sys.getsizeof.
Build a Unicode-aware character counter. Python's len() counts codepoints, not visible characters. When combining characters are used (like accent marks), len() overcounts. Your function should count grapheme clusters — what the user actually sees on screen.
tests = [
"caf\u0065\u0301", # cafe + combining acute accent = café
"\u006e\u0303a", # n + combining tilde + a = ña
"\u0065\u0301\u0065\u0308", # e+acute, e+diaeresis = éë
]
for text in tests:
result = analyze_string(text)
print(f"Text: {result['raw']!r} displays as: {result['display']}")
print(f"Codepoints: {result['codepoints']}")
print(f"Graphemes: {result['graphemes']}")
print(f"Bytes (UTF-8): {result['utf8_bytes']}")
print("---")
Solution
import unicodedata
def count_graphemes(text: str) -> int:
if not text:
return 0
count = 0
for i, char in enumerate(text):
# A character starts a new grapheme cluster if:
# - it's the first character, OR
# - it's NOT a combining character
if i == 0 or unicodedata.combining(char) == 0:
count += 1
return count
def analyze_string(text: str) -> dict:
return {
"raw": text,
"display": unicodedata.normalize("NFC", text),
"codepoints": len(text),
"graphemes": count_graphemes(text),
"utf8_bytes": len(text.encode("utf-8")),
}
tests = [
"caf\u0065\u0301",
"\u006e\u0303a",
"\u0065\u0301\u0065\u0308",
]
for text in tests:
result = analyze_string(text)
print(f"Text: {result['raw']!r} displays as: {result['display']}")
print(f"Codepoints: {result['codepoints']}")
print(f"Graphemes: {result['graphemes']}")
print(f"Bytes (UTF-8): {result['utf8_bytes']}")
print("---")
Three levels of "length" in Unicode:
- Bytes (
len(s.encode("utf-8"))) — how much space on disk/network - Codepoints (
len(s)) — number of Unicode codepoints - Grapheme clusters (
count_graphemes(s)) — what the user sees
Why this matters in production:
- Database
VARCHAR(N)limits may count bytes or codepoints — they differ for non-ASCII text. - Text truncation must happen at grapheme boundaries. Cutting in the middle of a combining sequence produces garbled output.
unicodedata.normalize("NFC", text)can collapse base + combining character into a single precomposed codepoint (e.g.,e+\u0301becomes\u00e9), which is why normalization should happen at system boundaries.
Limitation: This simplified approach does not handle all grapheme cluster edge cases (emoji with zero-width joiners, regional indicator sequences, etc.). For production use, the grapheme or regex library with \X provides full UAX #29 grapheme segmentation.
import unicodedata
def count_graphemes(text: str) -> int:
"""Count user-perceived characters (grapheme clusters), not codepoints.
For example, 'e\u0301' (e + combining acute accent) is ONE visible
character but TWO Unicode codepoints.
Approach: iterate through codepoints and check if each is a
combining character (unicodedata.combining() > 0). A combining
character merges with the preceding base character.
"""
pass
def analyze_string(text: str) -> dict:
"""Return a detailed breakdown of a Unicode string."""
passExpected Output
Text: 'cafe\\u0301' displays as: café\nCodepoints: 5\nGraphemes: 4\nBytes (UTF-8): 6\n---\nText: 'n\\u0303a' displays as: ña\nCodepoints: 3\nGraphemes: 2\nBytes (UTF-8): 4\n---\nText: 'e\\u0301e\\u0308' displays as: éë\nCodepoints: 4\nGraphemes: 2\nBytes (UTF-8): 6Hints
Hint 1: unicodedata.combining(char) returns 0 for base characters and nonzero for combining marks.
Hint 2: A grapheme cluster = one base character + zero or more combining characters.
Hint 3: Be careful: the first character is always a grapheme start, even if combining() returns nonzero.
Build a substring search engine using only built-in str methods (no re module). It must handle case-insensitive search, overlapping matches, multi-line text, and provide context around each match.
# Test 1: Multi-line case-insensitive search
text1 = """The fox jumped over the lazy dog.
The quick brown fox.
The dog jumps the fence."""
search_summary(text1, "the", case_sensitive=False)
print("---")
# Test 2: Overlapping matches
text2 = "banana banana"
search_summary(text2, "ana", case_sensitive=True)
print("(Overlapping matches detected)")
Solution
def search_all(text: str, pattern: str, case_sensitive: bool = True) -> list:
if not pattern:
return []
results = []
# For case-insensitive search, work on lowered copies
search_text = text if case_sensitive else text.lower()
search_pattern = pattern if case_sensitive else pattern.lower()
# Pre-compute line start positions for fast line number lookup
line_starts = [0]
for i, ch in enumerate(text):
if ch == "\n":
line_starts.append(i + 1)
def get_line_and_col(pos):
# Binary search for the line containing position `pos`
lo, hi = 0, len(line_starts) - 1
while lo < hi:
mid = (lo + hi + 1) // 2
if line_starts[mid] <= pos:
lo = mid
else:
hi = mid - 1
line_num = lo + 1 # 1-indexed
col = pos - line_starts[lo] + 1 # 1-indexed
return line_num, col
# Find all occurrences (overlapping allowed)
start = 0
plen = len(search_pattern)
ctx_window = 10
while start <= len(search_text) - plen:
pos = search_text.find(search_pattern, start)
if pos == -1:
break
line_num, col = get_line_and_col(pos)
# Extract context (from original text, not lowered)
ctx_start = max(0, pos - ctx_window)
ctx_end = min(len(text), pos + plen + ctx_window)
context = text[ctx_start:ctx_end].replace("\n", " ")
results.append({
"position": pos,
"line": line_num,
"col": col,
"context": context,
})
# Advance by 1 to allow overlapping matches
start = pos + 1
return results
def search_summary(text: str, pattern: str, case_sensitive: bool = True) -> None:
results = search_all(text, pattern, case_sensitive)
print(f"Pattern: '{pattern}' ({'case-sensitive' if case_sensitive else 'case-insensitive'})")
print(f"Found: {len(results)} match(es)")
print()
for r in results:
ctx = r["context"]
print(f" Line {r['line']:>3}, Col {r['col']:>3}: ...{ctx}...")
text1 = """The fox jumped over the lazy dog.
The quick brown fox.
The dog jumps the fence."""
search_summary(text1, "the", case_sensitive=False)
print("---")
text2 = "banana banana"
search_summary(text2, "ana", case_sensitive=True)
print("(Overlapping matches detected)")
Performance analysis:
str.find()uses a variant of the Boyer-Moore-Horspool algorithm internally (CPython uses a mix of Boyer-Moore and Horspool). Average case is O(n/m) where n is text length and m is pattern length.- The overlapping search advances by 1 after each match, so worst case for a pathological pattern like
"aaa"in"aaaa...a"is O(n*m). - Pre-computing line starts and using binary search for line number lookup makes each position lookup O(log L) where L is the number of lines, instead of O(n) for counting newlines each time.
Why not regex? For simple literal substring search, str.find() is faster than re.search() because it avoids the overhead of compiling a regex pattern and running the NFA/DFA engine. Use regex when you need pattern matching (wildcards, character classes, quantifiers). Use str.find() when you know the exact substring.
def search_all(text: str, pattern: str, case_sensitive: bool = True) -> list:
"""Find ALL occurrences of pattern in text.
Return list of dicts with position, line number, and context.
Must handle:
- Case-insensitive search
- Overlapping matches
- Multi-line text with line number tracking
- Context window (10 chars before/after match)
Use only str methods — no regex.
"""
pass
def search_summary(text: str, pattern: str, case_sensitive: bool = True) -> None:
"""Pretty-print search results with context."""
results = search_all(text, pattern, case_sensitive)
print(f"Pattern: '{pattern}' ({'case-sensitive' if case_sensitive else 'case-insensitive'})")
print(f"Found: {len(results)} match(es)")
print()
for r in results:
ctx = r["context"]
print(f" Line {r['line']:>3}, Col {r['col']:>3}: ...{ctx}...")Expected Output
Pattern: 'the' (case-insensitive)\nFound: 3 match(es)\n\n Line 1, Col 5: ...over the laz...\n Line 2, Col 1: ...The quick ...\n Line 3, Col 10: ...jumps the f...\n---\nPattern: 'ana' (case-sensitive)\nFound: 2 match(es)\n\n Line 1, Col 2: ...banana ban...\n Line 1, Col 4: ...nanana ban...\n(Overlapping matches detected)Hints
Hint 1: Use str.find() in a loop with a start parameter to find successive occurrences.
Hint 2: For overlapping matches, advance by 1 after each find (not by len(pattern)).
Hint 3: Track line numbers by counting newlines up to each match position.
Hint 4: For case-insensitive search, work on lowercased copies but report positions from the original.
