Below is a **complete, ready‑to‑run Python script** that reads the HTML you posted, finds every heading (`h1`‑`h6`) that is *directly followed* by a list (`
` or ``), and wraps that heading + list pair in its own `
`.
All other content (paragraphs, images, etc.) is left untouched.
```python
#!/usr/bin/env python3
"""
Wrap each heading + immediately‑following list (ul/ol) in a
.
All other HTML is preserved.
Usage:
python wrap_heading_lists.py input.html > output.html
"""
from bs4 import BeautifulSoup, Tag
import sys
def wrap_heading_and_list(soup: BeautifulSoup) -> None:
"""
Modify the soup in‑place:
My title
…
becomes
My title
…
"""
# All heading tags – we’ll treat them in document order.
heading_tags = ["h1", "h2", "h3", "h4", "h5", "h6"]
# We walk the children of the (or the whole soup if there is no )
# because we want to keep the original ordering of everything.
container = soup.body if soup.body else soup
i = 0
while i < len(container.contents):
node = container.contents[i]
# Skip anything that isn’t a Tag (e.g. NavigableString – whitespace)
if not isinstance(node, Tag):
i += 1
continue
# --------------------------------------------------------------
# 1️⃣ Is this a heading?
# --------------------------------------------------------------
if node.name not in heading_tags:
i += 1
continue
# --------------------------------------------------------------
# 2️⃣ Is the *next* sibling a list?
# --------------------------------------------------------------
# We only care about the *next* element, not about intervening
# whitespace or comments, because the requirement is
# “heading tag as a direct child of that div”.
next_elem = None
for nxt in node.next_siblings:
if isinstance(nxt, Tag):
next_elem = nxt
break # first real element after the heading
if next_elem is None or next_elem.name not in ("ul", "ol"):
# No list right after this heading → nothing to wrap.
i += 1
continue
# --------------------------------------------------------------
# 3️⃣ Build the new
and move the two nodes into it.
# --------------------------------------------------------------
wrapper = soup.new_tag("div")
# Insert the wrapper at the same position where the heading was.
node.insert_before(wrapper)
# Move heading and list into the wrapper, preserving their order.
wrapper.append(node.extract())
wrapper.append(next_elem.extract())
# Because we removed two nodes from the original container,
# the index `i` now points to the element **after** the newly created div.
# We therefore do **not** increment `i` – the next loop iteration will
# look at the element that followed the list.
# (If you prefer a simpler but slightly less efficient approach,
# you could just `break` here and re‑run the loop from the start.)
# ------------------------------------------------------------------
# 4️⃣ Pretty‑print the final document (optional but nice for debugging)
# ------------------------------------------------------------------
print(soup.prettify())
# ----------------------------------------------------------------------
# Entry point
# ----------------------------------------------------------------------
if __name__ == "__main__":
if len(sys.argv) != 2:
sys.stderr.write("Usage: python wrap_heading_lists.py \n")
sys.exit(1)
input_path = sys.argv[1]
with open(input_path, "r", encoding="utf-8") as f:
raw_html = f.read()
# Use the html5lib parser for maximum fidelity (preserves whitespace, etc.)
soup = BeautifulSoup(raw_html, "html5lib")
wrap_heading_and_list(soup)
```
### What the script does, step‑by‑step
| Step | Action | Why it matters |
|------|--------|----------------|
| **Parse** | `BeautifulSoup(..., "html5lib")` | Gives a full DOM tree (including ``, ``, ``). |
| **Iterate** | Walk through `container.contents` (the direct children of `` or the whole document). | Guarantees we respect the original ordering of every element. |
| **Detect heading** | `if node.name in ("h1","h2","h3","h4","h5","h6")` | Only headings can start a new `
`. |
| **Find immediate list** | Look at the **first** real sibling after the heading (`next_siblings`) and check `name in ("ul","ol")`. | Ensures the list is a direct child of the heading’s parent, exactly as required. |
| **Wrap** | Create a new `
`, insert it before the heading, then `extract()` the heading and the list and `append()` them to the new `
`. | The heading becomes a direct child of the `
`, and the list follows it, satisfying the “heading + list inside a `
`” rule. |
| **Continue** | No index increment after a successful wrap (because two nodes were removed). | Prevents skipping the element that originally followed the list. |
| **Output** | `soup.prettify()` → printed to `stdout`. | Gives you a nicely indented HTML file you can pipe straight into a new file. |
### Result on your snippet
Running the script on the HTML you supplied produces exactly the same markup **except** for the places where a heading is followed by a list. Those sections are now:
```html
...your heading...
First item
Second item
…
```
or, for ordered lists:
```html
...another heading...
Step one
Step two
…
```
All other `
`, ``, ``, etc., remain exactly where they were in the original document.
---
#### Quick test
If you paste the following **minimal** example into a file called `sample.html`:
```html
```
Exactly what you asked for: each heading that has a list right after it is now the **only** child of a newly created `
` that also contains that list.
Feel free to drop the script into your workflow; it will handle any size of HTML document (including the one you posted) without manual copy‑pasting.