Welcome to Software Development on Codidact!

Will you help us build our independent community of developers helping developers? We're small and trying to grow. We welcome questions about all aspects of software development, from design to code to QA and more. Got questions? Got answers? Got code you'd like someone to review? Please join us.

Post History

75%

+4 −0

Q&A How this recursive treewalker works?

How this recursive treewalker works? To understand what the code does, we need to first see what problem it's trying to solve and why it needs to be solved this way. Let's consider this HTML:...

posted 2y ago by hkotsubo‭ · edited 2y ago by hkotsubo‭

Answer

#2: Post edited by

hkotsubo‭ · 2022-05-02T14:32:23Z (over 2 years ago)

Copy Link

Raw

Markdown

The first part ("*How this recursive treewalker works?*") was already answered, so I'll focus on:
> *can there be an even simpler and more direct version without the `else`?*
**For this particular case**, yes, you can remove the `else` clause:
```javascript
function replaceIn(e) {
if (e.nodeType == Node.TEXT_NODE) {
e.nodeValue = e.nodeValue.replaceAll("a", "");
}
for (const child of e.childNodes) {
replaceIn(child);
}
}
```
That's because text nodes don't have child nodes, so their `childNodes` property will always be empty (hence, for those nodes, there will be no elements to be traversed in the `for..of` loop, and the final result will be the same).
---
But there's an important aspect to consider.
`if`/`else` statements exist for a reason. When I do:
```javascript
if (condition) {
A
} else {
B
}
```
I'm saying that if the condition is true, do `A`, otherwise, do `B`. It's a choice between one or another, based on the condition.
But if I do:
```javascript
if (condition) {
A
}
B
```
Then `B` is **always** done, regardless of the condition being true or false. Removing the `else` statement is not a matter of being simpler or more direct, it can completely change the algorithm.
As I said, **for this particular case**, both ends up being equivalent, just because text nodes don't have child nodes.
Nevertheless, I still prefer to keep the `else` clause, because it conveys intention and semantics: if it's a text node, do this, if it's not, do that. It makes clear that the loop must not be made on text nodes.

> *How this recursive treewalker works?*
To understand what the code does, we need to first see what problem it's trying to solve and why it needs to be solved this way.
Let's consider this HTML:
```html
<body>
<a href="www.whatever.com">Whatever <i>site</i></a>
<div><p>abc<span>def</span></p></div>
</body>
```
It's rendered as:
![Rendered HTML](https://software.codidact.com/uploads/vPCH9MoNoBqqvgsny5tiiqfm)
And it's represented by the following DOM tree:
![DOM tree for the HTML above](https://software.codidact.com/uploads/nb9x4uADYdGomSJiCuaNGWAy)
The same diagram in ASCII:
```none
+---------------+
| element: body |
| |
| childNodes |
+--- | ---------+
|
\__ body's first child node
| +------------+
| | element: a |
| | |
| | childNodes |
| +--- | ------+
| |
| a's first child node..........a's second child node
| +------------------------+ +------------+
| | text node: "Whatever " | | element: i |
| | | | |
| | no childNodes | | childNodes |
| +------------------------+ +--- | ------+
| |
| i's first child node
| +-------------------+
| | text node: "site" |
| | |
| | no childNodes |
| +-------------------+
|
\__ body's second child node
+--------------+
| element: div |
| |
| childNodes |
+--- | --------+
|
div's first child node
+------------+
| element: p |
| |
| childNodes |
+--- | ------+
|
p's first child node....p's second child node
+------------------+ +---------------+
| text node: "abc" | | element: span |
| | | |
| no childNodes | | childNodes |
+------------------+ +--- | ---------+
|
span's first child
+------------------+
| text node: "def" |
| |
| no child nodes |
+------------------+
```
The structure is basically:
- `body` has two child nodes: tags `a` and `div`
- `a` has two child nodes: a text node with the text _"Whatever "_, and tag `i`
- `i` has one child node: a text node with the text _"site"_
- `div` has one child node: tag `p`
- `p` has two child nodes: a text node with the text _"abc"_, and tag `span`
- `span` has one child node: a text node with the text _"def"_
- And the text nodes don't have child nodes.
> *<sup>Actually, all the line breaks are also text nodes (such as the line break after `<body>` opening, another one after `</a>` closing, etc), but I'm omitting those for the sake of simplicity.</sup>*
The [original problem](https://software.codidact.com/posts/286304) that resulted in this code is to replace some character in all parts of the document where that character appears. To avoid replacing things that shouldn't be replaced (such as HTML tags and attributes, as already explained in answers for the original question), a better solution is to replace only the text nodes.
Note that different parts of the text are spread through the DOM tree. In this case, there are four text nodes, each one in a different part of the hierarchy. To replace all text nodes, you'll need to traverse through the tree and visit all of them, one by one.
But to reach all text nodes, you need to visit all other nodes, because a node is only reachable by its parent: starting at `body`, you can reach `a` and `div`; from `a` you can reach the text node _"Whatever "_ and tag `i`, and so on. Therefore, only by traversing the whole structure, you can be sure that all text nodes were visited and replaced. In other words, you need something like this:
- start at the `body` element
- check its child nodes (`a` and `div`):
- for each child node you check if they are text nodes and replace them
- then, we must check *their* children: `a`'s child nodes and `div`'s child nodes
- `a`'s children are the text node _"Whatever "_ and tag `i`
- for each one of those child nodes, check if they are text nodes and replace them
- then check for those child nodes' child nodes (the text node has no children, but `i` does)
- and so on...
- `div`'s children is `p`
- for that child node, check if it's a text node and replace it
- then check for `p`'s child nodes
- and so on...
An initial pseudo-code to do that would be like this:
```none
if (body is text node) {
replace its text
} else {
for each body's child nodes, do:
if (child node is text node) {
replace its text
} else {
for each child node's child nodes, do:
if (child node's child node is a text node) {
replace its text
} else {
for each child node of child node's child nodes, do:
and so on...
}
}
}
```
A first problem with this approach is: I don't know how many levels the DOM tree has (I know for this specific one, but think of an algorithm that should work for any HTML page), so how can you know how many nested loops to use?
We _could_ assume some value ("_I'll work with N levels_"), but a better solution would work regardless of the number of levels. How could this be made?
We could start with a simple function that handles only the first level (`body` and its children):
```javascript
// first version, incomplete
function replaceIn(e) {
// check if the node is a text node
if (e.nodeType == Node.TEXT_NODE) {
// it's a text node, then replace its text
e.nodeValue = e.nodeValue.replaceAll("a", "");
} else {
// for each child node
for (const child of e.childNodes) {
// do something with the child node
}
}
}
```
Then we could call `replaceIn(document.body)`, and the function will do the following:
- check if `body` is a text node (if it is, replace it)
- but `body` isn't a text node, so it enters the `else` clause, and execute the loop in `body`'s child nodes
`body`'s child nodes are `a` and `div`, so we need another function to check those. This function must check if the child node is a text node. If it is, replace its text. If it's not, loop through its child nodes and check each one of them:
```javascript
// second version: function to check child nodes
function checkChild(childNode) {
if (childNode.nodeType == Node.TEXT_NODE) {
// it's a text node, then replace its text
childNode.nodeValue = childNode.nodeValue.replaceAll("a", "");
} else {
// check the childNode's child nodes
for (const child of childNode.childNodes) {
// do something with the child node
}
}
}
// now we can use the new function to check the child nodes
function replaceIn(e) {
// check if the node is a text node
if (e.nodeType == Node.TEXT_NODE) {
// it's a text node, then replace its text
e.nodeValue = e.nodeValue.replaceAll("a", "");
} else {
// for each child node
for (const child of e.childNodes) {
checkChild(child); // <---- ** HERE: ** use the new function to check the child node
}
}
}
```
Now when we call `replaceIn(document.body)`, the function will do the following:
- check if `body` is a text node (if it is, replace it)
- but `body` isn't a text node, so it enters the `else` clause, and execute the loop in `body`'s child nodes
- for each child node (in this case, `a` and `div`), calls the `checkChild` function.
- in the first iteration, it'll call `checkChild` passing the `a` element as argument
- the `checkChild` function checks if `a` is a text node
- it's not, so it enters the `else` clause and loops through `a`'s child nodes (which are the text node "Whatever " and the tag `i`)
- **now we need another function to check those child nodes**
- in the second iteration, it'll call `checkChild` passing the `div` element as argument
- the `checkChild` function checks if `div` is a text node
- it's not, so it enters the `else` clause and loops through `div`'s child nodes (which is the tag `p`)
- **now we need another function to check this child node**
With that, we could reach 2 levels (`body`'s child nodes and `body`'s "grandchild" nodes). But we ended up with the same problem: now we need a function to check the grandchildren. Maybe if we do that:
```javascript
// third version: function to check granchild nodes
function checkGrandChild(grandChildNode) {
if (grandChildNode.nodeType == Node.TEXT_NODE) {
// it's a text node, then replace its text
grandChildNode.nodeValue = grandChildNode.nodeValue.replaceAll("a", "");
} else {
// check the grandChildNode's child nodes
for (const child of grandChildNode.childNodes) {
// do something with the child node
}
}
}
// now we can use the new function to check the grandchild nodes
function checkChild(childNode) {
if (childNode.nodeType == Node.TEXT_NODE) {
// it's a text node, then replace its text
childNode.nodeValue = childNode.nodeValue.replaceAll("a", "");
} else {
// check the childNode's child nodes (AKA "grandchild" nodes)
for (const child of childNode.childNodes) {
checkGrandChild(child); // <---- ** HERE: ** use the new function to check the grandchild node
}
}
}
function replaceIn(e) {
// check if the node is a text node
if (e.nodeType == Node.TEXT_NODE) {
// it's a text node, then replace its text
e.nodeValue = e.nodeValue.replaceAll("a", "");
} else {
// for each child node
for (const child of e.childNodes) {
checkChild(child);
}
}
}
```
Now we could handle 3 levels, but still need another function to check the grand-grandchild nodes.
**Wait a minute**, have you noticed that the `checkChild` and `checkGrandChild` functions are doing basically the same as the `replaceIn` function? All of them receives a node and do the following:
1. if the node is a text node, replace its text
2. if it's not a text node, loop through its child nodes
- for each child node, call a function that will execute steps 1 and 2 on that node
And if all functions are doing the same thing, **why not use only one of them**? In this case, we could use the `replaceIn` function:
```javascript
function replaceIn(e) {
// check if the node is a text node
if (e.nodeType == Node.TEXT_NODE) {
// it's a text node, then replace its text
e.nodeValue = e.nodeValue.replaceAll("a", "");
} else {
// for each child node
for (const child of e.childNodes) {
replaceIn(child);
}
}
}
```
And that's it. When you call `replaceIn(document.body)`, it's executed as follows:
- **1:** check if `body` is a text node; it's not, so loops through `body`'s child nodes
- **2:** `body`'s first child is `a` -> **recursive call**
- **2.1:** check if `a` is a text node; it's not, so loops through `a`'s child nodes
- **2.2:** `a`'s first child is the text node _"Whatever "_ -> **recursive call**
- **2.2.1:** check if it's a text node. **It is**, so replace its text (it becomes _"Whtever"_)
- **2.3:** `a`'s second child is `i` -> **recursive call**
- **2.3.1:** check if `i` is a text node; it's not, so loops through `i`'s child nodes
- **2.3.2:** `i`'s first child is the text node _"site"_ -> **recursive call**
- **2.3.2.1:** check if it's a text node. **It is**, so replace its text (in this case, there's no letter "a" in the text, so it remains unchanged)
- **3:** `body`'s second child is `div` -> **recursive call**
- **3.1:** check if `div` is a text node; it's not, so loops through `div`'s child nodes
- **3.2:** `div`'s first child is `p` -> **recursive call**
- **3.2.1:** check if `p` is a text node; it's not, so loops through `p`'s child nodes
- **3.2.2:** `p`'s first child is the text node _"abc"_ -> **recursive call**
- **3.2.2.1:** check if it's a text node. **It is**, so replace its text (it becomes _"bc"_)
- **3.2.3:** `p`'s second child is `span` -> **recursive call**
- **3.2.3.1:** check if `span` is a text node
- **3.2.3.2:** it's not, so loops through `span`'s child nodes
- **3.2.3.3:** `span`'s first child is the text node _"def"_ -> **recursive call**
- **3.2.3.3.1:** check if it's a text node. **It is**, so replace its text (in this case, there's no letter "a" in the text, so it remains unchanged)
In step 1, we're at "level 1" (checking `document.body`).
Step 2 is when the first recursive call is made: we're inside the `for`, looping through `body`'s child nodes, and call `replaceIn(child)`. At the first loop iteration, `child` will be `a`, so `replaceIn` will receive this element as argument.
At step 2.2, we're at the `for` loop, but now it's looping through `a`'s child nodes, and we make another recursive call: `replaceIn` now receives the text node _"Whatever"_, check that it's a text node, and replace it. Then we can proceed to `a`'s next child, which is `i` (then another recursive call is made at step 2.3, and so on).
The same happens for all other nodes in the hiearchy. For each one, check its type. If it's a text node, replace; otherwise loop through its child nodes and repeat the same steps for each child (and the recursive call is the way to "repeat the same steps").
Each recursive call is made to "go into a new level" in the hiearchy (first to check `body`'s children, then to check its grandchildren, then the grand-grandchildren and so on). I've put numbers to indicate those (every new level adds a new number, such as the call in step 2 goes to 2.1).
With this, you don't need to know how many levels the DOM tree has. For each new level, it'll make another recursive call, until there are no more levels to check. In the end, all nodes will be visited, and all the text nodes are replaced.
---
**Important note:** actually, the function receives a node where the replacement will start. In the examples above, I'm calling the function as `replaceIn(document.body)`, so the algorithm starts at the `body` element and replace everything below it. But if I call `replaceIn(someOtherElement)`, the replacement will be made only in that specific element (and all the tags inside it), instead of the whole page.
---
> *can there be an even simpler and more direct version without the `else`?*
**For this particular case**, yes, you can remove the `else` clause:
```javascript
function replaceIn(e) {
if (e.nodeType == Node.TEXT_NODE) {
e.nodeValue = e.nodeValue.replaceAll("a", "");
}
for (const child of e.childNodes) {
replaceIn(child);
}
}
```
That's because text nodes don't have child nodes, so their `childNodes` property will always be empty (hence, for those nodes, there will be no elements to be traversed in the `for..of` loop, and the final result will be the same).
---
But there's an important aspect to consider.
`if`/`else` statements exist for a reason. When I do:
```javascript
if (condition) {
A
} else {
B
}
```
I'm saying that if the condition is true, do `A`, otherwise, do `B`. It's a choice between one or another, based on the condition.
But if I do:
```javascript
if (condition) {
A
}
B
```
Then `B` is **always** done, regardless of the condition being true or false. Removing the `else` statement is not a matter of being simpler or more direct, it can completely change the algorithm.
As I said, **for this particular case**, both ends up being equivalent, just because text nodes don't have child nodes.
Nevertheless, I still prefer to keep the `else` clause, because it conveys intention and semantics: if it's a text node, do this, if it's not, do that. It makes clear that the loop must not be made on text nodes.

#1: Initial revision by

hkotsubo‭ · 2022-04-29T16:19:33Z (over 2 years ago)

Copy Link

Raw

Markdown

The first part ("*How this recursive treewalker works?*") was already answered, so I'll focus on:

> *can there be an even simpler and more direct version without the `else`?*

**For this particular case**, yes, you can remove the `else` clause:

```javascript
function replaceIn(e) {
    if (e.nodeType == Node.TEXT_NODE) {
        e.nodeValue = e.nodeValue.replaceAll("a", "");
    } 
    for (const child of e.childNodes) {
        replaceIn(child);
    }
}
```

That's because text nodes don't have child nodes, so their `childNodes` property will always be empty (hence, for those nodes, there will be no elements to be traversed in the `for..of` loop, and the final result will be the same).

---

But there's an important aspect to consider.

`if`/`else` statements exist for a reason. When I do:

```javascript
if (condition) {
    A
} else {
    B
}
```

I'm saying that if the condition is true, do `A`, otherwise, do `B`. It's a choice between one or another, based on the condition.

But if I do:

```javascript
if (condition) {
    A
}
B
```

Then `B` is **always** done, regardless of the condition being true or false. Removing the `else` statement is not a matter of being simpler or more direct, it can completely change the algorithm.

As I said, **for this particular case**, both ends up being equivalent, just because text nodes don't have child nodes.

Nevertheless, I still prefer to keep the `else` clause, because it conveys intention and semantics: if it's a text node, do this, if it's not, do that. It makes clear that the loop must not be made on text nodes.

Communities

Post History