Requirements for a web agent tool

The tool should be able to both view a webpage and interact with it by clicking or typing. When viewing a webpage, only its visible parts should be displayed, and metadata about interactivity has to be added: in this case, tags that indicate whether an element can be clicked (<clickable>) or is a text input (<typeable>).
Since these are the most common interactions, we expect them to be enough to use many websites. To enable them, we have built two separate tools that locate elements based on the text they contain and then click them or type into them. Note, however, that there are other types of form input interactions that we do not cover here.

Parsing a webpage

We chose to parse the website with a custom JS script rather than Python's Beautiful Soup for several reasons:
- JS is faster and can later be executed on the webpage itself using Playwright.
- We can access the position of objects, allowing for occlusion checks.
- The browser has already built the DOM, so it can be traversed without a separate parsing step.
The parser can also be found on GitHub.

Occlusion Checks

Let's start with the occlusion check. As there is no built-in JS function to check whether an element is occluded, we'll implement a custom algorithm.
To do that, we determine the center coordinate of the element we want to check and find the topmost element at that coordinate.
To prevent very small elements from occluding much bigger ones, we calculate the area of intersection and only count an element as occluded when more than 50% of its area is covered.


/*
* 
* 
* 
* We calculate the edges of the box of intersection (so we can then calculate width and height and ultimately the area):
* 
*         |--width-|
* 
*   +--------------+
*   |              |
*   |     +--top---+--------+   –
*   |     |########|        |   |
*   |     |########|        |   |
*   | left|########|right   |   | height
*   |     |########|        |   |
*   |     |########|        |   |
*   +-----+-bottom-+        |   –
*         |                 |
*         +-----------------+
* 
* height = bottom - top
* width = right-left
* 
*/
function isOccluded(el) {
	if(el.nodeType === Node.ELEMENT_NODE && el.ownerDocument === document){//only check element nodes of the main document; content inside nested documents uses a different coordinate system and is ignored
		const rect = el.getBoundingClientRect();//get the rectangle the node occupies
		let elementArea = rect.width*rect.height//calculate its area...
		const centerX = rect.left + rect.width / 2;//...and center coordinate
		const centerY = rect.top + rect.height / 2;
		const topElement = document.elementFromPoint(centerX, centerY);//get the top element at the center coordinate
		if(topElement&& topElement.nodeName!="A"){//<a> tags (links) should not occlude other elements, as they are sometimes overlaid to make elements clickable
			if(!el.contains(topElement) && !topElement.contains(el)){//if they are not a child of each other...
				//calculate intersection box:
				let topRect = topElement.getBoundingClientRect();//get the rectangle the top element occupies
				let left = Math.max(rect.left, topRect.left)
				let right = Math.min(rect.right, topRect.right)
				let top = Math.max(rect.top, topRect.top)
				let bottom = Math.min(rect.bottom, topRect.bottom)
				if(right<left || bottom<top) return false//check if the intersection box is invalid, return false if it is
				let intersectArea = (right-left)*(bottom-top)//get the area of the intersection
				return intersectArea/elementArea>0.5//treat the element as occluded when more than 50% of its area is covered
			}
		}
		
	}
	return false;
}
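
To make the 50% rule concrete, here is a small Python sketch of the same intersection arithmetic outside the browser. The Rect class and the occluded_fraction helper are hypothetical names used for illustration only; they mirror the edges returned by getBoundingClientRect() but are not part of the parser, and the sketch skips the elementFromPoint lookup entirely.

```python
from dataclasses import dataclass


@dataclass
class Rect:
    """Axis-aligned box using the same edges as getBoundingClientRect()."""
    left: float
    top: float
    right: float
    bottom: float

    @property
    def area(self) -> float:
        # An inverted box (right < left or bottom < top) has no area
        return max(0.0, self.right - self.left) * max(0.0, self.bottom - self.top)


def occluded_fraction(element: Rect, cover: Rect) -> float:
    """Return the share of `element` that lies underneath `cover` (0.0 to 1.0)."""
    if element.area == 0:
        return 0.0  # assumption of this sketch: zero-sized elements count as not occluded
    overlap = Rect(
        left=max(element.left, cover.left),
        top=max(element.top, cover.top),
        right=min(element.right, cover.right),
        bottom=min(element.bottom, cover.bottom),
    )
    return overlap.area / element.area


button = Rect(left=0, top=0, right=100, bottom=40)     # 100 x 40, area 4000
banner = Rect(left=0, top=0, right=100, bottom=30)     # covers the top 30 pixels
tooltip = Rect(left=90, top=30, right=140, bottom=60)  # only overlaps one corner

print(occluded_fraction(button, banner))   # 0.75  -> above 0.5, counted as occluded
print(occluded_fraction(button, tooltip))  # 0.025 -> below 0.5, still visible
```

The tooltip case shows why the area threshold matters: a small overlay that happens to sit on top of a large element does not hide it.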

Detect clickable and typeable elements

Let's move on to annotating clickability and typeability. There is, again, no simple built-in tool for that. We chose to determine those properties using CSS selectors since they're quite fast and easy to implement.
The queries likely won't cover all cases, but they cover a broad range since they also check role attributes.
These two functions check whether a node is clickable or typeable:


function isClickable(node) {//use a css query to match common objects that can be clicked (there may be unchecked edge cases)
	if(node.nodeType === Node.ELEMENT_NODE){
		return node.matches('a, button, select, option, area, input[type="submit"], input[type="button"], input[type="reset"], input[type="radio"], input[type="checkbox"], [role="button"], [role="link"], [role="checkbox"], [role="menuitemcheckbox"] , [role="menuitemradio"], [role="option"], [role="radio"], [role="switch"], [role="tab"], [role="treeitem"], [onclick]')
	}else{
		return false
	}
}

function isTypeable(node) {//use a css query to match common objects one can type in (there may be unchecked edge cases)
	if(node.nodeType === Node.ELEMENT_NODE){
		return node.matches('input[type="text"],input[type="password"],input[type="email"],input[type="search"],input[type="tel"],input[type="url"],input[type="number"],textarea,[role="textbox"], [role="search"], [role="searchbox"], [role="combobox"], [onkeydown],[onkeypress],[onkeyup]')
	}else{
		return false
	}
}

Annotating the DOM tree

Now that we have implemented the most important checks, we can put all this together into one recursive parser.
The parser will start at the root node and recursively annotate and copy its children, their children, etc., until the whole DOM is traversed.
Here's what it should do (step-by-step):
1. Check whether the current node is undefined, invisible, or occluded (if so, return an empty node). Text nodes are copied as they are.
2. Determine if the element is clickable or typeable.
3. Create a new node that will serve as the copy of the original node. Its tag name will be clickable, typeable, or unwrap (it's hard to unwrap nodes directly, so we'll do that later with regex).
4. Add the children of the original element to the copy. Two edge cases have to be handled separately here: shadow DOM and iframes.
5. If the element does not contain any text, derive a text from its attributes (title, aria-label, placeholder, or alt).

Here's the code:


function annotateNode(element) {//recursive main loop: Creates a copy of the DOM, incorporates shadow DOM, removes invisible/occluded elements, annotates usability of elements and replaces all other tags with <unwrap>
	//Step 1: Is the element undefined, invisible or a TextNode?
	if(!element){//If something is None/undefined, it's replaced with an empty TextNode
		return document.createTextNode("")
	}else if(element.nodeType== Node.TEXT_NODE){//Text nodes are copied as is:
			return document.createTextNode(element.textContent)
	}else if(element.nodeType === Node.ELEMENT_NODE){//Remove elements that are invisible, occluded or simply invisible by default (e.g. scripts):
		if(["SCRIPT", "STYLE","META","LINK","NOSCRIPT"].includes(element.nodeName) || isOccluded(element) || !element.checkVisibility({opacityProperty: true, visibilityProperty : true})) {
			return document.createTextNode("")
		}
	}
	
	//Step 2+3: Determine the elements name and create the node for the copy
	let name =  isClickable(element) ? "clickable": isTypeable(element) ? "typeable" : "unwrap"//decide whether the current node can be clicked, typed or if it should later be unwrapped (unwrap)
	
	let newNode = document.createElement(name);//create a node copy to add the content of the original to
	
	//Step 4: Add the children to the copy node
	//Normal childNodes:
	for (let node of element.childNodes) {//add all children of the original node but also annotate them
		newNode.appendChild(annotateNode(node))
	}
	
	//childNodes inside an iFrame:
	if(element.nodeName=="IFRAME"){//if the original node is an iFrame add all its content too:
			try{
				let doc = element.contentDocument.body || element.contentWindow.document.body//find the iframes body as a root to start from
				let nodes = doc.querySelectorAll(":scope > *")//select all direct children of the root element
				for (let node of nodes){//annotate and add the nodes to the node copy
					newNode.appendChild(annotateNode(node))
				}
			}catch(e){
				console.log("Couldn't unwrap iFrame. Likely due to a cross-origin request issue: ",e)
			}
	}
	
	//childNodes inside shadowDOM:
	if (element.shadowRoot) {//if the element hosts a shadow DOM, add its children too
		for (let node of element.shadowRoot.childNodes) {//annotate and add children of shadow dom root
			newNode.appendChild(annotateNode(node))
		}
	}
	
	//Step 5: add text to empty elements based on their attributes
	if(newNode.innerText.trim().replaceAll(" ","") == ""){
		newNode.innerText = "";
		let replacer = element.title || element.ariaLabel || element.placeholder || element.alt;
		if(replacer){
			newNode.appendChild(document.createTextNode(replacer))
		}
	}
	
	return newNode
}
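
The five steps can also be traced on a toy tree without a browser. The following Python sketch is only an illustration of the recursion: ToyNode, CLICKABLE_TAGS, TYPEABLE_TAGS, and annotate are hypothetical names, the visible flag stands in for the occlusion and visibility checks, and the label field stands in for the title, aria-label, placeholder, and alt attributes.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ToyNode:
    """Hypothetical stand-in for a DOM node; not a real browser object."""
    tag: str               # e.g. "div", "a", "input", or "#text"
    text: str = ""         # only used by "#text" nodes
    visible: bool = True   # stands in for the occlusion/visibility checks
    label: str = ""        # stands in for title/aria-label/placeholder/alt
    children: list = field(default_factory=list)


CLICKABLE_TAGS = {"a", "button"}       # heavily simplified version of isClickable
TYPEABLE_TAGS = {"input", "textarea"}  # heavily simplified version of isTypeable


def annotate(node: Optional[ToyNode]) -> str:
    # Step 1: drop missing or invisible nodes, copy text nodes as they are
    if node is None or not node.visible:
        return ""
    if node.tag == "#text":
        return node.text
    # Steps 2 + 3: choose the tag name of the copy
    if node.tag in CLICKABLE_TAGS:
        name = "clickable"
    elif node.tag in TYPEABLE_TAGS:
        name = "typeable"
    else:
        name = "unwrap"
    # Step 4: annotate the children recursively
    inner = "".join(annotate(child) for child in node.children)
    # Step 5: fall back to a descriptive attribute when nothing was produced
    # (simplification: here any child markup counts as content, not just text)
    if not inner.strip():
        inner = node.label
    return f"<{name}>{inner}</{name}>"


page = ToyNode("div", children=[
    ToyNode("a", children=[ToyNode("#text", text="Sign in")]),
    ToyNode("input", label="Search"),
    ToyNode("button", visible=False, children=[ToyNode("#text", text="Hidden")]),
])

print(annotate(page))
# <unwrap><clickable>Sign in</clickable><typeable>Search</typeable></unwrap>
```

The hidden button disappears completely, the empty input receives its label as text, and the surrounding div becomes an unwrap tag that still has to be cleaned up.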

Cleaning the results and putting everything together

While we could, in theory, apply this function to the body of the document right away, the result would still be a mess.
The unwrap tags, for example, are still in there.
As mentioned earlier, we will remove them (and clean up the result in general) with regex.
Here's the code (with an additional getAllHTML() function for convenience):

function cleanUp(res){//clean up an annotated html
  res = res.replaceAll("<unwrap>","<br>").replaceAll("</unwrap>","<br>")//turn all unwrap tags into <br> (their content is kept)
  res = res.replaceAll(/<br>/g, '\n');//replace <br> tags with newlines
  res = res.replaceAll('&nbsp;', ' ');//replace &nbsp; with blanks
  res = res.replaceAll(/<clickable>\s*<\/clickable>/g, '');//remove empty clickable tags
  res = res.replaceAll(/<typeable>\s*<\/typeable>/g, '');//remove empty typeable tags
  res = res.replaceAll(/\n\s+/g, '\n');//collapse a newline followed by whitespace (including further newlines) into a single newline
  res = res.replaceAll(/ {2,}/g, ' ')//replace two or more blanks with just one blank
  return res
}
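function getAllHTML() {//function to retrieve the current page as annotated text
  const annotated = annotateNode(document.body)//annotated copy of the page
  //element nodes provide outerHTML; a bare text node (e.g. when document.body is missing) only has textContent
  return cleanUp(annotated.outerHTML ?? annotated.textContent ?? "")
}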
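
The effect of each substitution is easier to see on a small input. The following Python sketch ports the same substitutions to the standard re module, using real newline characters; clean_up is a hypothetical name and the input string is made up for this example.

```python
import re


def clean_up(res: str) -> str:
    """Python port of the substitutions described above (illustrative only)."""
    res = res.replace("<unwrap>", "<br>").replace("</unwrap>", "<br>")  # unwrap tags become line breaks
    res = res.replace("<br>", "\n")                        # line breaks become newlines
    res = res.replace("&nbsp;", " ")                       # non-breaking spaces become blanks
    res = re.sub(r"<clickable>\s*</clickable>", "", res)   # drop empty clickable tags
    res = re.sub(r"<typeable>\s*</typeable>", "", res)     # drop empty typeable tags
    res = re.sub(r"\n\s+", "\n", res)                      # collapse newline + whitespace runs
    res = re.sub(r" {2,}", " ", res)                       # collapse runs of blanks
    return res


raw = (
    "<unwrap><unwrap>Welcome&nbsp;&nbsp;back</unwrap>"
    "<clickable>Sign in</clickable><clickable> </clickable>"
    "<unwrap><typeable>Search</typeable></unwrap></unwrap>"
)
print(repr(clean_up(raw)))
# '\nWelcome back\n<clickable>Sign in</clickable>\n<typeable>Search</typeable>\n'
```

Every unwrap boundary ends up as at most one line break, the empty clickable tag is gone, and the annotated elements keep their text.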

You can now execute getAllHTML on any website to retrieve a parsed version.
If you don't want the annotation tags, you can use the text-only version on GitHub.

Examples for parsed pages

Below is a parsed version of google.com:

<clickable>About</clickable><clickable>Store</clickable>
<clickable>Gmail</clickable>
<clickable>Images</clickable>
<clickable>
Sign in
</clickable>
<typeable> 
<typeable>Search</typeable>
<clickable>Google Search</clickable> <clickable>I'm Feeling Lucky</clickable>
</typeable>
Google offered in: <clickable>Deutsch</clickable> 
Germany
<clickable>Advertising</clickable><clickable>Business</clickable><clickable> How Search works </clickable>
<clickable>Privacy</clickable><clickable>Terms</clickable>
<clickable>
Settings
</clickable>

As one can see, the visible text is successfully extracted, and the interactive tags (e.g. the search field) are annotated. Invisible text like the list of services is not displayed.
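
Because the annotations are plain tags, a backend can pull out the labels it is allowed to act on with a regular expression. The following is a minimal Python sketch, assuming the output format shown above; list_targets and TAG_PATTERN are hypothetical helpers and not part of the published code.

```python
import re

# Matches innermost <clickable>...</clickable> or <typeable>...</typeable> pairs
TAG_PATTERN = re.compile(r"<(clickable|typeable)>([^<]*)</\1>")


def list_targets(annotated: str) -> list:
    """Return (kind, label) pairs for every non-empty innermost annotation."""
    targets = []
    for kind, label in TAG_PATTERN.findall(annotated):
        label = " ".join(label.split())  # normalise newlines and repeated blanks
        if label:
            targets.append((kind, label))
    return targets


sample = """<clickable>Gmail</clickable>
<clickable>
Sign in
</clickable>
<typeable>Search</typeable>
<clickable>Google Search</clickable>"""

print(list_targets(sample))
# [('clickable', 'Gmail'), ('clickable', 'Sign in'), ('typeable', 'Search'), ('clickable', 'Google Search')]
```

Only innermost tags are matched, so a typeable wrapper that contains further tags (like the outer one in the Google output) is skipped while its children are still listed.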

Here's the beginning of apple.com (as of 25 September 2025):

<clickable>
Apple
</clickable>
<clickable>
Search apple.com</clickable>
<clickable>
Shopping Bag</clickable>
0
+
Last chance to get AirPods or an eligible accessory of your choice when you buy Mac or iPad with education savings. Ends 9.30.
<clickable>1</clickable>
<clickable>Shop
</clickable>
<clickable> </clickable>
iPhone 17 Pro in cosmic orange finish, Pro Fusion camera system, 3 lenses, microphone, flash
<clickable> </clickable>
iPhone Air
The thinnest iPhone ever. With the power of pro inside.
<clickable>Learn more</clickable>
<clickable>Buy</clickable>
Side view of iPhone Air, showing very thin titanium side
<clickable> </clickable>
iPhone 17
Magichromatic.
<clickable>Learn more</clickable>
<clickable>Buy</clickable>

While there are some issues (some of the menu items are SVGs rather than text and have no title attributes or aria labels, which makes it impossible to retrieve a textual representation), the output is largely correct.
Now that we have a functional website viewer, we can move on to integrating it with Playwright and a Python backend.

Using websites with Playwright

The logical next step is making a tool to interact with websites that uses our parser.
(Note: If you haven't installed Playwright and Firefox yet, run
pip install pytest-playwright and
playwright install firefox.)

Imports and variables

Let's start by importing Playwright and defining some variables that we'll need later: the JS parser (parse_script) and two simplified CSS queries (is_clickable and is_typeable).

The parser is the script we will execute on each page.
The two CSS queries are used to locate elements more accurately when executing clicking and typing actions.

from playwright.sync_api import sync_playwright, expect# Import playwright. Note: You may have to install it first using "pip install pytest-playwright"

#define JavaScript parser
parse_script = """
() => {
function isOccluded(el) {
	if(el.nodeType === Node.ELEMENT_NODE && el.ownerDocument === document){//only check element nodes of the main document; content inside nested documents uses a different coordinate system and is ignored
		const rect = el.getBoundingClientRect();//get the rectangle the node occupies
		let elementArea = rect.width*rect.height//calculate its area...
		const centerX = rect.left + rect.width / 2;//...and center coordinate
		const centerY = rect.top + rect.height / 2;
		const topElement = document.elementFromPoint(centerX, centerY);//get the top element at the center coordinate
		if(topElement&& topElement.nodeName!="A"){//<a> tags (links) should not occlude other elements, as they are sometimes overlaid to make elements clickable
			if(!el.contains(topElement) && !topElement.contains(el)){//if they are not a child of each other...
				//calculate intersection box:
				let topRect = topElement.getBoundingClientRect();//get the rectangle the top element occupies
				let left = Math.max(rect.left, topRect.left)
				let right = Math.min(rect.right, topRect.right)
				let top = Math.max(rect.top, topRect.top)
				let bottom = Math.min(rect.bottom, topRect.bottom)
				if(right<left || bottom<top) return false//check if the intersection box is invalid, return false if it is
				let intersectArea = (right-left)*(bottom-top)//get the area of the intersection
				return intersectArea/elementArea>0.5//treat the element as occluded when more than 50% of its area is covered
			}
		}
		
	}
	return false;
}

function isClickable(node) {//use a css query to match common objects that can be clicked (there may be unchecked edge cases)
	if(node.nodeType === Node.ELEMENT_NODE){
		return node.matches('a, button, select, option, area, input[type="submit"], input[type="button"], input[type="reset"], input[type="radio"], input[type="checkbox"], [role="button"], [role="link"], [role="checkbox"], [role="menuitemcheckbox"] , [role="menuitemradio"], [role="option"], [role="radio"], [role="switch"], [role="tab"], [role="treeitem"], [onclick]')
	}else{
		return false
	}
}

function isTypeable(node) {//use a css query to match common objects one can type in (there may be unchecked edge cases)
	if(node.nodeType === Node.ELEMENT_NODE){
		return node.matches('input[type="text"],input[type="password"],input[type="email"],input[type="search"],input[type="tel"],input[type="url"],input[type="number"],textarea,[role="textbox"], [role="search"], [role="searchbox"], [role="combobox"], [onkeydown],[onkeypress],[onkeyup]')
	}else{
		return false
	}
}


function annotateNode(element) {//recursive main loop: Creates a copy of the DOM, incorporates shadow DOM, removes invisible/occluded elements, annotates usability of elements and replaces all other tags with <unwrap>
	if(!element){//If the element is None/undefined, it is replaced with an empty TextNode
		return document.createTextNode("")
	}else if(element.nodeType== Node.TEXT_NODE){//Text nodes are copied as is:
			return document.createTextNode(element.textContent)
	}else if(element.nodeType === Node.ELEMENT_NODE){//Remove elements that are invisible, occluded or simply invisible by default (e.g. scripts):
		if(["SCRIPT", "STYLE","META","LINK","NOSCRIPT"].includes(element.nodeName) || isOccluded(element) || !element.checkVisibility({opacityProperty: true, visibilityProperty : true})) {
			return document.createElement("br")
		}
	}
	
	let name =  isClickable(element) ? "clickable": isTypeable(element) ? "typeable" : "unwrap"//decide whether the current node can be clicked, typed or if it should later be unwrapped (unwrap)
	
	let newNode = document.createElement(name);//create a node copy to add the content of the original to
	
	//Normal childNodes:
	for (let node of element.childNodes) {//add all children of the original node but also annotate them
		newNode.appendChild(annotateNode(node))
	}
	
	//childNodes inside an iFrame:
	if(element.nodeName=="IFRAME"){//if the original node is an iFrame add all its content too:
			try{
				let doc = element.contentDocument.body || element.contentWindow.document.body//find the iframes body as a root to start from
				let nodes = doc.querySelectorAll(":scope > *")//select all direct children of the root element
				for (let node of nodes){//annotate and add the nodes to the node copy
					newNode.appendChild(annotateNode(node))
				}
			}catch(e){
				console.log("Couldn't unwrap iFrame. Likely due to a cross-origin request issue: ",e)
			}
	}
	
	//childNodes inside shadowDOM:
	if (element.shadowRoot) {//if the element hosts a shadow DOM, add its children too
		for (let node of element.shadowRoot.childNodes) {//annotate and add children of shadow dom root
			newNode.appendChild(annotateNode(node))
		}
	}
	
	//add text to empty elements based on their attributes
	if(newNode.innerText.trim().replaceAll(" ","") == ""){
		newNode.innerText = "";
		let replacer = element.title || element.ariaLabel || element.placeholder || element.alt;
		if(replacer){
			newNode.appendChild(document.createTextNode(replacer))
		}
	}
	
	return newNode
}
function cleanUp(res){//clean up an annotated html
  res = res.replaceAll("<unwrap>","<br>").replaceAll("</unwrap>","<br>")//remove all unwrap tags (not their content)
  res = res.replaceAll(/<br>/g, '\\n');//replace <br> tags with \\n
  res = res.replaceAll('&nbsp;', ' ');//replace &nbsp; with blanks
  res = res.replaceAll(/<clickable>\s*<\/clickable>/g, '');//remove empty clickable tags
  res = res.replaceAll(/<typeable>\s*<\/typeable>/g, '');//remove empty typeable tags
  res = res.replaceAll(/\\n\s+/g, '\\n');//replace multiple \\n with just one \\n
  res = res.replaceAll(/ {2,}/g, ' ')//replace two or more blanks with just one blank
  return res
}
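function getAllHTML() {//function to retrieve the current page as annotated text
  const annotated = annotateNode(document.body)//annotated copy of the page
  //element nodes provide outerHTML; a bare text node (e.g. when document.body is missing) only has textContent
  return cleanUp(annotated.outerHTML ?? annotated.textContent ?? "")
}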
return getAllHTML();
}"""

is_clickable = ':is(a, button, select, option, area, input, [role], [onclick])'#fast css query to find clickable elements
is_typeable = ':is(input,textarea,[role], [onkeydown],[onkeypress],[onkeyup])'#fast css query to find elements one can type in

Browser and Page

Now we still need a browser and page instance:


p = sync_playwright().start()
firefox = p.firefox#get firefox instance. Note: You may have to install firefox with "playwright install firefox"
browser = firefox.launch()#Open firefox. To view the browser window pass headless=False to the launch function
page = browser.new_page()#Creates a new page instance

CSS query generator for text search

To locate elements whose text description is only contained in attributes, we'll write a function that creates custom CSS queries on the fly:


def text_selector(text):#Creates a css selector that searches for elements with the specified text contained in descriptive attributes
    text = repr(text)
    return ':is([aria-label = '+text+' i],[title = '+text+' i],[alt = '+text+' i], [placeholder = '+text+' i], [value = '+text+' i])'
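
To see what the generated query looks like, and where repr() can fall short, here is a small sketch. text_selector is copied unchanged from above; css_string is a hypothetical addition that quotes the text following CSS string escaping rules instead of Python's. The difference only shows up for unusual inputs such as text containing a line break.

```python
def text_selector(text):
    # Copied unchanged from the function above
    text = repr(text)
    return ':is([aria-label = '+text+' i],[title = '+text+' i],[alt = '+text+' i], [placeholder = '+text+' i], [value = '+text+' i])'


def css_string(text: str) -> str:
    """Quote `text` as a CSS string (hypothetical helper, not in the original code)."""
    out = []
    for ch in text:
        if ch in ('"', "\\"):
            out.append("\\" + ch)                  # escape double quotes and backslashes
        elif ord(ch) < 0x20 or ord(ch) == 0x7F:
            out.append("\\{:x} ".format(ord(ch)))  # control characters as hex escapes
        else:
            out.append(ch)
    return '"' + "".join(out) + '"'


print(text_selector("Search"))
# :is([aria-label = 'Search' i],[title = 'Search' i],[alt = 'Search' i], [placeholder = 'Search' i], [value = 'Search' i])

print(repr("Line one\nLine two"))        # 'Line one\nLine two'  (Python escape; CSS reads \n as a plain n)
print(css_string("Line one\nLine two"))  # "Line one\a Line two" (CSS hex escape for the line break)
```

For ordinary labels both approaches produce an equivalent selector, so repr() is a reasonable shortcut as long as the searched text contains no control characters.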

Visiting URLs

However, we can't open pages yet, so let's write a function for that too:


def open_page(url):#Opens a URL
    page.goto(url=url)

Executing the JS parser with Playwright

And we still have to write a function for executing the parser:


def get_info():#Executes the JavaScript website parser to get an annotated representation of the website
    expect(page.locator("css=body")).to_be_visible()#Wait for body to be visible (a rough signal that the page has loaded)
    res = page.evaluate(parse_script)#Execute js parser
    return res

Executing clicking and typing actions

The most important part is actually interacting with the web page.
That part will be done by the two functions below (which use the CSS queries we defined at the beginning):


def click(text):
	try:
		elements = page.get_by_text(text).or_(page.locator("css={}".format(text_selector(text))))#Select all elements that contain the text
		button = elements.and_(page.locator("css={}".format(is_clickable))).first#Of the elements that contain the text, select the first that is clickable
		button.click(force=True)#Click the clickable element
	except Exception as e:
		print("Clicking '{}' failed!".format(text))
		print(e)

def type(text,to_type,press_enter=True):
	try:
		elements = page.get_by_text(text).or_(page.locator("css={}".format(text_selector(text))))#Find all elements that contain the text
		field = elements.and_(page.locator("css={}".format(is_typeable))).first#Of the elements that contain the text, select the first one that is a text input field
		field.type(to_type)#Type into the text input field
		if press_enter:#Press enter if necessary:
			page.keyboard.press("Enter")
	except Exception as e:
		print("Typing into '{}' failed!".format(text))
		print(e)

A short demo

Let's test the code:


open_page("https://www.example.com")
print(get_info())
click("More information")
print(get_info())

The output should look like this:

Example Domain
This domain is for use in illustrative examples in documents. You may use this
domain in literature without prior coordination or asking for permission.
<clickable>More information...</clickable>


<clickable>
Homepage
</clickable>
<clickable>Domains</clickable>
<clickable>Protocols</clickable>
<clickable>Numbers</clickable>
<clickable>About</clickable>
Example Domains
As described in <clickable>RFC 2606</clickable> and <clickable>RFC 6761</clickable>, a
number of domains such as example.com and example.org are maintained
for documentation purposes. These domains may be used as illustrative
examples in documents without prior coordination with us. They are not
available for registration or transfer.
We provide a web service on the example domain hosts to provide basic
information on the purpose of the domain. These web services are
provided as best effort, but are not designed to support production
applications. While incidental traffic for incorrectly configured
applications is expected, please do not design applications that require
the example domains to have operating HTTP service.
Further Reading
<clickable>IANA-managed Reserved Domains</clickable>
Last revised 2017-05-13.
<clickable>Domain Names</clickable>
<clickable>Root Zone Registry</clickable>
<clickable>.INT Registry</clickable>
<clickable>.ARPA Registry</clickable>
<clickable>IDN Repository</clickable>
<clickable>Number Resources</clickable>
<clickable>Abuse Information</clickable>
<clickable>Protocols</clickable>
<clickable>Protocol Registries</clickable>
<clickable>Time Zone Database</clickable>
<clickable>About Us</clickable>
<clickable>News</clickable>
<clickable>Performance</clickable>
<clickable>Excellence</clickable>
<clickable>Archive</clickable>
<clickable>Contact Us</clickable>
The IANA functions coordinate the Internet’s globally unique identifiers, and
are provided by <clickable>Public Technical Identifiers</clickable>, an affiliate of
<clickable>ICANN</clickable>.
<clickable>Privacy Policy</clickable>
<clickable>Terms of Service</clickable>

I hope you liked this tutorial! We plan to publish a second part soon that shows how to integrate this code base with the Ollama Python API.
Stay tuned and have fun! You can find the complete code on GitHub.
