Rust library to scrape HTML documents with XPath expressions.
This library is at major version 0 because the API is still evolving. See the Supported XPath Features section for details.
Skyscraper has its own HTML parser implementation. The parser outputs a tree structure that can be traversed manually with parent/child relationships.
use skyscraper::html::{self, parse::ParseError};
let html_text = r##"
<html>
<body>
<div>Hello world</div>
</body>
</html>"##;
let document = html::parse(html_text)?;

use skyscraper::html::{self, DocumentNode};
// Parse the HTML text into a document
let text = r#"<parent><child/><child/></parent>"#;
let document = html::parse(text)?;
// Get the children of the root node
let parent_node: DocumentNode = document.root_node;
let children: Vec<DocumentNode> = parent_node.children(&document).collect();
assert_eq!(2, children.len());
// Get the parent of both child nodes
let parent_of_child0: DocumentNode = children[0].parent(&document).expect("parent of child 0 missing");
let parent_of_child1: DocumentNode = children[1].parent(&document).expect("parent of child 1 missing");
assert_eq!(parent_node, parent_of_child0);
assert_eq!(parent_node, parent_of_child1);

Skyscraper's HTML parser follows the WHATWG parsing specification. One notable consequence is implicit <tbody> insertion: when <tr>, <td>, or <th> elements appear as direct children of <table>, the parser automatically wraps them in a <tbody> element (per WHATWG §13.2.6.4.9). This matches browser behavior but differs from parsers like Python's lxml, which does not insert <tbody>. As a result, XPath expressions like //table/* or //table//* may return different results than lxml for the same input HTML. To avoid this discrepancy, use explicit <tbody> tags in your HTML or account for the implicit element in your XPath expressions.
Skyscraper is capable of parsing XPath strings and applying them to HTML documents.
Below is a basic XPath example. See the docs for more examples.
use skyscraper::html;
use skyscraper::xpath::{self, XpathItemTree, grammar::{XpathItemTreeNodeData, data_model::{Node, XpathItem}}};
use std::error::Error;
fn main() -> Result<(), Box<dyn Error>> {
let html_text = r##"
<html>
<body>
<div>Hello world</div>
</body>
</html>"##;
let document = html::parse(html_text)?;
let xpath_item_tree = XpathItemTree::from(&document);
let xpath = xpath::parse("//div")?;
let item_set = xpath.apply(&xpath_item_tree)?;
assert_eq!(item_set.len(), 1);
let mut items = item_set.into_iter();
let item = items
.next()
.unwrap();
let element = item
.as_node()?
.as_tree_node()?
.data
.as_element_node()?;
assert_eq!(element.name, "div");
Ok(())
}

Below is a non-exhaustive list of the features that are currently supported.
- Basic XPath steps: /html/body/div, //div/table//span
- Attribute selection: //div/@class
- Text selection: //div/text()
- Wildcard node selection: //body/*
- Predicates:
  - Attributes: //div[@class='hi']
  - Indexing: //div[1]
  - Arbitrary expressions: //div[contains(@class, 'hi')]
- Forward axes: child::, descendant::, attribute::, self::, descendant-or-self::, following-sibling::, following::, namespace::
- Reverse axes: parent::, ancestor::, preceding-sibling::, preceding::, ancestor-or-self::
- Operators:
  - Logical: and, or
  - Comparison: =, !=, <, >, <=, >=, eq, ne, lt, gt, le, ge
  - Arithmetic: +, -, *, div, idiv, mod
  - String concatenation: ||
  - Sequence: union/|, intersect, except, to
  - Simple map: !
  - Arrow: =>
  - Node comparison: is, <<, >>
- Expressions: if/then/else, for, let, some/every (quantified)
- Type expressions: instance of, cast as, castable as, treat as
- Functions (100+): string (contains, starts-with, ends-with, substring, concat, normalize-space, upper-case, lower-case, translate, matches, replace, tokenize, ...), numeric (round, floor, ceiling, abs, sum, avg, min, max, ...), boolean (not, true, false, boolean), sequence (count, empty, exists, reverse, sort, distinct-values, head, tail, subsequence, ...), node (name, local-name, root, path, has-children, data, ...), higher-order (for-each, filter, fold-left, fold-right, ...), and more
- Maps and arrays construction and access
If your use case requires an unimplemented feature, please open an issue on GitHub.
See docs/features-backlog.md for a detailed list of spec gaps, known limitations, and design decisions.
