Sunbelt Computer Software

Skyscraper - HTML scraping with XPath

Rust library to scrape HTML documents with XPath expressions.

This library is still at major version 0 because its API is evolving. See the Supported XPath Features section for what is currently implemented.

HTML Parsing

Skyscraper ships its own HTML parser. The parser produces a document tree that can be traversed manually through parent/child relationships.

Example: Simple HTML Parsing

use skyscraper::html::{self, parse::ParseError};

fn main() -> Result<(), ParseError> {
    let html_text = r##"
<html>
    <body>
        <div>Hello world</div>
    </body>
</html>"##;

    // Parse the HTML text into a document tree
    let document = html::parse(html_text)?;
    Ok(())
}

Example: Traversing Parent/Child Relationships

use skyscraper::html::{self, parse::ParseError, DocumentNode};

fn main() -> Result<(), ParseError> {
    // Parse the HTML text into a document
    let text = r#"<parent><child/><child/></parent>"#;
    let document = html::parse(text)?;

    // Get the children of the root node
    let parent_node: DocumentNode = document.root_node;
    let children: Vec<DocumentNode> = parent_node.children(&document).collect();
    assert_eq!(2, children.len());

    // Get the parent of both child nodes
    let parent_of_child0: DocumentNode = children[0].parent(&document).expect("parent of child 0 missing");
    let parent_of_child1: DocumentNode = children[1].parent(&document).expect("parent of child 1 missing");

    assert_eq!(parent_node, parent_of_child0);
    assert_eq!(parent_node, parent_of_child1);
    Ok(())
}
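
Both children and parent in the example above take &document as an argument, which suggests that a node is a lightweight handle into storage owned by the document. The following is a minimal sketch of that general pattern (an index-based arena) using only the standard library. It is not Skyscraper's implementation; the names NodeId, NodeData, Document, and add_child are hypothetical and exist only for this illustration.

```rust
// Hypothetical arena-style tree. These types are illustrative only and are
// NOT Skyscraper's actual types.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct NodeId(usize);

struct NodeData {
    name: String,
    parent: Option<NodeId>,
    children: Vec<NodeId>,
}

/// The document owns every node; a NodeId is just an index into `nodes`.
struct Document {
    nodes: Vec<NodeData>,
}

impl Document {
    /// Create a document with a single root node and return its handle.
    fn new(root_name: &str) -> (Self, NodeId) {
        let root = NodeData {
            name: root_name.to_string(),
            parent: None,
            children: Vec::new(),
        };
        (Document { nodes: vec![root] }, NodeId(0))
    }

    /// Append a new node under `parent` and return the new node's handle.
    fn add_child(&mut self, parent: NodeId, name: &str) -> NodeId {
        let id = NodeId(self.nodes.len());
        self.nodes.push(NodeData {
            name: name.to_string(),
            parent: Some(parent),
            children: Vec::new(),
        });
        self.nodes[parent.0].children.push(id);
        id
    }
}

impl NodeId {
    /// A handle carries no data, so the document must be passed in.
    fn children<'a>(self, doc: &'a Document) -> impl Iterator<Item = NodeId> + 'a {
        doc.nodes[self.0].children.iter().copied()
    }

    fn parent(self, doc: &Document) -> Option<NodeId> {
        doc.nodes[self.0].parent
    }
}

fn main() {
    let (mut doc, root) = Document::new("parent");
    doc.add_child(root, "child");
    doc.add_child(root, "child");

    let children: Vec<NodeId> = root.children(&doc).collect();
    assert_eq!(children.len(), 2);
    for child in &children {
        assert_eq!(child.parent(&doc), Some(root));
    }
    println!("{} has {} children", doc.nodes[root.0].name, children.len());
}
```

Because handles in this sketch are plain indices, they are cheap to copy and compare, which is one reason a tree API might ask for the document on every traversal call.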

WHATWG Compliance Note

Skyscraper's HTML parser follows the WHATWG parsing specification. One notable consequence is implicit <tbody> insertion: when <tr>, <td>, or <th> elements appear as direct children of <table>, the parser automatically wraps them in a <tbody> element (per WHATWG §13.2.6.4.9). This matches browser behavior but differs from parsers like Python's lxml, which does not insert <tbody>. As a result, XPath expressions like //table/* or //table//* may return different results than lxml for the same input HTML. To avoid this discrepancy, use explicit <tbody> tags in your HTML or account for the implicit element in your XPath expressions.
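
To make the effect of implicit <tbody> insertion concrete, here is a minimal standard-library sketch of the resulting tree shape. It does not use Skyscraper and is not its parser; the Node type and the wrap_rows_in_tbody function are hypothetical, and the sketch handles only <tr> children of <table>, whereas the full WHATWG algorithm also covers <td>, <th>, and other cases.

```rust
// Toy element tree. `Node` is a hypothetical type used only for this sketch.
#[derive(Debug, Clone, PartialEq)]
struct Node {
    name: String,
    children: Vec<Node>,
}

impl Node {
    fn new(name: &str, children: Vec<Node>) -> Self {
        Node { name: name.to_string(), children }
    }
}

/// Simplified model of implicit <tbody> insertion: each run of consecutive
/// <tr> elements that are direct children of <table> is moved into a new
/// <tbody>. All other children are kept as they are.
fn wrap_rows_in_tbody(table: Node) -> Node {
    let mut out: Vec<Node> = Vec::new();
    let mut implicit_open = false;
    for child in table.children {
        if child.name == "tr" {
            if !implicit_open {
                out.push(Node::new("tbody", vec![]));
                implicit_open = true;
            }
            // The last element of `out` is the currently open implicit <tbody>.
            out.last_mut().unwrap().children.push(child);
        } else {
            implicit_open = false;
            out.push(child);
        }
    }
    Node { name: table.name, children: out }
}

/// Names of the direct children, i.e. what `*` selects relative to `node`.
fn child_names(node: &Node) -> Vec<&str> {
    node.children.iter().map(|c| c.name.as_str()).collect()
}

fn main() {
    // <table><tr/><tr/></table> as written in the source HTML
    let table = Node::new(
        "table",
        vec![Node::new("tr", vec![]), Node::new("tr", vec![])],
    );
    let parsed = wrap_rows_in_tbody(table);

    // //table/* now selects the single <tbody>, not the two <tr> elements
    assert_eq!(child_names(&parsed), vec!["tbody"]);
    // The rows are still reachable one level down, e.g. via //table/tbody/tr
    assert_eq!(child_names(&parsed.children[0]), vec!["tr", "tr"]);
    println!("{:?}", child_names(&parsed));
}
```

In practice this means an expression such as //table/tr matches nothing under a WHATWG-compliant parser, while //table/tbody/tr or //table//tr matches the rows.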

XPath Expressions

Skyscraper can parse XPath expressions and apply them to HTML documents.

Below is a basic XPath example. See the docs for more examples.

use skyscraper::html;
use skyscraper::xpath::{self, XpathItemTree, grammar::{XpathItemTreeNodeData, data_model::{Node, XpathItem}}};
use std::error::Error;

fn main() -> Result<(), Box<dyn Error>> {
    let html_text = r##"
    <html>
        <body>
            <div>Hello world</div>
        </body>
    </html>"##;

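    // Parse the HTML and build the item tree that XPath expressions run against
    let document = html::parse(html_text)?;
    let xpath_item_tree = XpathItemTree::from(&document);

    // Parse the XPath expression and apply it to the tree
    let xpath = xpath::parse("//div")?;
    let item_set = xpath.apply(&xpath_item_tree)?;

    // Exactly one <div> matches
    assert_eq!(item_set.len(), 1);

    let mut items = item_set.into_iter();

    let item = items
        .next()
        .unwrap();

    // Drill down from the generic item to the element node
    let element = item
        .as_node()?
        .as_tree_node()?
        .data
        .as_element_node()?;

    assert_eq!(element.name, "div");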
    Ok(())
}

Supported XPath Features

Below is a non-exhaustive list of the features that are currently supported.

Basic XPath steps: /html/body/div, //div/table//span
Attribute selection: //div/@class
Text selection: //div/text()
Wildcard node selection: //body/*
Predicates:
1. Attributes: //div[@class='hi']
2. Indexing: //div[1]
3. Arbitrary expressions: //div[contains(@class, 'hi')]
Forward axes: child::, descendant::, attribute::, self::, descendant-or-self::, following-sibling::, following::, namespace::
Reverse axes: parent::, ancestor::, preceding-sibling::, preceding::, ancestor-or-self::
Operators:
1. Logical: and, or
2. Comparison: =, !=, <, >, <=, >=, eq, ne, lt, gt, le, ge
3. Arithmetic: +, -, *, div, idiv, mod
4. String concatenation: ||
5. Sequence: union/|, intersect, except, to
6. Simple map: !
7. Arrow: =>
8. Node comparison: is, <<, >>
Expressions: if/then/else, for, let, some/every (quantified)
Type expressions: instance of, cast as, castable as, treat as
Functions (100+), including:
1. String: contains, starts-with, ends-with, substring, concat, normalize-space, upper-case, lower-case, translate, matches, replace, tokenize, ...
2. Numeric: round, floor, ceiling, abs, sum, avg, min, max, ...
3. Boolean: not, true, false, boolean
4. Sequence: count, empty, exists, reverse, sort, distinct-values, head, tail, subsequence, ...
5. Node: name, local-name, root, path, has-children, data, ...
6. Higher-order: for-each, filter, fold-left, fold-right, ...
Map and array construction and access
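
One point about the indexing predicate in the list above is worth spelling out: in standard XPath, positions are 1-based, and //div[1] applies the predicate per parent, so it can return several nodes, whereas (//div)[1] indexes the whole result sequence once. The sketch below models that difference on a toy tree using only the standard library. It does not call Skyscraper; the Tree type and the per_parent and whole_sequence methods are hypothetical names chosen for this illustration.

```rust
// Hypothetical toy document tree, for illustration only (not Skyscraper's
// types). Node indices are assigned in document order (pre-order), and
// node 0 is the document node.
struct Tree {
    names: Vec<&'static str>,
    children: Vec<Vec<usize>>,
}

impl Tree {
    /// Models `descendant-or-self::node()` from `start`, in document order.
    fn descendants_or_self(&self, start: usize) -> Vec<usize> {
        let mut out = vec![start];
        for &child in &self.children[start] {
            out.extend(self.descendants_or_self(child));
        }
        out
    }

    /// Models `//name[pos]`: the predicate is evaluated per parent, so every
    /// node contributes its pos-th child called `name` (1-based), if any.
    fn per_parent(&self, name: &str, pos: usize) -> Vec<usize> {
        let mut out = Vec::new();
        for context in self.descendants_or_self(0) {
            let mut matching = self.children[context]
                .iter()
                .copied()
                .filter(|&c| self.names[c] == name);
            // Position 0 never matches because XPath positions start at 1.
            if let Some(hit) = pos.checked_sub(1).and_then(|i| matching.nth(i)) {
                out.push(hit);
            }
        }
        // Path expressions return nodes in document order; indices in this
        // toy tree follow document order, so sorting restores it.
        out.sort_unstable();
        out
    }

    /// Models `(//name)[pos]`: collect every match first, then index once.
    fn whole_sequence(&self, name: &str, pos: usize) -> Option<usize> {
        let all: Vec<usize> = self
            .descendants_or_self(0)
            .into_iter()
            .filter(|&n| self.names[n] == name)
            .collect();
        pos.checked_sub(1).and_then(|i| all.get(i).copied())
    }
}

/// <body><div><div/><div/></div><div/></body>, with the four <div> elements
/// at indices 2, 3, 4, and 5.
fn sample_tree() -> Tree {
    Tree {
        names: vec!["#document", "body", "div", "div", "div", "div"],
        children: vec![vec![1], vec![2, 5], vec![3, 4], vec![], vec![], vec![]],
    }
}

fn main() {
    let tree = sample_tree();
    // //div[1]: the first <div> child of <body> and of the outer <div>
    assert_eq!(tree.per_parent("div", 1), vec![2, 3]);
    // //div[2]: the second <div> child of each parent, in document order
    assert_eq!(tree.per_parent("div", 2), vec![4, 5]);
    // (//div)[1]: only the first <div> in the whole document
    assert_eq!(tree.whole_sequence("div", 1), Some(2));
}
```

When a scrape should return exactly one element, the parenthesized form is usually the one that matches the intent.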

If your use case requires an unimplemented feature, please open an issue on GitHub.

See docs/features-backlog.md for a detailed list of spec gaps, known limitations, and design decisions.
