Overview
spatula is a modern Python library for writing maintainable web scrapers.
Source: https://codeberg.org/jpt/spatula/
Documentation: https://spatula.jpt.sh
Issues: https://codeberg.org/jpt/spatula/issues
Features
- Page-oriented design: Encourages writing understandable & maintainable scrapers.
- Not Just HTML: Provides built-in handlers for common data formats including CSV, JSON, XML, PDF, and Excel. Or write your own.
- Fast HTML parsing: Uses lxml.html for fast, consistent, and reliable parsing of HTML.
- Flexible Data Model Support: Compatible with dataclasses, attrs, pydantic, or bring your own data model classes for storing & validating your scraped data.
- CLI Tools: Offers several CLI utilities that can help streamline the development & testing cycle.
- Fully Typed: Makes full use of Python 3 type annotations.
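To make the data-model point concrete, here is a minimal sketch of a dataclass record that a scraper's processing method could return in place of a plain dict. `Employee` and `row_to_employee` are hypothetical names used only for illustration; the sketch relies solely on the standard library and does not show how spatula itself handles such objects.

```python
from dataclasses import asdict, dataclass


@dataclass
class Employee:
    """Hypothetical record type for illustration; not part of spatula."""

    first: str
    last: str
    position: str


def row_to_employee(cells: list[str]) -> Employee:
    """Build an Employee from the text of three table cells."""
    # strip stray whitespace that often surrounds scraped cell text
    first, last, position = (cell.strip() for cell in cells)
    return Employee(first=first, last=last, position=position)


employee = row_to_employee([" Ada ", "Lovelace", "Engineer"])
# asdict() converts the record back to a plain dict, e.g. for JSON output
print(asdict(employee))
```

A typed record like this catches a missing or misspelled field when the object is built, rather than later when the data is consumed.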
 
Installation
spatula is on PyPI, and can be installed via any standard package management tool:
poetry add spatula
or:
pip install spatula
Example
The following is a fairly simple two-page scrape. Read A First Scraper for a walkthrough of how it was built.
from spatula import HtmlPage, HtmlListPage, CSS, XPath, SelectorError


class EmployeeList(HtmlListPage):
    # by providing this here, it can be omitted on the command line
    # useful in cases where the scraper is only meant for one page
    source = "https://scrapple.fly.dev/staff"
    # each row represents an employee
    selector = CSS("#employees tbody tr")
    def process_item(self, item):
        # this function is called for each <tr> we get from the selector
        # we know each row has exactly 4 <td> cells
        first, last, position, details = item.getchildren()
        return EmployeeDetail(
            dict(
                first=first.text,
                last=last.text,
                position=position.text,
            ),
            source=XPath("./a/@href").match_one(details),
        )
    def get_next_source(self):
        try:
            return XPath("//a[contains(text(), 'Next')]/@href").match_one(self.root)
        except SelectorError:
            # no "Next" link was found, so this is the last page
            return None


class EmployeeDetail(HtmlPage):
    def process_page(self):
        status = CSS("#status").match_one(self.root)
        hired = CSS("#hired").match_one(self.root)
        return dict(
            status=status.text,
            hired=hired.text,
            # self.input is the data passed in from the prior scrape
            **self.input,
        )
    def process_error_response(self, exc):
        # log the failed request as a warning instead of re-raising it
        self.logger.warning(exc)
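In `EmployeeList.get_next_source`, a missing "Next" link means the method returns `None`, which signals that there are no more pages. The sketch below models that contract in plain Python: keep following the next source until `None` comes back. It illustrates the idea only and is not spatula's actual implementation; `NEXT_LINKS` and `follow_pages` are made-up names, and the page URLs are invented.

```python
from typing import Callable, Iterator, Optional

# Hypothetical site map: each page URL maps to its "Next" link, if any.
NEXT_LINKS = {
    "/staff": "/staff?page=2",
    "/staff?page=2": "/staff?page=3",
}


def get_next_source(current: str) -> Optional[str]:
    """Return the next page's URL, or None when there is no 'Next' link."""
    return NEXT_LINKS.get(current)


def follow_pages(
    first: str, get_next: Callable[[str], Optional[str]]
) -> Iterator[str]:
    """Yield each source in turn until get_next reports there are no more."""
    source: Optional[str] = first
    while source is not None:
        yield source
        source = get_next(source)


visited = list(follow_pages("/staff", get_next_source))
print(visited)
```

Returning `None` rather than raising keeps the "last page" case an ordinary outcome instead of an error the caller has to catch.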