Jérôme Belleman

Web Scrapers in Python with Qt and WebKit

25 Sep 2013

It's interesting to see how Qt turns out to be the easiest way to wield the power of WebKit, even when you aren't building a graphical user interface.

1 WebKit?

Web browsers are built around so-called layout engines, which take HTML, JavaScript and all the rest of it and graphically render web pages in a window, as you see them. WebKit is just one of several layout engines on the market. It's used by many modern browsers, and more and more so, as it's actually pretty good: it's fast, copes with a wide variety of web sites, is well supported and happens to be easy to embed in your applications with toolkits such as Qt.

What's tremendous if you do so is that it's not just about loading a page and parsing HTML to get to the information you want. It goes way beyond this: once the page is loaded, you can interact with it, click on elements and enter text, as you would in a real browser. The only difference is that you can choose to do so programmatically.

2 PySide and WebKit

PySide is one solid way of making Qt applications with Python. One sensible way I found to build web scrapers with PySide and WebKit is to design your program so that pages are loaded in the background, since loading can take time. Think of 2 threads:

It seems inconvenient to have each web page loaded in a thread of its own, since Qt widgets all have to live in the main thread. It's not necessary either, since the QWebView.load() function isn't blocking. A pool of QWebViews could instead be managed in the Qt thread. It would create a new QWebView when none exists yet for a page, reuse an existing one when the page has already been loaded, and recycle one when it runs out of them.
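The pool idea can be sketched without any Qt code at all. The class below is only an illustration under my own assumptions: the names ViewPool, make_view and max_views are hypothetical, and recycling the least recently used view is one possible policy among others. In a real program, make_view would presumably be QWebView, and the caller would call load() on the returned view whenever needs_load is true.

```python
from collections import OrderedDict


class ViewPool:
    """Hand out view objects keyed by URL, creating or recycling as needed.

    Hypothetical helper: `make_view` is any zero-argument factory
    (QWebView in a PySide program), `max_views` caps the pool size.
    """

    def __init__(self, make_view, max_views=4):
        self._make_view = make_view
        self._max_views = max_views
        self._views = OrderedDict()  # url -> view, least recently used first

    def get(self, url):
        """Return (view, needs_load) for the given URL."""
        if url in self._views:
            # Page already loaded: reuse the view and mark it recently used.
            self._views.move_to_end(url)
            return self._views[url], False
        if len(self._views) < self._max_views:
            # Room left in the pool: create a new view.
            view = self._make_view()
        else:
            # Pool exhausted: recycle the least recently used view.
            _, view = self._views.popitem(last=False)
        self._views[url] = view
        return view, True
```

Since everything here is meant to run in the Qt thread, the sketch does no locking.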

I had once thought that the loadFinished signal was what we needed to start looking for objects in a web page. Unfortunately, it occasionally fires even when the page hasn't been fully loaded. I haven't been able to find any other signal from QWebView, QWebPage or QWebFrame which would go off only once the page is fully loaded. So I use QTimers to periodically retry after loadFinished has fired.
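The retry logic can be written independently of Qt, which also makes it easy to try out. This is a minimal sketch, not the code I actually run: poll_until_found and all its parameters are hypothetical names. In a PySide program, schedule could be QTimer.singleShot, and find could wrap QWebFrame.findFirstElement() so that it returns None while the element's isNull() is true.

```python
def poll_until_found(find, schedule, on_found, on_give_up,
                     interval_ms=500, max_tries=20):
    """Call find() repeatedly until it returns something truthy.

    find       -- zero-argument callable looking for the object in the page
    schedule   -- callable(msec, fn) asking the event loop to run fn later
    on_found   -- called once with the result of find()
    on_give_up -- called once if max_tries attempts all fail
    """
    state = {"tries": 0}

    def attempt():
        state["tries"] += 1
        result = find()
        if result:
            on_found(result)
        elif state["tries"] >= max_tries:
            on_give_up()
        else:
            # Not there yet: ask the event loop to call us again later.
            schedule(interval_ms, attempt)

    # The first attempt runs immediately, e.g. from a loadFinished slot.
    attempt()
```

The idea is to call this from the slot connected to loadFinished, so that the first lookup happens as soon as the signal fires and later ones are spaced out by the timer. Capping the number of attempts keeps a page which never produces the element from being polled forever.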

3 Notes about Interacting with Web Pages

4 References