Web Scrapers in Python with Qt and WebKit
It's interesting to see how Qt turns out to be the easiest way to wield the power of WebKit, even when you aren't building a graphical user interface.
1 WebKit?
Web browsers are built around so-called layout engines, which take HTML, JavaScript and all the rest of it and graphically render web pages in a window, as you see them. WebKit is just one of several layout engines on the market. It's used by many modern browsers, and increasingly so because it's actually pretty good: it's fast, supports a very large variety of web sites, is well supported, and happens to be easy to embed in your own applications with toolkits such as Qt.
What's tremendous about doing so is that it's not just about loading a page and parsing the HTML to get to the information you want. It goes way beyond that: once the page is loaded, you can interact with it, click on elements and enter text, just as you would with a real browser. Only here, you choose to do it programmatically.
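To give an idea of how little it takes, here is a minimal sketch, assuming PySide 1.x with Qt 4's QtWebKit module (which the next section comes back to) and using http://example.com/ as a stand-in URL: it loads a page in a QWebView and runs a bit of JavaScript against it once it has loaded.

    import sys
    from PySide.QtCore import QUrl
    from PySide.QtGui import QApplication
    from PySide.QtWebKit import QWebView

    app = QApplication(sys.argv)
    view = QWebView()

    def on_load_finished(ok):
        frame = view.page().mainFrame()
        print(frame.title())  # title of the rendered page
        # Interact with the page programmatically, e.g. run some JavaScript.
        frame.evaluateJavaScript("document.body.style.background = 'yellow';")
        app.quit()

    view.loadFinished.connect(on_load_finished)
    view.load(QUrl("http://example.com/"))
    view.show()  # optional, but lets you watch the page render
    app.exec_()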
2 PySide and WebKit
PySide is one solid way of making Qt applications with Python. One design I found sensible for building web scrapers with PySide and WebKit is to let pages load in the background, since loading is a process that can take time. Think of two threads:
- The main thread, which runs Qt. Note that Qt widgets must live in the main thread. This thread makes the data gathered from web pages available to the user interface thread; a shared dictionary protected by a mutex is one way to go.
- The user interface thread. It only asks the Qt thread to load web pages, so the only way it talks to it is by emitting signals. A rough sketch of this layout follows the list.
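Assuming PySide 1.x and Qt 4's QtWebKit, it could look like the following; the Loader class, the results dictionary and the request_load signal are names made up for the illustration, not an established API. The QWebView lives in the main (Qt) thread, and the other thread only emits signals and reads the shared dictionary.

    import sys
    import threading
    import time
    from PySide.QtCore import QObject, QUrl, Signal
    from PySide.QtGui import QApplication
    from PySide.QtWebKit import QWebView

    results = {}                     # data shared between the two threads
    results_lock = threading.Lock()  # the mutex protecting the dictionary

    class Loader(QObject):
        # Emitted from the other thread; the connection is queued across
        # threads, so _load() always runs in the Qt (main) thread.
        request_load = Signal(str)
        finished = Signal()          # used here only to quit cleanly

        def __init__(self):
            super(Loader, self).__init__()
            self.view = QWebView()
            self.view.loadFinished.connect(self._done)
            self.request_load.connect(self._load)

        def _load(self, url):
            self.view.load(QUrl(url))

        def _done(self, ok):
            html = self.view.page().mainFrame().toHtml()
            with results_lock:
                results[self.view.url().toString()] = html

    def scraper(loader):
        # The "user interface" thread: ask for a page, then wait for the data.
        loader.request_load.emit("http://example.com/")
        while True:
            time.sleep(0.5)          # crude polling, good enough for a sketch
            with results_lock:
                if results:
                    print(list(results.keys()))
                    break
        loader.finished.emit()       # ask the Qt thread to stop its event loop

    app = QApplication(sys.argv)
    loader = Loader()
    loader.finished.connect(app.quit)
    threading.Thread(target=scraper, args=(loader,)).start()
    app.exec_()

The point of this layout is that only the Qt thread ever touches the QWebView; the scraper thread never calls into Qt directly, it only emits signals and reads the mutex-protected dictionary.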
It seems inconvenient to have each web page loaded in a thread of its own, since Qt widgets all have to live in the main thread. It isn't necessary either, since the QWebView.load() function isn't blocking. A pool of QWebViews could be managed in the Qt thread: it would create new QWebViews when needed, reuse existing ones whose page has already been loaded, or recycle views when it runs out of them.
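A pool along those lines could look like the sketch below; WebViewPool and its methods are made-up names for illustration, and the recycling policy is deliberately naive.

    from PySide.QtCore import QUrl
    from PySide.QtWebKit import QWebView

    class WebViewPool(object):
        """Keeps a bounded set of QWebViews, all living in the Qt thread."""

        def __init__(self, size=4):
            self.size = size
            self.views = {}                     # url -> QWebView

        def get(self, url):
            view = self.views.get(url)
            if view is not None:
                return view                     # page already loaded or loading
            if len(self.views) < self.size:
                view = QWebView()               # room left: create a new view
            else:
                _, view = self.views.popitem()  # out of views: recycle one
            self.views[url] = view
            view.load(QUrl(url))                # returns immediately, not blocking
            return view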
I had once thought that the loadFinished signal was what we needed to start looking for objects in a web page. Unfortunately, it occasionally fires even when the page hasn't fully loaded, and I haven't been able to find any other signal from QWebViews, QWebPages or QWebFrames that goes off only once the page is fully loaded. So I use QTimers to periodically retry after loadFinished has fired.
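In practice that can be as simple as the sketch below, still assuming PySide 1.x / QtWebKit; wait_for_element() and its arguments are just names chosen for the illustration.

    from PySide.QtCore import QTimer

    def wait_for_element(view, selector, callback, interval=250):
        # Call callback(element) once selector matches something in view.
        def check():
            element = view.page().mainFrame().findFirstElement(selector)
            if element.isNull():
                # Not there yet: loadFinished fired too early, retry later.
                QTimer.singleShot(interval, check)
            else:
                callback(element)

        # Only start polling once loadFinished has fired at least once.
        view.loadFinished.connect(lambda ok: check())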
3 Notes about Interacting with Web Pages
- I've tried several versions of PySide but I've never seen the QWebElement.evaluateJavaScript('click()') call work, even though it's advertised in the documentation. However, I've been much more successful with e.g. document.getElementsByTagName('button')[0].click(); (a short sketch of these workarounds follows this list).
- Selectors don't always work as expected. You would normally expect QWebFrame.findFirstElement('.button') to find BUTTON HTML elements, but in fact it won't work and you really have to use QWebFrame.findFirstElement('button'). Note the absence of the dot.
- ID selectors, i.e. those with a leading #, won't work with QWebFrame.findFirstElement() if they contain numbers.
- I haven't been able to write to stdout with the JavaScript console.log() function. However, alert() does work if you need debugging. I suppose this requires QWebView.show().
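Putting those notes together, a short sketch of the workarounds could look like this, assuming PySide 1.x / QtWebKit, a QWebView called view whose page has already loaded, and a page that actually contains a <button> element.

    frame = view.page().mainFrame()

    # Clicking through JavaScript evaluated on the frame, since
    # QWebElement.evaluateJavaScript('click()') never worked for me:
    frame.evaluateJavaScript(
        "document.getElementsByTagName('button')[0].click();")

    # Finding the same element from Python: bare tag name, no leading dot.
    button = frame.findFirstElement('button')
    if not button.isNull():
        print(button.toPlainText())

    # alert() as a poor man's console.log(); probably needs view.show().
    frame.evaluateJavaScript("alert('got here');")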