CrawlBot Wars

Everybody who ever wanted to write a “successful website” (or more recently, thanks to the Web 2.0 hype, a “successful blog”) knows the blessing and curse of crawlers, or bots, that are unleashed by all kinds of entities to scan the web and report the content back to their owners.

Most of these crawlers are run by search engines, such as Google, Microsoft Live Search, Yahoo! and so on. With the widespread use of feeds, at least Google and Yahoo! have added feed-specific crawlers to their standard bots, used to aggregate blogs and other feeds into nice interfaces for their users (think Google Reader). Alongside this kind of crawlers, though, there are less useful, sometimes nastier crawlers that either don’t answer to search engines at all, or answer to search engines whose ethics make one wonder.

Good or bad, at the end of the day you might not want some bots to crawl your site; some Free Software -bigots- activists, for instance, wanted some time ago to exclude the Microsoft bot from their sites (while I have some other ideas), but there are certain bots that are even more worth blocking, like the so-called “marketing bots”.

You might like Web 2.0 or you might not, but certainly lots of people have found the new paradigm of the Web a gold mine for making money out of content others have written – incidentally these are not, as RIAA, MPAA and SIAE insist, the “pirates” that copy music and movies, but rather companies whose objective is to provide other companies with marketing research and data based on the content of blogs and similar services. While some people might be interested in getting their blog scanned by these crawlers anyway, I’d guess that for most users who host their own blog this is just a waste of bandwidth: the crawlers tend to be quite pernicious since they don’t use If-Modified-Since or Etag headers in their requests, and even when they do, they tend to make quite a few requests on the feeds per hour (compare this with Google’s Feedfetcher bot, which requests at most one copy of the same feed per hour – well, unless it is confused by multiple compatibility redirects like it unfortunately is with my main blog).
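For comparison, honouring those headers takes almost no effort when fetching feeds programmatically; here is a minimal sketch using the Universal Feed Parser library, with my own feed URL used purely as an example:

# Minimal sketch of a polite feed fetcher: Universal Feed Parser sends
# If-None-Match / If-Modified-Since for us when we pass back the etag and
# modified values obtained from a previous fetch. The URL is just an example.
import feedparser

FEED_URL = "http://blog.axant.it/feed"

first = feedparser.parse(FEED_URL)

# On the next poll, reuse the validators from the previous response.
second = feedparser.parse(FEED_URL,
                          etag=first.get("etag"),
                          modified=first.get("modified"))

if getattr(second, "status", None) == 304:
    print "feed unchanged, the server sent no content at all"
else:
    print "feed changed, %d entries fetched" % len(second.entries)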

While there is a voluntary exclusion protocol (represented by the omnipresent robots.txt file), only actually “good” robots consider it, while evil or rogue robots can simply ignore it. Also, it might be counter-productive to block rogue robots even when they do look at it. Say a rogue robot wants your data and, to pass as a good one, advertises itself in the User-Agent string, complete with a link to a page explaining what it’s supposedly doing, and honours the exclusion. By excluding it in robots.txt you give it enough information to choose a _different_ User-Agent string that is not listed in the exclusion rules.
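For reference, this is more or less all that “considering” the exclusion protocol amounts to on the crawler’s side; a small sketch using Python’s standard robotparser module, with a made-up bot name and site:

# Sketch of what a well-behaved crawler does before fetching a URL.
# "ExampleBot" and example.com are made up for the sake of the example.
import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("http://www.example.com/robots.txt")
rp.read()

if rp.can_fetch("ExampleBot/1.0", "http://www.example.com/feed"):
    print "robots.txt allows it, fetch away"
else:
    print "excluded by robots.txt, a good bot stops here"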

One way to deal with the problem is to block the requests at the source, having the web server answer straight away with an HTTP 403 (Forbidden) when the bot makes a request. When using the Apache web server, the easiest way to do this is with mod_security and a blacklist rule for rogue robots, similar to the antispam system I’ve been using for a few months already. The one problem I see with this is that Apache’s mod_rewrite seems to be executed _before_ mod_security, which means that for any request that is rewritten by compatibility rules (moved, renamed, …) there is first a 301 response and only after that an actual 403.

I’m currently working on compiling such a blacklist by analysing the logs of my server; the main problem is deciding which crawlers to block and which to keep. When the description page explicitly states they do marketing research, blocking them is quite straightforward; when they _seem_ to provide an actual search service, it’s more of a grey area, and it comes down to checking the behaviour of the bot itself on the site. And then there are the vulnerability scanners.
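The analysis itself is nothing sophisticated; the gist of what I look for in the logs can be sketched in a few lines of Python (assuming Apache’s “combined” log format, and with the log path being just an example):

# Sketch: count hits and 304 (Not Modified) responses per User-Agent from an
# Apache access log in "combined" format. A bot hammering the feeds without
# ever receiving a 304 is almost certainly not sending If-Modified-Since/Etag.
import re
from collections import defaultdict

# ... "GET /feed HTTP/1.1" 200 1234 "referer" "user-agent"
LINE = re.compile(r'"[^"]*" (\d{3}) \S+ "[^"]*" "([^"]*)"\s*$')

hits = defaultdict(int)
conditional = defaultdict(int)

for line in open("/var/log/apache2/access_log"):  # path is just an example
    match = LINE.search(line)
    if not match:
        continue
    status, agent = match.groups()
    hits[agent] += 1
    if status == "304":
        conditional[agent] += 1

for agent, count in sorted(hits.items(), key=lambda item: item[1], reverse=True):
    print "%6d hits, %6d of them 304 - %s" % (count, conditional[agent], agent)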

Still, it doesn’t stop here: given that Google’s description of GoogleBot provides a (quite longish, to be honest) method to verify that a bot actually is the GoogleBot it advertises itself to be, one has to assume that there are rogue bots out there trying to pass for GoogleBot or other good and legitimate bots. This is very likely the case because some websites that are usually visible only to registered users make an exception for search engine crawlers so they can access and index their content.

Malware in particular, looking for backdoors into a web application, is likely to forge the User-Agent of a known good search engine bot (which is likely _not_ blocked by the robots.txt exclusion list), so that it doesn’t set off any alarm in the logs. Finding “fake” search engine bots is therefore likely to be an important step in securing a web server running web applications, whether they are trusted or not.

As far as I know there is currently no way in Apache to check that a request actually comes from the bot it declares to come from. The nslookup method that Google suggests works fine for forensic analysis, but it’s almost impossible to perform properly with Apache itself, and not even mod_security, by itself, can do much about that. On the other hand, there is one thing in the recent 2.5 versions of mod_security that can probably be used to implement an actually working check: Lua script loading. Which is what I’m going to work on as soon as I find some extra free time.
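Just to make it concrete, this is the reverse-then-forward lookup Google describes, done by hand in a few lines of Python; the IP address is only an example pulled out of a log, and a real per-request check (mod_security plus Lua or anything else) would also need some caching to avoid a DNS round-trip for every hit:

# Sketch of the check Google suggests: reverse-resolve the client address,
# make sure the name belongs to Google's crawler domains, then resolve the
# name forward again and check that it points back to the same address.
# The IP below is only an example taken from a log line.
import socket

def is_real_googlebot(address):
    try:
        hostname = socket.gethostbyaddr(address)[0]
    except socket.herror:
        return False
    if not hostname.endswith((".googlebot.com", ".google.com")):
        return False
    try:
        return address in socket.gethostbyname_ex(hostname)[2]
    except socket.gaierror:
        return False

print is_real_googlebot("66.249.66.1")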

Force tw.jquery to include jquery for you and avoid double inclusion if you are using a jquery widget

One big problem when working with ToscaWidgets comes up when you have to use widgets that rely on JavaScript libraries you are already importing for your own use.

That usually ends up in a double import of the JavaScript library, which in the best case is useless and in the worst case breaks everything. Looking around the ToscaWidgets documentation you will find that each Resource has an inject method that inserts the resource inside your template. As JSLink is a subclass of tw.api.Resource, you can inject the needed <script> tag inside your template yourself, and this will let ToscaWidgets know that you have already imported that JavaScript file.

For example, if you are using tw.jquery, you can let ToscaWidgets know that you need jQuery by putting a tw.jquery.jquery_js.inject() call inside the controller method that renders the template using jQuery. If you need jQuery on every page you can simply put it inside the __call__ method of your BaseController class (it is declared inside lib/base.py).
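As a sketch, assuming the standard TG2 quickstart layout, the whole thing boils down to something like this:

# lib/base.py -- make ToscaWidgets aware that jQuery is on every page.
from tg import TGController
from tw.jquery import jquery_js

class BaseController(TGController):
    def __call__(self, environ, start_response):
        # Mark jquery.js as already injected for this request, so any
        # tw.jquery widget rendered later won't add a second <script> tag.
        jquery_js.inject()
        return TGController.__call__(self, environ, start_response)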

This is useful to include the same version of jQuery that tw.jquery will use, but even more so to prevent tw.jquery from including jQuery again when you are already using the library for something other than a ToscaWidgets widget.

A 5-line RSS reader

Recently, while creating AXANT Labs, we decided to put a little RSS aggregator inside the page to mix news from our projects. At first we took a look at Planet, but it was a bit too big for our needs, so we developed this short RSS feed reader using Universal Feed Parser. I’m sharing it as the source is really compact and might be useful in other situations.

import feedparser, operator, time

feeds = ("http://blog.axant.it/feed", "http://www.lscube.org/rss.xml")
feeds = map(lambda x: feedparser.parse(x).entries, feeds)
feeds = reduce(operator.concat, feeds)
feeds = sorted(feeds, lambda x, y: cmp(y.date_parsed, x.date_parsed))
for entry in feeds: print '%s (%s) -> %s' % (entry.title, time.strftime("%Y-%m-%d %H:%M", entry.date_parsed), entry.description)

Autotools Come Home

As Gentoo developers, Luca and I have had experience with a wide range of build systems; while there are obvious degrees of goodness and badness in the build system world, we both prefer autotools over most of the custom build systems, and especially over CMake-based build systems, which seem to be riding high thanks to KDE in the last two years.

I have recently written up my views on build systems, in which I explain why I dislike CMake and why I don’t mind it when it replaces a very bad custom build system. The one reason I gave for using CMake is autotools’ lack of support for the Microsoft Visual C++ compiler, which is needed by some types of projects under Windows (GCC still lacks way too many features); this is starting to become a moot point.

Indeed, if you look at the NEWS file for the latest automake release, 1.11 (unleashed yesterday), there is this note:

- The `depcomp' and `compile' scripts now work with MSVC under MSYS.

This means that when running configure scripts under MSYS (which means having most of the POSIX/GNU tools available under the Windows terminal prompt), it’s possible to use the Microsoft compiler, thanks to the compile wrapper script. Of course this does not mean the features are on par with CMake yet, mostly because all the configure scripts I’ve seen up to now seem to expect GCC or compatible compilers; it will take more complex tests, and especially macro archives, before this can replace the Visual Studio project files. Also, since CMake has a fairly standard way to handle options and extra dependencies, it can offer a GUI to select them, while autotools are still tremendously fragmented in that regard.

Additionally, one of the most-recreated and probably useless features, the Linux-style quiet-but-not-entirely build output, is now implemented directly in automake through the silent-rules option. While I don’t see much point in calling that a killer feature, I’m sure there are people who are interested in it.

While many people seem to think that autotools are dead and that they should disappear, there is actually fairly active development behind them, and the whole thing is likely going to progress and improve over the next months. Maybe I should find the time to try making the compile wrapper script work with Borland’s compiler too, of which I have a license; it would be one feature that CMake is missing.

At any rate, I’ll probably extend my autotools guide for automake 1.11, together with a few extras, in the next few days. And maybe I can continue my Autotools Mythbuster series that I’ve been writing on my blog for a while.

Windows 7: To bother the user or not to bother?

Recently it has been found out that Microsoft keeps a whitelist of processes that can run with administrative privileges without bothering the user with a UAC prompt, on the assumption that the software is secure. While this might be a nice thing to do, I don’t really understand why notepad.exe, mspaint.exe and calc.exe would require administrative rights o_O

Also, since Microsoft’s File Open dialog allows modifying the file system, I don’t think it is a good idea to give administrative rights for free to software that opens files through it (like notepad and mspaint).

You might say the solution is simple: just remove the software that accesses the file system from that whitelist and you will prevent the user from causing damage. Well, this is true, you will prevent the user from doing damage; but since Microsoft has been so kind as to give us APIs like CreateRemoteThread, which by default doesn’t inform the user in any way that their shiny new puzzle game is about to inject code into their uber calc.exe running with administrative rights, it is practically possible to run anything with administrative rights under Windows 7.

I think Microsoft will remove a lot of applications from the whitelist in the near future, but the problem is that the only effective use of UAC is the one that was present in Vista: bothering the user for every single thing that happens. In the meantime the author of the exploit has been so kind as to provide a video of the exploit itself on his page; you can take a look at it here.

Labs is born

I’m glad to say that we finally decided to create labs.axant.it as a common meeting point for all our projects. Labs will have news about every project we are working on, and also news from our blog.

Take a look at it if you are a lscube or pyhp user; you might find interesting news about the project you are following.

On Sprox

Lately I have tried to use Sprox with Elixir.

First of all I have to thank percious: he is incredibly reliable and helpful. There was actually a bug in Sprox that made it treat one-to-many relationships as one-to-one relationships, showing a single selection field instead of a multiple selection field. This can be worked around by changing the field type to sprox.widgets.PropertyMultipleSelectField, but percious has been so kind as to fix it on the fly while I was testing the problem for him, and now Sprox correctly detects the field type by default.
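For the record, the workaround looked more or less like the sketch below; the form, model and relation names are made up, and if I remember correctly __field_widget_types__ is the Sprox hook for overriding the widget of a single field:

# Sketch of the old workaround: force a multiple selection widget on the
# one-to-many relation. "UserForm", "User" and "groups" are made-up names.
from sprox.formbase import AddRecordForm
from sprox.widgets import PropertyMultipleSelectField

from myproject.model import User

class UserForm(AddRecordForm):
    __model__ = User
    # override the widget Sprox would pick for the relation field
    __field_widget_types__ = {'groups': PropertyMultipleSelectField}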

Sadly enough, there is a big problem with Elixir. As Sprox probably creates internal instances of the Entity you pass to it, this causes undesired behaviour. When using plain SQLAlchemy, an object won’t be saved to the database until you add it to the session, but with Elixir creating an object means saving it to the database, and this results in multiple empty entities being saved each time you open a form generated with Sprox. If you have any required field in your entity, your application will crash as it won’t be able to save the empty instance.

In the end I had to switch back from Elixir to DeclarativeBase for my application, and everything worked fine.

Using Elixir with TG2

I had to spend some time to allow a project of ours to use Elixir inside TG2. Maybe someone with more experience than me has a better answer, but I have been able to make Elixir work this way:

First of all I had to make Elixir use my TG2 metadata and session by adding this line to each model file that has a class inheriting from elixir.Entity:

from project_name.model import metadata as __metadata__, DBSession as __session__

Then I had to switch to the model’s __init__.py and add an elixir.setup_all() call to the init_model function, just after DBSession.configure. This is really important, as it makes Elixir create all the SQLAlchemy tables; without it you won’t see anything happen for your Elixir-based models.
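To make that concrete, the init_model function in model/__init__.py ends up looking more or less like this (a sketch based on the TG2 quickstart template, where DBSession and metadata are already defined at module level):

# model/__init__.py -- fragment; DBSession and metadata come from the
# standard TG2 quickstart template earlier in this same file.
import elixir

def init_model(engine):
    """Call me before using any of the tables or classes in the model."""
    DBSession.configure(bind=engine)
    # Let Elixir map every elixir.Entity subclass, using the TG2 metadata
    # and DBSession we exposed to it as __metadata__ and __session__.
    elixir.setup_all()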

Also, we can now import every elixir.Entity-derived class into the model scope, just like we usually do for DeclarativeBase children.