Virus- and ad-filtering in HTTP proxies

With the growing number of Web-based technologies, Web access is becoming ever more important for workgroups and home environments alike, but so is the number of unwanted, abusive ‘technologies’.

Client-side filtering of malicious content (viruses, trojans, other malware, etc.) consumes resources, requires multiple license fees, and takes noticeable effort to manage and keep up to date.

Advertisements embedded in web pages distract the reader and consume a considerable amount of bandwidth. Last but not least, if the non-changing content is cached within the intranet and only the transient data must be fetched, the load on the downlink can be reduced even further.

Some statistics of a popular news/commercial site:

  • Gross content:
    • Shockwave flash: 380 kBytes in 9 files
    • Images: 1900 kBytes in 146 files
    • Html/CSS/JS: 1156 kBytes in 50 files
    • Other: 128 kBytes in 8 files
    • Total: 3564 kBytes in 213 files
  • Advertisement-related, could be filtered:
    • Shockwave flash: 380 kBytes in 9 files
    • Images: 765 kBytes in 55 files
    • Html/CSS/JS: 230 kBytes in 23 files
    • Total: 1374 kBytes in 87 files
  • Non-changing content (design images, 3rd party JS libs, etc.), could be cached:
    • Images: 48 kBytes in 53 files
    • Html/CSS/JS: 408 kBytes in 13 files
    • Total: 456 kBytes in 66 files
  • Net content, must be fetched anyway:
    • Images: 1088 kBytes in 38 files
    • Html/CSS/JS: 518 kBytes in 14 files
    • Other: 128 kBytes in 8 files
    • Total: 1733 kBytes in 60 files

As you can see, 38.5% of the traffic can be spared, 12.8% must be downloaded only once, and only 48.6% has to be fetched on every request, which means roughly a 2x boost of the effective bandwidth, achieved by filtering and caching alone.

System design

To maximise the throughput we must first eliminate the unwanted requests and then strive to perform the resource-expensive tasks as few times as possible. From this principle we can directly derive the following design guidelines:

  • URL-based ad filtering must happen as soon as possible, so neither bandwidth nor processing capacity will be wasted on content that would be discarded anyway.
  • Content-based filtering (like virus scanning) is resource-expensive, so its result must be cached if possible; that way subsequent requests for the same content will not waste processing capacity on re-scanning it again and again.
  • Scalability is important, so we must be able to delegate parts of the process to clustered servers or ‘scanning farms’.
  • This is only a small part of a network architecture, so it must support cooperation with other web caches as well.
  • The content scanner framework must support various scanner engines, and automatic update of the signature database is a plus.
  • Secure (https) connections can be neither cached nor filtered by content, so they may bypass all the steps beyond URL-based filtering.
These guidelines quite clearly determine the system components that fit our needs:

  • privoxy as a URL-based ad filter, because it is non-invasive: it can replace filtered content with override links that make it available if explicitly clicked upon
  • squid as a caching proxy, because of its sophisticated configuration features and its good interoperation with other caches
  • havp as a virus filter, because it supports 11 different scanner engines, including both free and commercial ones

The system will basically be a chain of these, each referring to the next one as its parent proxy. In our one-machine proof-of-concept setup they will use ‘localhost’ as server names, but if distributed between several servers, these may be the IPs of the next-level machines, or (in the case of scanning farms) even server names that resolve to different IPs in a round-robin fashion. (Configuring such DNS resolution is beyond the scope of this article; please refer to the BIND 9 Administrator Reference Manual for examples.)
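
Just to illustrate the idea, such a round-robin name for a scanning farm could be published as multiple A records in a BIND zone file; the name ‘scanners’ and the addresses below are only placeholders, and BIND rotates (or randomises, depending on its rrset-order settings) the order in which it returns them:

; hypothetical round-robin entry for a scanning farm
scanners    IN  A   192.0.2.11
scanners    IN  A   192.0.2.12
scanners    IN  A   192.0.2.13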

      First step: privoxy

      As privoxy is available in the standard Debian distribution, we installed it using

      apt-get install privoxy

      and then configured it by editing

      /etc/privoxy/config

      We set it to listen on tcp port 3128

      listen-address :3128

      use localhost:3129 as a parent proxy for http connections (that will be squid)

      forward / 127.0.0.1:3129

      and pass https directly to the origin server

      forward :443 .

All these options are well documented in the config file itself, so for a detailed description please refer to the comments in it.
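
After a restart the new settings take effect; you can then check that privoxy is really listening on the chosen port, for example with netstat (the init script name and the netstat options are the usual Debian ones, adjust if your system differs). Note that plain http requests will only succeed once squid and havp are also up, since privoxy already forwards everything to :3129.

/etc/init.d/privoxy restart
netstat -tlnp | grep 3128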

      Although privoxy is shipped with a reasonably good default setup, you may want to add your local filtering rules to the end of

      /etc/privoxy/user.action

      like for example:

      {+block{Spam content filtered}}
      .adocean.pl
      .gemius.pl
      .amung.us
      .tynt.com
      .sitemeter.com
      .delicious.com
      safebrowsing-cache.google.com
      safebrowsing.clients.google.com
      .specificclick.net
      .scorecardresearch.com
      .digitalpoint.com
      .nemvaltozik.hu
      fxfeeds.mozilla.com
      feeds.bbci.co.uk
      newsrss.bbc.co.uk
      *banner*
      */.*banner
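
Privoxy notices changes to its action files on the fly (or you can simply restart it to be sure), so a blocked URL can be tested right away with curl. The host below is just an example that matches the ‘.adocean.pl’ pattern above; the expected result is privoxy’s own block response (a 403 status by default) rather than real content:

# this request matches a blocked pattern, so it never leaves privoxy
curl -x http://127.0.0.1:3128 -s -o /dev/null -w '%{http_code}\n' http://ad.adocean.pl/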
      

      Second step: squid

Squid is also available as a Debian package, so the installation again is merely

      apt-get install squid

      and the configuration happens by editing

      /etc/squid/squid.conf

As no incoming connections are allowed by default, we must enable them (in a real deployment you should of course list your clients’ networks here instead of allowing everything)

      # INSERT YOUR OWN RULE(S) HERE TO ALLOW ACCESS FROM YOUR CLIENTS
      http_access allow all
      

      set a visible hostname for the error message pages

      visible_hostname something.in.your.organisation

      then tell squid to listen on 3129/tcp

      http_port 3129

      and finally direct the outgoing queries to the parent proxy localhost:3130 (that will be havp)

      cache_peer  localhost   parent  3130    7   no-query

      (NOTE: It was not needed for this setup, but squid could also perform client authentication here.)
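
For completeness, a minimal sketch of such authentication with squid’s classic NCSA helper is shown below; the helper path and the password file location are assumptions that vary between squid versions and distributions, and the password file itself can be created with htpasswd from the apache2-utils package. If you use this, the blanket ‘http_access allow all’ above should of course be replaced by the authenticated rule.

# basic proxy authentication against a local password file
# (the helper path is version/distribution dependent)
auth_param basic program /usr/lib/squid/ncsa_auth /etc/squid/passwd
auth_param basic realm Web proxy
acl authenticated proxy_auth REQUIRED
http_access allow authenticated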

      Third step: havp

As you might already have guessed, havp also comes as a standard Debian package

      apt-get install havp clamav-base clamav-freshclam libclamav6

      For an antivirus engine we chose clamav because it’s free and easy to set up, but if you have other preferences, havp supports Clamav, F-Prot, AVG, Sophie, Trophie, NOD32, AVast, ARCAVir and DrWeb, so the choice is yours.

      Configuration goes the usual way, edit

      /etc/havp/havp.config

      Tell havp to listen on 3130/tcp, enable the libclamav-based scanner backend and set the basic performance parameters

      PORT 3130
      SERVERNUMBER 20
      MAXSERVERS 100
      ENABLECLAMLIB true
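
havp has to be restarted to pick up these settings; with the usual Debian init script and a netstat check you can verify that it is listening on the new port:

/etc/init.d/havp restart
netstat -tlnp | grep 3130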
      

      The default configs of libclamav and freshclam were just fine for us, but if you would like to change some of their settings, you may find the config files at

      /etc/clamav/
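
The clamav-freshclam package keeps the signature database up to date on its own; if you want to force an update immediately, you can also run freshclam by hand (stop or restart the daemon first if it complains about the locked log file):

# one-off manual signature update
freshclam
# or simply restart the daemon shipped with the Debian package
/etc/init.d/clamav-freshclam restart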

      This setup now works for ‘http://’ and ‘https://’ connections, but what about ‘ftp://’? Everything goes fine until those requests arrive at havp, but then they are rejected because havp can talk only http, so it cannot connect to an ftp server.

To overcome this obstacle, we could introduce another parent proxy for havp itself: havp would only talk http to it, and this parent proxy would then talk ftp towards the origin server. This could be done by (for example) another squid instance, but it would be overkill to manage and operate such a big server for such a small purpose. We already have a squid running, so would it be possible to use that one for routing the requests sent by havp as well?

The answer is affirmative: it is possible. However, the issue is a bit more complex than it seems at first glance:

      • There would be infinite loops like ‘client-privoxy-squid-havp-squid-havp-squid-…’
      • All content would pass through squid two times:
        1. First when squid fetches it from the origin server for havp
        2. Second when squid fetches it from havp for privoxy

As squid is primarily a caching web proxy, it would cache the (not yet virus-checked) content on the first pass, and then serve this unchecked content to privoxy on a cache hit.

To solve this issue, squid must

• Distinguish the requests coming from privoxy from the ones coming from havp
• Neither cache the requests coming from havp nor forward them back to havp

To mark the requests that come from havp we can use their source IP address: if havp sends them from an IP different from the one privoxy uses, then squid can simply handle them differently. For this we must configure havp again

      Tell it to use 127.0.0.2 as a source address for its outgoing requests

      SOURCE_ADDRESS 127.0.0.2

      and tell it to use the squid as a parent proxy

      PARENTPROXY 127.0.0.1
      PARENTPORT 3129
      

In a distributed environment where havp instances run on dedicated machines, there is no need for such ‘127.0.0.2’ trickery; the IPs or subnets of those scanner machines can be used directly. (In fact, this is exactly the infrastructure we are now simulating on this single machine…)

These changes alone would still cause both the looping and the caching problem, so we must also adjust the squid configuration once more.

      Fourth step: squid again

Now we must make squid recognise the traffic that comes from havp

      acl from_havp src 127.0.0.2/32

make sure squid never caches such traffic

      cache deny from_havp

      and finally prevent squid from forwarding such traffic towards havp again

      always_direct allow from_havp
      never_direct allow !from_havp
      

This way the requests from havp (and only those) will be served directly from the origin server, while all other requests will be passed to havp.
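
In the distributed case mentioned earlier the same three directives work unchanged; only the acl has to describe the scanner machines instead of the loopback alias, for example (the subnet below is of course just an illustration):

# hypothetical scanning-farm subnet instead of the 127.0.0.2 alias
acl from_havp src 192.0.2.0/24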

      Summary

Using this setup, the lifecycle of a request goes like this (a quick end-to-end check is sketched after the list):

1. The request arrives at :3128, to privoxy, from the client
2. If it is considered spam, then an error page/image/link is returned immediately
      3. Otherwise the request is passed to :3129, to squid
      4. If it is already present in squid’s cache, the content is returned at this point
      5. Otherwise the request is passed to :3130, to havp
      6. HAVP again asks squid to fetch the content
  1. The new request arrives at :3129, to squid, but now from the source IP 127.0.0.2
  2. Squid recognises the source IP and disables caching for this request
        3. Squid fetches the content directly from the origin server and returns it to HAVP
      7. HAVP scans the content
      8. If it is infected
        • If it is curable, then HAVP removes the infection
        • Otherwise HAVP generates an error report page
      9. HAVP returns the scanned and safe content to squid
      10. Squid adds the content to its cache for future requests
      11. Squid returns the safe and cached content to privoxy
      12. Privoxy returns the content to the client
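
A quick way to exercise the whole chain is to fetch a normal page and then the standard EICAR test string (a harmless file that every scanner detects, available from eicar.org; the exact URL may have changed since this was written) through privoxy:

# normal content: should come back with a 200 status and end up in squid's cache
curl -x http://127.0.0.1:3128 -s -o /dev/null -w '%{http_code}\n' http://www.example.com/
# EICAR test file: havp should replace it with its own warning page
curl -x http://127.0.0.1:3128 -s http://www.eicar.org/download/eicar.com | head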

The system consists only of free and open-source components, is relatively easy to set up and configure, scales quite well, and integrates well with other systems, so it can be recommended both for home use and as part of a corporate IT infrastructure.
