Web site ripping

For several days our Apache web server kept going down. Stopped. No HTML document for you! CPU at 100%, sorry mate!

Our server runs Apache as a frontend, with mod_jk connecting to Tomcat, where a DSpace application (Java) is running.

The backend database is a PostgreSQL 9.1 server. Everything runs on a Red Hat 6 virtual machine on VMware.

From our Apache log, /var/log/httpd/access_log, we saw that there were hundreds of thousands of GET requests from one (1) single IP address.

We compared the number of entries for the Google search bot with the number for this IP address. The Google search bot had around 4000 entries, while this specific IP had more than 1.2 million in the same time period! So... something had to be done...
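A comparison like this takes only a few lines of scripting. The following is a minimal sketch, not the exact method we used. It assumes the usual Apache log layout, where the client address is the first whitespace-separated field, and the addresses in the sample are made-up documentation addresses.

```python
from collections import Counter


def count_requests_per_client(lines):
    """Count access-log entries per client address (first field of each line)."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if fields:  # skip empty lines
            counts[fields[0]] += 1
    return counts


# Hypothetical sample lines; real input would be the lines of the access_log file.
sample = [
    '203.0.113.7 - - [05/Aug/2016:17:51:52 +0200] "GET /discover HTTP/1.1" 200 49104',
    '203.0.113.7 - - [05/Aug/2016:17:51:52 +0200] "GET /search-filter HTTP/1.1" 200 1234',
    '198.51.100.9 - - [05/Aug/2016:17:51:53 +0200] "GET / HTTP/1.1" 200 512',
]

# Print the busiest clients first.
for client, hits in count_requests_per_client(sample).most_common(10):
    print(client, hits)
```

For the real log, an open file object can be passed in place of the sample list, since the function only iterates over lines.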

My first thought was to just add an entry to the /etc/sysconfig/iptables file, something like this (the offending IP address, which goes after -s, is left out here):

-A INPUT -s -p tcp --dport 443 -j DROP

That would block everything from this IP address. But then I thought: they might have a good reason to copy our entire website (which has more than 300 GB of data).

Which is fine, everything is public anyway, as long as they don’t crash the server!

So I thought: what if I let our “visitor” get some web pages now and then, maybe 1-3 per second, while the rest of the GET requests are denied, meaning DROPPED at the TCP stack level?

After trying different rules, I ended up with these, which were effective in this specific situation:

-A INPUT -p tcp --dport 80 -s -m limit --limit 3/s --limit-burst 100 -j ACCEPT
-A INPUT -p tcp --dport 443 -s -m limit --limit 3/s --limit-burst 100 -j ACCEPT
-A INPUT -p tcp --dport 80 -s -j LOG --log-prefix "INPUT:DROP:SITERIPPER" --log-level 6
-A INPUT -p tcp --dport 443 -s -j LOG --log-prefix "INPUT:DROP:SITERIPPER" --log-level 6
-A INPUT -p tcp --dport 80 -s -j DROP
-A INPUT -p tcp --dport 443 -s -j DROP

The first two lines match, and accept the packet, as long as the IP stays within the limit. The limit starts with an allowance of 100 packets (--limit-burst). Once the IP has sent more than 100 TCP packets in a “burst”, which was happening all the time, these lines stop matching. I had to fine-tune --limit-burst up and down between 50 and 500 before ending on 100, and the same with --limit, where I ended up with 3/s. There are no exact numbers here. One just has to fine-tune according to the number of incoming GET requests and TCP packets.

After that, the “bucket” is refilled at a rate of 3 packets per second (the --limit value), up to the burst size, which allows some packets to get through again.
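To get a feeling for how --limit and --limit-burst interact, the bucket can be simulated. This is a simplified model of a token bucket, written for illustration only. It is not the kernel's actual implementation, and the function name and the traffic pattern (100 packets per second for 10 seconds) are assumptions made for the example.

```python
def simulate_limit(arrival_times, rate_per_s=3.0, burst=100):
    """Return one verdict per packet: True if the limit rule would match (ACCEPT).

    The bucket starts full with `burst` credits. Each matching packet uses one
    credit, and credits are refilled at `rate_per_s`, never above `burst`.
    """
    credits = float(burst)
    last_time = None
    verdicts = []
    for now in arrival_times:
        if last_time is not None:
            # Refill for the time that passed since the previous packet.
            credits = min(float(burst), credits + (now - last_time) * rate_per_s)
        last_time = now
        if credits >= 1.0:
            credits -= 1.0
            verdicts.append(True)   # within the limit: rule matches
        else:
            verdicts.append(False)  # bucket empty: falls through to LOG and DROP
    return verdicts


# Assumed traffic: 100 packets per second for 10 seconds.
arrivals = [i / 100.0 for i in range(1000)]
verdicts = simulate_limit(arrivals)
print("accepted:", sum(verdicts), "dropped:", len(verdicts) - sum(verdicts))
```

In this model the first 100 packets pass straight through, and after that only about 3 per second are accepted, so most of the heavy traffic falls through to the rules below.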

Lines 3-4 are reached if lines 1-2 above don’t match.
The packet is then logged to the system message file with the prefix “INPUT:DROP:SITERIPPER”, so that you can inspect /var/log/messages (on Red Hat) afterwards to see what happened.

Lines 5-6 finally drop the TCP packets that were not matched by lines 1-2.

What I could see by comparing the Apache /var/log/httpd/access_log file and the /var/log/messages file was that there was a kind of balance. Some GET requests were getting through, but since this site-ripping was quite heavy, most were effectively dropped by the IP filter rules given above.

This is a typical entry I see in the Apache log file after implementing the new rules, 1-3 per second:

- - [05/Aug/2016:17:51:52 +0200] "GET /discover?field=author&filter=Abolins%2C+Maris&filter_0=Abda.......many-many.....subject&filtertype_8=subject&filtertype_9=subject HTTP/1.1" 200 49104

compared to the 50-100 entries per second before.

In /var/log/messages, I now see entries like this:
Aug 5 17:53:54 bibliotheca kernel: INPUT:DROP:IN=eth0 OUT= MAC=00:50:56:8b:25:14:84:78:ac:17:c3:c1:08:00 SRC= DST= LEN=52 TOS=0x00 PREC=0x00 TTL=49 ID=60058 DF PROTO=TCP SPT=45545 DPT=443 WINDOW=542 RES=0x00 ACK URGP=0
which tells me that TCP packets are indeed dropped.
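If you want numbers rather than eyeballing the file, entries like this are easy to pick apart, since the kernel writes them as KEY=value pairs. The sketch below is only an illustration: the helper names are made up, it reuses the entry shown above as test input, and it counts dropped packets per destination port (DPT).

```python
import re
from collections import Counter

# KEY=value pairs as written by the kernel; a value may be empty (e.g. "OUT=").
FIELD = re.compile(r"(\w+)=(\S*)")


def parse_fields(line):
    """Return the KEY=value pairs of one log entry as a dict."""
    return dict(FIELD.findall(line))


def dropped_per_port(lines, prefix="INPUT:DROP:"):
    """Count entries carrying the log prefix, grouped by destination port."""
    counts = Counter()
    for line in lines:
        if prefix in line:
            counts[parse_fields(line).get("DPT", "?")] += 1
    return counts


entry = (
    "Aug 5 17:53:54 bibliotheca kernel: INPUT:DROP:IN=eth0 OUT= "
    "MAC=00:50:56:8b:25:14:84:78:ac:17:c3:c1:08:00 SRC= DST= LEN=52 "
    "TOS=0x00 PREC=0x00 TTL=49 ID=60058 DF PROTO=TCP SPT=45545 DPT=443 "
    "WINDOW=542 RES=0x00 ACK URGP=0"
)

print(dropped_per_port([entry]))
```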


I can understand that someone wants to copy our entire website, that is fine. But if you do it aggressively, our server will have problems responding to other visitors.

Other visitors will experience a slow web-server, and in the worst case no response at all.

It is easy to just block someone out by using DROP rules in Linux firewalls like iptables, but at the same time, this site-ripper might have a good reason for trying to copy our website. Our website has thousands of PDF files that contain research content open to everyone. I can understand that they would like to have them.

In total there is about 300 GB on our website, so site-ripping might not be the best strategy?

And also, by the way, we have a robots.txt file that explains which URIs a robot should avoid and not try to index. Look here: https://bora.uib.no/robots.txt

In the DSpace case, trying to index /discover and /search-filter is not really what you want to do, because it might take forever!

A robots.txt is there to tell what is reasonable to index. Ignoring the robots.txt is not a good thing!
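Respecting a robots.txt does not take much effort on the crawler side. As a small sketch, Python's standard urllib.robotparser module can answer the question "may I fetch this URL?". The rules below are hypothetical, modelled on the two paths mentioned above (the real file may say more or something different), and the host name and user agent are placeholders.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules based on the two paths mentioned above;
# the real robots.txt may contain different or additional rules.
rules = [
    "User-agent: *",
    "Disallow: /discover",
    "Disallow: /search-filter",
]

parser = RobotFileParser()
parser.parse(rules)


def allowed(url, agent="example-crawler"):
    """Return True if the parsed rules allow this (placeholder) agent to fetch url."""
    return parser.can_fetch(agent, url)


# Placeholder host name, used for illustration only.
print(allowed("https://repository.example.org/discover?field=author"))
print(allowed("https://repository.example.org/handle/123/456"))
```

A crawler that runs a check like this before every request would never end up in the endless /discover filter combinations in the first place.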

Maybe they should just contact us, and we could give them a link to download everything? 🙂
