Apache conf serving static files with question mark in filename

You may have files on your filesystem with a question mark (‘?’) in their names, for example:

index.html?q=somestring

This can occur when you have site-ripped a PHP-based website with, for instance, wget or other tools. If you access a URL with ‘?’ in it, the web server treats the string after the question mark as a query string. This will not work, since there is no PHP, and the result is that you always get the index.html file, not the file named “index.html?q=somestring”.

This can be solved with an Apache RewriteCond/RewriteRule configuration. Apache can be configured so that the ‘?’ is interpreted as part of the filename and not as the start of a query string.

Here is an example:
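<Directory "/var/www/html/path/to/files/">
    # Serve files whose names end in ".css?J" as CSS
    AddType text/css .css?J
    RewriteEngine On
    RewriteBase /sites
    # Only rewrite requests that carry a query string
    RewriteCond %{QUERY_STRING} !^$
    # Append an encoded "?" (%3F) and the query string to the path; the trailing "?" drops the original query string
    RewriteRule ^(.*)$ $1\%3F%{QUERY_STRING}? [L]
</Directory>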
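To make the effect of the rule easier to follow, here is a minimal Python sketch of the same idea. It is not what Apache does internally, only an illustration: the function names `rewrite_target` and `filename_on_disk` are made up for this example, and it assumes the encoded `%3F` is decoded back to `?` when the file is looked up.

```python
from urllib.parse import unquote, urlsplit


def rewrite_target(request_url):
    """Mimic the RewriteRule: fold the query string back into the path."""
    parts = urlsplit(request_url)
    if not parts.query:
        # Corresponds to RewriteCond %{QUERY_STRING} !^$ failing: leave the URL alone
        return parts.path
    # $1 + literal "%3F" + %{QUERY_STRING}; the original query string is dropped
    return f"{parts.path}%3F{parts.query}"


def filename_on_disk(request_url):
    """Decode the rewritten path and return the file name that would be looked up."""
    return unquote(rewrite_target(request_url)).rsplit("/", 1)[-1]


print(rewrite_target("/sites/index.html?q=somestring"))
print(filename_on_disk("/sites/index.html?q=somestring"))
```

A request without a query string passes through unchanged, which is why plain index.html keeps working.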

Web site ripping

For several days our Apache webserver went down. Stopped. No html document for you! CPU at 100%, sorry mate!

Our server runs the Apache web server as a frontend, with mod_jk to Tomcat, where a DSpace application (Java) is running.

The backend database is a PostgreSQL 9.1 server. Everything runs on a Red Hat 6 virtual VMware server.

From our Apache log /var/log/httpd/access_log we saw there were hundreds of thousands of GET requests from one (1) single IP address.

We compared the number of entries from the Google search bot and from this IP address. The Google search bot had around 4000 entries, while this specific IP had more than 1.2 million in the same time period! So... something had to be done.
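A comparison like this can be done with a few lines of Python. This is only a sketch: it assumes the usual access log layout where the client IP is the first whitespace-separated field, and the sample lines (including the address 203.0.113.7) are made up for illustration.

```python
from collections import Counter


def requests_per_ip(log_lines):
    """Count access log entries per client IP (first field of each line)."""
    counts = Counter()
    for line in log_lines:
        fields = line.split()
        if fields:
            counts[fields[0]] += 1
    return counts


# Hypothetical sample lines; in practice you would iterate over the access_log file
sample = [
    '138.201.34.82 - - [05/Aug/2016:17:51:52 +0200] "GET /discover?field=author HTTP/1.1" 200 49104',
    '203.0.113.7 - - [05/Aug/2016:17:51:53 +0200] "GET /robots.txt HTTP/1.1" 200 512',
    '138.201.34.82 - - [05/Aug/2016:17:51:53 +0200] "GET /discover?field=subject HTTP/1.1" 200 48211',
]

for ip, count in requests_per_ip(sample).most_common():
    print(ip, count)
```

To run it against the real log, pass an open file object instead of the sample list, for example `requests_per_ip(open("/var/log/httpd/access_log"))`.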

My first thought was to just make an entry in the /etc/sysconfig/iptables file with something like:

-A INPUT -s 138.201.34.82/32 -p tcp --dport 443 -j DROP

That would block everything from this IP address. But then I thought: they might have a good reason to copy our entire website (which has more than 300 GB of data).

Which is fine, everything is public anyway, as long as they don’t crash the server!

So I thought: what if I let our “visitor” get some web pages now and then, maybe 1-3 per second, while the rest of the GET requests are denied, meaning DROPPED at the TCP stack level?

After trying different rules, I ended up with these, which were effective in this specific situation:

-A INPUT -p tcp --dport 80 -s 138.201.34.82/32 -m limit --limit 3/s --limit-burst 100 -j ACCEPT
-A INPUT -p tcp --dport 443 -s 138.201.34.82/32 -m limit --limit 3/s --limit-burst 100 -j ACCEPT
-A INPUT -p tcp --dport 80 -s 138.201.34.82/32 -j LOG --log-prefix "INPUT:DROP:SITERIPPER" --log-level 6
-A INPUT -p tcp --dport 443 -s 138.201.34.82/32 -j LOG --log-prefix "INPUT:DROP:SITERIPPER" --log-level 6
-A INPUT -p tcp --dport 80 -s 138.201.34.82/32 -j DROP
-A INPUT -p tcp --dport 443 -s 138.201.34.82/32 -j DROP

The first two lines match, and ACCEPT the packet, as long as the IP 138.201.34.82 stays within the limit: an initial burst of up to 100 TCP packets, and after that an average of 3 packets per second. This IP exceeded the limit all the time. I had to tune --limit-burst up and down between 50 and 500 before settling on 100, and likewise --limit, where I ended up with 3/s. There are no exact numbers here. One just has to fine-tune according to the number of incoming GET requests and TCP packets.

The limit works like a “bucket”: every accepted packet takes one credit out of it, and it is refilled at 3 credits per second, up to the burst value. Once the burst is used up, packets get through again only at the refill rate.

Lines 3-4 are reached only if lines 1-2 do not match. The packet is then logged to the system message file with the prefix “INPUT:DROP:SITERIPPER”, so that you can inspect /var/log/messages (on Red Hat) afterwards to see what happened.

Lines 5-6 finally drop the TCP packets that were not matched by lines 1-2.
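The bucket behaviour can be illustrated with a small simulation. This is a simplified model, not the kernel's actual implementation of the limit match: the class `TokenBucket` and the function `count_accepted` are names invented for this sketch, and packets are assumed to arrive evenly spaced.

```python
class TokenBucket:
    """Simplified model of '-m limit --limit 3/s --limit-burst 100'."""

    def __init__(self, rate_per_second, burst):
        self.rate = rate_per_second
        self.burst = burst
        self.tokens = float(burst)  # the bucket starts full
        self.last_time = 0.0

    def allow(self, now):
        """Return True if a packet arriving at time 'now' (seconds) is accepted."""
        elapsed = now - self.last_time
        self.last_time = now
        # Refill at the configured rate, but never above the burst value
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True   # rule matches: ACCEPT
        return False      # falls through to the LOG and DROP rules


def count_accepted(packets_per_second, seconds, rate=3, burst=100):
    """Send evenly spaced packets and count how many are accepted."""
    bucket = TokenBucket(rate, burst)
    total = packets_per_second * seconds
    return sum(bucket.allow(i / packets_per_second) for i in range(total))


print("polite client, 2 packets/s for 50 s:", count_accepted(2, 50), "of 100 accepted")
print("heavy client, 50 packets/s for 10 s:", count_accepted(50, 10), "of 500 accepted")
```

A client that stays below 3 packets per second never loses a packet, while a heavy client gets its initial burst and is then throttled to roughly the refill rate.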

What I could see by comparing the Apache /var/log/httpd/access_log file and the /var/log/messages file was that there was a kind of balance. Some GET requests were getting through, but since this site-ripping was quite heavy, most were effectively dropped by the IP filter rules given above.

This is a typical entry I see in the Apache log file after implementing the new rules, at a rate of 1-3 per second:

138.201.34.82 - - [05/Aug/2016:17:51:52 +0200] "GET /discover?field=author&filter=Abolins%2C+Maris&filter_0=Abda.......many-many.....subject&filtertype_8=subject&filtertype_9=subject HTTP/1.1" 200 49104

compared to the 50-100 entries per second before.

In /var/log/messages, I now see entries like this:
Aug 5 17:53:54 bibliotheca kernel: INPUT:DROP:IN=eth0 OUT= MAC=00:50:56:8b:25:14:84:78:ac:17:c3:c1:08:00 SRC=138.201.34.82 DST=129.177.6.72 LEN=52 TOS=0x00 PREC=0x00 TTL=49 ID=60058 DF PROTO=TCP SPT=45545 DPT=443 WINDOW=542 RES=0x00 ACK URGP=0
which tells me that TCP packets are indeed dropped.
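If you want to summarise these entries instead of reading them by eye, the KEY=VALUE pairs are easy to pull apart. The following is a small sketch; the function name `parse_kernel_log_line` is made up for this example, and the regular expression simply assumes fields of the form NAME=value separated by spaces, as in the entry above.

```python
import re

# A field is a word, an equals sign, and a (possibly empty) run of non-space characters
FIELD = re.compile(r"(\w+)=(\S*)")


def parse_kernel_log_line(line):
    """Return the KEY=VALUE fields of an iptables LOG entry as a dict."""
    return dict(FIELD.findall(line))


sample = (
    "Aug 5 17:53:54 bibliotheca kernel: INPUT:DROP:IN=eth0 OUT= "
    "MAC=00:50:56:8b:25:14:84:78:ac:17:c3:c1:08:00 SRC=138.201.34.82 "
    "DST=129.177.6.72 LEN=52 TOS=0x00 PREC=0x00 TTL=49 ID=60058 DF PROTO=TCP "
    "SPT=45545 DPT=443 WINDOW=542 RES=0x00 ACK URGP=0"
)

fields = parse_kernel_log_line(sample)
print(fields["SRC"], "->", fields["DST"], "port", fields["DPT"])
```

Combined with a counter per SRC address, this gives a quick view of how many packets were dropped for each source.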

Conclusion:

I can understand that someone wants to copy our entire website, and that is fine. But if you do it aggressively, our server will have trouble responding to other visitors.

Other visitors will experience a slow web-server, and in the worst case no response at all.

It is easy to just block someone out by using DROP rules in a Linux firewall like iptables, but at the same time, this site-ripper might have a good reason for trying to copy our website. Our website has thousands of PDF files that contain research content open to everyone, which I can understand they would like to have.

In total there is about 300 GB in our website, so the site-rip strategy might not be the best solution?

And also, by the way, we have a robots.txt file that explains which URIs a robot should avoid and not try to index. Look here: https://bora.uib.no/robots.txt

In the DSpace case, trying to index /discover and /search-filter is not really what you want to do, because it might take forever!

A robots.txt is there to tell what is reasonable to index. Ignoring the robots.txt is not a good thing!
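For anyone writing a crawler, respecting robots.txt takes very little code. Here is a sketch using Python's standard `urllib.robotparser`. Note that the rules in `EXAMPLE_ROBOTS_TXT` are only an illustration based on the two paths mentioned above, not a copy of our actual robots.txt, and the user agent name and the `/some/item/page` URL are made up.

```python
from urllib.robotparser import RobotFileParser

# Illustrative rules only; the real file at https://bora.uib.no/robots.txt may differ
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /discover
Disallow: /search-filter
"""


def may_fetch(url, user_agent="ExampleBot"):
    """Return True if the example rules allow this user agent to fetch the URL."""
    parser = RobotFileParser()
    parser.parse(EXAMPLE_ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


print(may_fetch("https://bora.uib.no/discover?field=author"))
print(may_fetch("https://bora.uib.no/some/item/page"))
```

A real crawler would call `set_url()` and `read()` on the parser to download the site's own file instead of parsing a hard-coded string.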

Maybe they should just contact us, and we could give them a link to download everything? 🙂

Idle in transaction – DELETE waiting

We have a Linux server (Red Hat Enterprise 6) running a multi-site installation of MediaWiki. In total, we have 120 unique wikis today, each with its own PostgreSQL 9.1 database.

Suddenly the server stopped responding. The total number of concurrent Apache processes went up to the MaxClients setting in httpd.conf, and people couldn’t log in or see any wiki pages.

In the apache log, we did not see anything special. But, when we started to check which processes were running during a full stop on the server with the Unix command ps, we could see entries like:

A search on the Internet showed that others had experienced this too, and we found the solution to our problem here:

* http://www.mail-archive.com/dspace-tech@lists.sourceforge.net/msg00109.html

* https://wiki.duraspace.org/display/DSPACE/Idle+In+Transaction+Problem#IdleInTransactionProblem-Workaround:Killing

We implemented a fix by adding a bash script to /etc/crontab, running every fifth minute.

The script first checks whether there are more than three (3) processes with the description “idle in transaction”.

If so, then the pkill command will stop (“kill”) the oldest one.
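The decision logic described above can be sketched in a few lines of Python. This is not our bash script, just an illustration of the same idea: the function name `pick_process_to_kill`, the tuple layout and the sample process list are all made up, and the sketch only selects a PID rather than killing anything.

```python
def pick_process_to_kill(processes, threshold=3):
    """Pick the oldest 'idle in transaction' process if there are too many.

    processes: list of (pid, seconds_running, command) tuples.
    Returns the PID to stop, or None if the threshold is not exceeded.
    """
    idle = [p for p in processes if "idle in transaction" in p[2]]
    if len(idle) <= threshold:
        return None
    # The oldest process is the one that has been running the longest
    oldest = max(idle, key=lambda p: p[1])
    return oldest[0]


# Hypothetical snapshot of the process list
snapshot = [
    (4101, 930, "postgres: wikiuser wikidb idle in transaction"),
    (4102, 120, "postgres: wikiuser wikidb idle in transaction"),
    (4103, 45, "postgres: wikiuser wikidb idle"),
    (4104, 610, "postgres: wikiuser wikidb idle in transaction"),
    (4105, 15, "postgres: wikiuser wikidb idle in transaction"),
]

print(pick_process_to_kill(snapshot))
```

With three or fewer idle transactions nothing is selected, so normal short-lived transactions are left alone.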

fix-transactions.sh:

The script runs every fifth minute. Here is our entry in the /etc/crontab file.

*/5 * * * * root /usr/share/scripts/fix-transactions/fix-transactions.sh

 

Rewriterule HTTP_HOST in Apache

We had to rewrite some specific domain names in our WordPress multisite.

Most of our domains are of this form: something.w.uib.no

In our case that is nice, because we also have a valid SSL certificate that covers *.w.uib.no.

But in addition we have other domain names of the form something.uib.no, for which we don’t have a valid SSL certificate. When a WordPress administrator tries to log in to:

* http://something.uib.no/wp-login

he/she is met with a scary SSL certificate warning (because we don’t have SSL certificates for these domain names).

Our solution was to redirect this request with some Apache rewrite rules:

RewriteEngine On
RewriteCond %{REQUEST_URI} ^/wp-login\.php [OR]
RewriteCond %{REQUEST_URI} ^/wp-admin$
RewriteCond %{HTTP_HOST} !^(([^.])+\.w\.uib\.no)$
RewriteCond %{HTTP_HOST} ^([a-z.]+)?uib\.no$ [NC]
RewriteRule .? https://%1w.uib.no%{REQUEST_URI} [R=301,L]

which says something like:
If REQUEST_URI starts with /wp-login.php OR is /wp-admin, AND HTTP_HOST (the domain name) is NOT of the form something.w.uib.no, AND HTTP_HOST IS something.uib.no, THEN
redirect to a new URL with status 301 (moved permanently). The %1, which is the group matched by the last RewriteCond above, becomes part of the new URL.
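To try out which requests would be redirected, the conditions can be mimicked with Python's `re` module. This is only a sketch of the logic as described in the paragraph above (including the exclusion of something.w.uib.no); the function name `redirect_target` is made up, and Apache's own regex engine and request handling are not reproduced here.

```python
import re

LOGIN_URI = re.compile(r"^/wp-login\.php")
ADMIN_URI = re.compile(r"^/wp-admin$")
ALREADY_W_HOST = re.compile(r"^(([^.])+\.w\.uib\.no)$")
UIB_HOST = re.compile(r"^([a-z.]+)?uib\.no$", re.IGNORECASE)  # [NC] flag


def redirect_target(host, request_uri):
    """Return the redirect URL, or None if the conditions do not all hold."""
    if not (LOGIN_URI.search(request_uri) or ADMIN_URI.search(request_uri)):
        return None
    if ALREADY_W_HOST.search(host):
        return None
    match = UIB_HOST.search(host)
    if not match:
        return None
    prefix = match.group(1) or ""  # plays the role of %1, e.g. "something."
    return f"https://{prefix}w.uib.no{request_uri}"


print(redirect_target("something.uib.no", "/wp-login.php"))
print(redirect_target("something.w.uib.no", "/wp-login.php"))
print(redirect_target("something.uib.no", "/about"))
```

Only the login and admin URLs on the plain something.uib.no host names produce a redirect; everything else is left untouched.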

In this way we could make sure that users of WordPress instances at http://something.uib.no are redirected to a proper domain name, one with a valid SSL certificate.

See also:
http://www.webforgers.net/mod-rewrite/mod-rewrite-syntax.php
http://support.hostgator.com/articles/cpanel/articles/getting-started/general-help/apache-mod_rewrite-and-examples
http://httpd.apache.org/docs/current/mod/mod_rewrite.html (see: RewriteCond backreferences)

You might ask “Why not just order and install the SSL certificates as needed?”.

The answer is that ordering SSL certificates is a very manual process that takes a lot of administration and time and involves a lot of people. We want to avoid that and keep things as simple as possible.