Get URLs from a website with wget

wget --spider --no-check-certificate --force-html -r -l2 https://website.of.interest.com 2>&1 | grep '^--' | awk '{ print $3 }' | grep -v '\.\(css\|js\|png\|gif\|jpg\)$' >> url-list.txt

which will give something like:
https://website.of.interest.com/en/Category:BRA
https://website.of.interest.com/en/Category:Free_software
https://website.of.interest.com/en/Category:Freeware
https://website.of.interest.com/en/Special:RecentChangesLinked/Software_overview
https://website.of.interest.com/en/Special:WhatLinksHere/Software_overview
https://website.of.interest.com/en/Talk:Software_overview
https://website.of.interest.com/ithelp/index.php?title=Software_overview&action=edit
https://website.of.interest.com/ithelp/index.php?title=Software_overview&action=history
https://website.of.interest.com/ithelp/index.php?title=Software_overview&action=info
https://website.of.interest.com/ithelp/index.php?title=Software_overview&oldid=2586
https://website.of.interest.com/ithelp/index.php?title=Special:Pdfprint&page=Software_overview
https://website.of.interest.com/ithelp/index.php?title=Special:UserLogin&returnto=Software+overview
https://website.of.interest.com/ithelp/index.php?title=Talk:Software_overview&action=edit&redlink=1
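
Note that the command appends (>>) to url-list.txt, so if you run it more than once the list may contain duplicates. A quick way to deduplicate it afterwards:

sort -u url-list.txt -o url-list.txt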

Match email addresses

If you have a file with several lines containing email addresses that you would like to extract, you can do that with the Linux grep command:

File “emails.txt”:

somestreng anotherstring 342345 somename.lastname@domain.com

where somename.lastname@domain.com is the string you would like to extract.

You can do this with:

grep -Eio '\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b' emails.txt

Result:

somename.lastname@domain.com
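
If the file contains many addresses and you only want each unique address once, the same command can be combined with sort, for example:

grep -Eio '\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b' emails.txt | sort -u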

Idle in transaction – DELETE waiting

We have a Linux server (Red Hat Enterprise Linux 6) running a multi-site installation of MediaWiki. In total, we have today 120 unique wikis, each with its own PostgreSQL 9.1 database.

Suddenly the server stopped responding. The number of concurrent Apache processes went up to Apache's MaxClients setting in httpd.conf, and people could not log in or see any wiki pages.

In the Apache log, we did not see anything special. But when we started to check, with the Unix command `ps`, which processes were running while the server was at a complete standstill, we saw entries like:

[bash]
# ps -ef | egrep 'DELETE|idle'
postgres 25106 2502 0 11:12 ? 00:00:00 postgres: ubwiki ubwiki 127.0.0.1(37186) idle in transaction
postgres 25108 2502 0 11:12 ? 00:00:00 postgres: ubwiki ubwiki 127.0.0.1(37189) DELETE waiting
postgres 25447 2502 0 11:13 ? 00:00:00 postgres: ubwiki ubwiki 127.0.0.1(37420) DELETE waiting
postgres 25619 2502 0 11:14 ? 00:00:00 postgres: ubwiki ubwiki 127.0.0.1(37553) DELETE waiting
[/bash]
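
The same sessions can also be seen from inside PostgreSQL itself. This is just a sketch, assuming you can run psql as the postgres user; the column names procpid and current_query are the ones used in PostgreSQL 9.1:

[bash]
# List backends that are sitting "idle in transaction"
psql -c "SELECT procpid, usename, query_start, current_query FROM pg_stat_activity WHERE current_query LIKE '%in transaction%';" postgres
[/bash]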

A search on the Internet showed that others had experienced this too, and we found the solution to our problem here:

* http://www.mail-archive.com/dspace-tech@lists.sourceforge.net/msg00109.html

* https://wiki.duraspace.org/display/DSPACE/Idle+In+Transaction+Problem#IdleInTransactionProblem-Workaround:Killing

We implemented a fix by adding a bash script to /etc/crontab, running every five minutes.

The script first checks whether there are three (3) or more processes with the description "idle in transaction".

If so, the pkill command stops ("kills") the oldest one.

fix-transactions.sh:

[bash]
#!/bin/bash
# Count the Postgres backends that are "idle in transaction".
COUNT=$(/usr/bin/pgrep -f 'idle in transaction' | wc -l)
if [ "$COUNT" -gt 2 ]
then
  echo "Found $COUNT 'idle in transaction' processes; killing the oldest one." | mail to@someone
  # -o picks the oldest matching process, -f matches against the full command line.
  /usr/bin/pkill -o -f 'idle in transaction'
fi
[/bash]
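
Before letting cron kill anything, you can check by hand which process pkill -o would pick; pgrep takes the same matching flags but just prints the PID instead of killing it:

[bash]
# Prints the PID of the oldest process whose command line matches
/usr/bin/pgrep -o -f 'idle in transaction'
[/bash]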

The script runs every five minutes. Here is our entry in the /etc/crontab file:

[code lang="bash"]
*/5 * * * * root /usr/share/scripts/fix-transactions/fix-transactions.sh
[/code]


Random numbers with bash

I needed some quick random numbers that would represent a time in the form HH:MM. In my case the time should be between 10:00 and 12:55.

I did this in bash with:

#!/bin/bash
# If the random number is 1, then LAST (the last digit of the minutes) is set to 0.
# Else it is set to 5.
if [ $(( $RANDOM % 2 )) -eq 1 ];then
  LAST=0
else
  LAST=5
fi

# Print a number between 10 and 12,
# then a colon, then a number between 0 and 5,
# and finally either 0 or 5.
echo $(( $RANDOM % 3 + 10 ))":"$(( $RANDOM % 6 ))$LAST
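
A shorter way to get the same kind of value, if plain 5-minute steps are good enough, is a one-liner like this (just a sketch):

# Hour 10-12, then a colon, then minutes 00, 05, 10, ... 55
printf "%02d:%02d\n" $(( RANDOM % 3 + 10 )) $(( (RANDOM % 12) * 5 ))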

Sending an email with a different From: address

Sometimes you have to send an email with a different "From:" address. This can easily be done in Linux. Please see the video below, and read the text below the video for further explanation. If you have questions, just add a comment, and I will be happy to answer 🙂

Was the video helpful? Here are some explanations of the commands that I used:

The command:

export EMAIL="Title Firstname Lastname <differentaddress@validemail.uib.no>"

sets the environment variable EMAIL to what you specify between the two quotation marks. Note that the syntax has to be exact; remember the angle brackets < > around the address.

The command:

echo -e "some text" | mutt -s "subject text" somename@somewhere

means:

Send the text "some text" through a pipe (the character "|") to the program "mutt", which will send the email to somename@somewhere with the subject "subject text" and the body text "some text".

If you need to send to multiple people, separate the email addresses with commas. For example:

echo -e "my message to you" | mutt -s "my subject text" firstperson@something.no,secondperson@somethingelse.no

NB: The Linux command above is written on one line; do not press Enter/Return in the middle of it.

Now you can send your emails with any From: address you like.

Remember: it is still your account that sends the email, so changing the From: address doesn’t hide your real identity in the email system.
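
Putting the two steps together, a complete session could look like this (the name, addresses and texts are just the placeholders from the examples above):

export EMAIL="Title Firstname Lastname <differentaddress@validemail.uib.no>"
echo -e "my message to you" | mutt -s "my subject text" firstperson@something.no,secondperson@somethingelse.no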

Searching in Excel files

I had to find a specific Microsoft Excel spreadsheet among thousands of files on a mapped network drive (O:) on a Windows 7 computer. The problem was that searching and indexing were only performed on local disks like C: and D:, so I could only search for keywords in local Excel files. It is not optimal, because it means copying thousands of Excel files to a local drive (D:), but since I knew it was an Excel file, I only had to copy files of that type. But when you need something, you need something. This is how I solved it.

– Mount the Windows share on a Linux server (requires sudo rights)

– Find the total size of all Excel files on the mapped network drive (so that you know how much space the files will take on your local drive)

– Copy all Excel files from the share to the local disk

– Let Windows 7 index the Excel files locally (this should happen automatically when new files are added)

– Finally, search for the keywords in Windows Explorer and find the Excel spreadsheet in question.

First mount the share:
sudo mount -t cifs -o username=,domain=example.com //WIN_PC_IP/ /mountdir

where /mountdir is any folder name you choose; create the folder first (for example with mkdir). It could be named after your username, for instance.
WIN_PC_IP is the IP address of the Windows computer where your share is located.
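
For example, with a hypothetical share called "files" and a hypothetical user "myuser", the whole mount step could look like this:

sudo mkdir /mountdir
sudo mount -t cifs -o username=myuser,domain=example.com //WIN_PC_IP/files /mountdir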

To find the total size of the files:

cd mountdir

find . -name '*.xls*' -exec ls -l {} \; | awk '{ Total += $5} END { print Total}'
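
If the Linux server has GNU find, the same total can be calculated without starting ls for every single file; a sketch:

find . -iname '*.xls*' -printf '%s\n' | awk '{ Total += $1 } END { print Total }'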

Then find all the Excel files, and copy them to a new folder:
cd ..
mkdir EXCEL-FILES-FOLDER/
find . -iname '*.xls' -exec cp --parents {} EXCEL-FILES-FOLDER/ \;
find . -iname '*.xlsx' -exec cp --parents {} EXCEL-FILES-FOLDER/ \;

Copy the files to local D: drive (using CMD in Windows 7)

D:\>copy "o:\EXCEL-FILES-FOLDER\*" "d:\EXCEL-FILES-FOLDER\"
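
Note that plain copy does not descend into subfolders. Since cp --parents recreated the folder structure from the share, something like xcopy with /s may be needed instead (a sketch, adjust to your own folders):

D:\>xcopy "o:\EXCEL-FILES-FOLDER" "d:\EXCEL-FILES-FOLDER" /s /i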

You might have to wait some hours; if you have a lot of files, the Windows 7 computer might take a day or two before it is finished indexing. "PATIENCE YOU MUST HAVE, my young padawan"…

Open Journal Systems missing email logs

I upgraded our Open Journal Systems installation from 2.2.2 to 2.4.2, and afterwards we saw that the email logs were gone. This was related to:

http://pkp.sfu.ca/support/forum/viewtopic.php?f=8&t=9140

My solution was this:
I set up a new PostgreSQL server in a virtual environment (a VMware test machine) and imported both the old and the new (upgraded) database.

# Log in to Prod server
ssh prodserver
# Dump the related and upgraded database to a sql file
pg_dump -U postgres -f /tmp/ojs_new.sql ojs_new
# Copy the file to a Test server where you can play around
scp /tmp/ojs_new.sql username@my_vmware_test_server:
# Log in to your Test server
ssh my_vmware_test_server
# Become Root, the system super user
sudo su -
# Become Postgres user, so that you can do all Postgresql stuff
su - postgres
# Create a user named "ojs"
createuser ojs
# Create the databases
createdb -O ojs ojs_new
createdb -O ojs ojs_old
# Import the databases to Postgresql on the Test Vmware server
psql -f ojs_old.sql ojs_old
psql -f ojs_new.sql ojs_new

ojs_old was the original 2.2.2 database, and ojs_new was the upgraded 2.4.2 one.

In order to fix the assoc_id numbers, I used this bash script:

myscript.sh:

#!/bin/bash
# First, find all log_ids in the old database (ojs_old)
ORIGLIST=`psql -t -d ojs_old -c "SELECT log_id FROM article_email_log" postgres`;
for i in $ORIGLIST;do
  OLD_ARTICLE_ID=`psql -t -d ojs_old -c "SELECT article_id FROM article_email_log WHERE log_id = '$i'" postgres`
  echo -e "psql -d boap_journals -c \"UPDATE email_log SET assoc_id = $OLD_ARTICLE_ID WHERE log_id = $i\" postgres"
done

The output from this script is sent to STDOUT. I redirected it to a file, which became a new script that I ran on our production server.
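
Each line that myscript.sh prints is itself a psql command. A generated line looks something like this (the numbers are made up, just to show the shape):

psql -d boap_journals -c "UPDATE email_log SET assoc_id = 1234 WHERE log_id = 5678" postgres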

ssh my_vmware_test_server
sudo su -
# Send the output of the script to file
./myscript.sh > runme-on-prod-server.sh
# copy the file to the Production server
scp runme-on-prod-server.sh username@prodserver:/tmp/
# log in to the Production server
ssh prodserver
# Become root, be careful!
sudo su -
cd /tmp/
# Run the script! (But remember to do a PG database dump first, so you can restore if something goes terribly wrong!)
# become Postgresql super user
su - postgres
pg_dumpall -f mybackup_just_in_case.sql
# become root again
exit
bash ./runme-on-prod-server.sh