Mirek Soja's Website


Protecting Published E-Mail Addresses Against E-Mail Spider Systems

(Part One)
by Miroslaw M. Soja


First Published: April 6th, 2004

Mystery in a Server's Log File
     While examining my web server's log file one day, I noticed the following lines:

[(...)] "GET /?topic=essays&title=a-rose-for-emily HTTP/1.1" 200 17726
[(...)] "GET /?topic=chicken2chicken&title=happy-headache HTTP/1.1" 200 11280
[(...)] "GET /?topic=stories HTTP/1.1" 200 14122
[(...)] "GET /?topic=terms HTTP/1.1" 200 14092
[(...)] "GET /?topic=stories&title=thomas-grandmas-house HTTP/1.1" 200 15020
[(...)] "GET /?topic=stories&title=it-was-tomorrow HTTP/1.1" 200 15376
[(...)] "GET /?topic=stories&title=baby-sitter HTTP/1.1" 200 10463
[(...)] "GET /?topic=stories&title=installing-love HTTP/1.1" 200 15637


     At first glance, everything looked fine: nothing unusual, just standard requests for page content from some visitor. However, I know that each listed request for page content should be followed by a sequence of requests for the graphics files displayed on that page. The visitor has therefore hurt my feelings, because it seems that this visitor does not want to see my "beautiful" graphics!

     Seriously, there are many reasons why requests of this type appear in a server's logs. One reason that web designers must be aware of is a request for page content from an e-mail address collecting system. The purpose of such systems is to collect the e-mail addresses presented in web pages and save them in special databases.
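
     The pattern noticed in the log, page requests with no accompanying requests for graphics, can be checked mechanically. The following is only a minimal sketch: it assumes the log uses the Apache Common Log Format (client host as the first field), and the function name, the sample lines, and the list of image extensions are hypothetical.

```php
<?php
// Sketch: flag clients that fetched pages but never fetched an image.
// Assumes Apache Common Log Format; the sample lines below are made up.

function clients_without_images(array $logLines): array
{
    $pages = [];   // client => number of page requests
    $images = [];  // client => number of image requests
    foreach ($logLines as $line) {
        // Capture the client host (first field) and the requested path.
        if (!preg_match('/^(\S+) .*?"GET (\S+) HTTP/', $line, $m)) {
            continue;
        }
        [, $client, $path] = $m;
        $file = (string) parse_url($path, PHP_URL_PATH);
        if (preg_match('/\.(gif|jpe?g|png)$/i', $file)) {
            $images[$client] = ($images[$client] ?? 0) + 1;
        } else {
            $pages[$client] = ($pages[$client] ?? 0) + 1;
        }
    }
    // Clients that requested pages but no graphics at all.
    return array_keys(array_diff_key($pages, $images));
}

$sample = [
    '192.0.2.10 - - [06/Apr/2004:10:00:00 +0000] "GET /?topic=stories HTTP/1.1" 200 14122',
    '192.0.2.10 - - [06/Apr/2004:10:00:01 +0000] "GET /?topic=terms HTTP/1.1" 200 14092',
    '192.0.2.20 - - [06/Apr/2004:10:05:00 +0000] "GET /?topic=stories HTTP/1.1" 200 14122',
    '192.0.2.20 - - [06/Apr/2004:10:05:01 +0000] "GET /images/logo.gif HTTP/1.1" 200 2048',
];

print_r(clients_without_images($sample));
```

     In this made-up sample only the client 192.0.2.10 is reported. A text-only browser or a caching proxy would be flagged too, so the result is a hint rather than proof of harvesting.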

     More precisely, e-mail hunting programs visit a web page and collect all strings that contain the character '@'. Later, these strings are filtered, processed, and saved in databases. As a consequence, many of us receive a lot of 'exciting' offers via e-mail, from web site promotion to credit cards and so on.
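
     To make the threat concrete, the collecting step described above can be sketched in a few lines. This is an illustration only, not the code of any real harvesting program; the function name, the pattern, and the sample markup are assumptions.

```php
<?php
// Sketch of a naive harvester: scan raw page source for anything
// shaped like local-part@domain.tld. All names here are hypothetical.

function harvest_addresses(string $html): array
{
    // Collect every candidate string built around the '@' character.
    $pattern = '/[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/';
    preg_match_all($pattern, $html, $matches);
    // Filter: normalize case and drop duplicates before "saving".
    $unique = array_unique(array_map('strtolower', $matches[0]));
    return array_values($unique);
}

$page = '<p>Write to <a href="mailto:Info@webquake.ca">info@webquake.ca</a>, '
      . 'or meet us @ the office.</p>';

print_r(harvest_addresses($page));
```

     The sample page yields a single address, info@webquake.ca; the lone '@' in "meet us @ the office" is ignored because nothing address-like surrounds it.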

     For this reason, we must develop an efficient way to hide e-mail addresses in the web pages we create for our customers; it is our moral duty to do our best to protect them. In other words, an e-mail address should be available to a web site's real visitors and difficult to detect for programs hunting for e-mail addresses.

E-mail Entry Forms
     A more efficient method of protecting e-mail addresses from e-mail hunting programs is to use contact forms instead of publishing e-mail addresses. However, even though no e-mail address is present in the web page's content, contact forms are not a perfect solution for a simple reason: users are not perfect either.

     People can always make a mistake when typing a return address, for example. As a result, companies and individuals could lose important contacts because an inquiry never receives a reply. Besides, people have become tired of filling in the same blanks (name, ID, e-mail, telephone, and so on); there are too many boring data entry forms on the Internet.

A Top Secret Weapon: Numeric Character References
     In other words, we face a challenge: we must develop a method that efficiently hides published e-mail addresses from spiders scanning our customers' web pages. In this battle we have a weapon: numeric character references.

     Browsers accept numeric character references as valid web page content; therefore, the e-mail address info@webquake.ca can be expressed in an HTML file as follows:

&#105;&#110;&#102;&#111;&#64;&#119;&#101;&#98;&#113;&#117;&#97;&#107;&#101;&#46;&#99;&#97;

     An e-mail address expressed in numeric character references definitely takes more time for e-mail harvesting systems to process, and these programs must scan and process a huge number of files per day. For this reason, in order to save time, it is highly probable that web scanning systems identify e-mail addresses only by the presence of the '@' character.

Solution in PHP
     The PHP library has a function ord(), which returns the ASCII value of the first character of a string; it fits perfectly for implementing the conversion to numeric character references. PHP code for converting a string into a sequence of numeric character references follows:

<?php
// - - - Author: Miroslaw M. Soja a.k.a. Mirek
// - - - Copyright © 2002 Miroslaw M. Soja
// - - - Copyright © 2002 www.mirek.ca
//
// Converts every byte of $str into a decimal numeric character
// reference, e.g. "@" becomes "&#64;".
function strspecialconvertion($str)
{
    $ret = "";
    $indexlimit = strlen($str);
    for ($indx = 0; $indx < $indexlimit; ++$indx)
    {
        // ord() returns the value of the single byte at position $indx.
        $ret .= "&#" . ord($str[$indx]) . ";";
    }
    return $ret;
}
?>
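
     Because ord() looks at one byte at a time, the function above suits plain ASCII strings such as e-mail addresses. A link title containing accented characters in UTF-8 would have each byte encoded separately and would display incorrectly. The sketch below shows one possible code-point-aware variant; it assumes the input is valid UTF-8, and the name ncr_encode_utf8() is hypothetical.

```php
<?php
// Sketch: a code-point-aware encoder for UTF-8 input.
// ncr_encode_utf8() is a hypothetical name, not part of the article's code.

function ncr_encode_utf8(string $utf8): string
{
    // The /u flag splits the string into whole characters, not bytes.
    $chars = preg_split('//u', $utf8, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false) {
        return ''; // input was not valid UTF-8
    }
    $out = '';
    foreach ($chars as $char) {
        $lead = ord($char[0]);
        // Keep only the payload bits of the lead byte.
        if ($lead < 0x80) {
            $codepoint = $lead;          // 1-byte sequence (ASCII)
        } elseif ($lead < 0xE0) {
            $codepoint = $lead & 0x1F;   // 2-byte sequence
        } elseif ($lead < 0xF0) {
            $codepoint = $lead & 0x0F;   // 3-byte sequence
        } else {
            $codepoint = $lead & 0x07;   // 4-byte sequence
        }
        // Each continuation byte contributes six more bits.
        for ($i = 1, $n = strlen($char); $i < $n; $i++) {
            $codepoint = ($codepoint << 6) | (ord($char[$i]) & 0x3F);
        }
        $out .= '&#' . $codepoint . ';';
    }
    return $out;
}

echo ncr_encode_utf8("Andr\u{00E9}"), "\n"; // &#65;&#110;&#100;&#114;&#233;
```

     For pure ASCII input this variant produces the same output as a byte-by-byte conversion, so it could stand in for the title only and leave the address handling unchanged.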

     Next, the PHP code below implements a function emailprotectedlink( $emailaddr, $linktitle ). The function calls the previously mentioned strspecialconvertion($str) to encode the input strings $emailaddr and $linktitle. It returns a string, ready to display in a browser, that holds the encoded e-mail address within the HTML tags <A>....</A>:
<?php
// - - - Author: Miroslaw M. Soja a.k.a. Mirek
// - - - Copyright © 2002 www.mirek.ca
//
// Returns an <A> element whose mailto: address and link text are both
// written as numeric character references.
function emailprotectedlink( $emailaddr, $linktitle="" )
{
    $emailaddr = trim($emailaddr);
    // Fall back to the address itself when no link title is given.
    $title = (trim($linktitle) == "") ? $emailaddr : $linktitle;
    $protectaddr = strspecialconvertion("mailto:" . $emailaddr);
    $protectitle = strspecialconvertion($title);
    $emaillink = "<A href=\"{$protectaddr}\">" . $protectitle . "</A>";
    return $emaillink;
}
?>

     Finally, an example of calling the function within PHP code follows:
    If you wish to contribute your article to our WebQuake, then please contact our
<?php
$emailaddr = "info@webquake.ca";
$linktitle = "Info Department";
echo emailprotectedlink( $emailaddr, $linktitle );
?>.

Results Displayed in a Browser
     If we apply the PHP solution presented above, a browser displays the following result:
        If you wish to contribute your article to our WebQuake, then please contact our Info Department.
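
     The visitor sees an ordinary link, but the page source looks quite different. The sketch below prints the markup for a short address so the difference is visible. The helpers ncr_encode() and protected_link() are hypothetical stand-ins that follow the same approach as the functions in this article, and the address a@b.ca is made up.

```php
<?php
// Sketch: what the page source contains versus what the visitor sees.
// ncr_encode() and protected_link() are hypothetical stand-ins.

function ncr_encode(string $ascii): string
{
    $out = '';
    for ($i = 0, $n = strlen($ascii); $i < $n; $i++) {
        // Each byte becomes a decimal numeric character reference.
        $out .= '&#' . ord($ascii[$i]) . ';';
    }
    return $out;
}

function protected_link(string $address, string $title): string
{
    return '<a href="' . ncr_encode('mailto:' . $address) . '">'
         . ncr_encode($title) . '</a>';
}

$source = protected_link('a@b.ca', 'Hi');

// The markup a spider would download.
echo $source, "\n";

// Neither a literal '@' nor "mailto:" appears in that markup.
var_dump(strpos($source, '@'), strpos($source, 'mailto:'));

// What a browser reconstructs when it renders the page.
echo html_entity_decode($source), "\n";
```

     Both strpos() calls report false, while the last line prints a normal mailto link, which is what the browser effectively works with.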

Conclusion
     At this moment, the presented method provides fairly good protection for e-mail addresses located in web pages. However, the battle is not over, because microprocessors keep getting faster. As a consequence, e-mail harvesting systems will probably be able to run more sophisticated algorithms in the near future and detect even e-mail addresses expressed in numeric character references. For this reason, we must continue our research in order to develop more sophisticated methods of protecting e-mail addresses displayed on web pages.
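
     The kind of "more sophisticated algorithm" anticipated here does not need to be complicated. As a sketch under stated assumptions (the function name, the pattern, and the sample markup are hypothetical), a harvester could decode character references first and scan afterwards:

```php
<?php
// Sketch of the anticipated counter-measure: decode character
// references first, then scan for addresses as usual.
// The function name and sample markup are hypothetical.

function harvest_after_decoding(string $html): array
{
    // html_entity_decode() turns &#64; back into '@', and so on.
    $decoded = html_entity_decode($html, ENT_QUOTES | ENT_HTML401, 'UTF-8');
    $pattern = '/[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/';
    preg_match_all($pattern, $decoded, $m);
    return array_values(array_unique($m[0]));
}

// "mailto:a@b.ca" and "Hi" written as numeric character references.
$protected = '<a href="&#109;&#97;&#105;&#108;&#116;&#111;&#58;'
           . '&#97;&#64;&#98;&#46;&#99;&#97;">&#72;&#105;</a>';

print_r(harvest_after_decoding($protected)); // finds a@b.ca
```

     The extra decoding pass costs the harvester time on every page, which is the trade-off this article relies on.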

References
  1. http://www.w3.org/TR/REC-html40/charset.html#h-5.3.1 World Wide Web Consortium (W3C) -- Numeric Character References
  2. http://www.php.net/manual/en/function.ord.php PHP Documentation -- function ord()
