Vijai Kumar S
A Space to share my views on World, Science and People

Scraping the web for good

Web scraping is an art form, in my opinion. I do it for fun. A lot of webmasters hate scrapers because they can be really abusive, but I simply do it for the sake of automating mundane stuff: things like looking up an IP address in a blacklist (DNSBL), checking whether an IP is part of the Tor network (which can be done with a simple DNS query), or downloading images from discussion boards in forums. Here is a simple one I wrote for doing basic IP lookup stuff.

########################################################
#                    Ip Lookup Tool                    #
#               Author : Vijai Kumar S                 # 
########################################################
import sys  
import urllib  
from BeautifulSoup import BeautifulSoup  
import re  
from netaddr import IPAddress  
import json  
from datetime import date

# Dnsbl Blacklist Check (ipvoid.com)
def blacklist(ipaddr):  
    '''
    Makes use of the blacklist lookup page at www.ipvoid.com.
    I don't know how reliable it is, but something is better
    than nothing.

    Arguments:
      ipaddr --> a string (ipv4 or ipv6 address)
    '''
    black_url = 'http://www.ipvoid.com/scan/'
    url = black_url + ipaddr + '/'
    html = urllib.urlopen(url).read()
    soup = BeautifulSoup(html)
    tags = soup('td')
    result = str(tags[3].text)      

    return result

# Tor Check https://exonerator.torproject.org
def torproject(ipaddr):  
    '''
    Makes use of the Tor lookup service made publicly available
    by the Tor Project. It's a more reliable service and rather
    fast, since it is maintained by the Tor Project themselves.
    I am not aware of any rate limiting at this point.

    Arguments:
    ipaddr --> a string (ipv4 or ipv6 address)
    '''
    tor_url = 'https://exonerator.torproject.org/?ip='
    today = date.today()
    urldate = today.isoformat()
    url = tor_url + ipaddr + '&timestamp=' + urldate

    html = urllib.urlopen(url).read()
    soup = BeautifulSoup(html)
    tags = soup('h3')

    matches = re.findall('Result is (.+)', str(tags[0].text))
    reply = str(matches[0])
    if reply == 'negative':
        torcheck = "No"
    elif reply == 'positive':
        torcheck = "Yes"
    else:
        torcheck = "Error"        

    return torcheck

#Geo IP Lookup
def geolookup(ipaddr):  
    '''
    A rather vanilla function which simply gets the GeoIP data from
    the freegeoip.net API. It returns JSON, which is decoded to get
    the geo info. A lot more could be done with this.

    Arguments:
    ipaddr --> a string (ipv4 or ipv6 address)
    '''
    geo_url = 'https://freegeoip.net/json/'
    valid_url = geo_url + ipaddr 
    geo_data = urllib.urlopen(valid_url).read()
    js = json.loads(str(geo_data))
    city = js["city"]
    region = js["region_name"]
    country = js["country_name"]
    lat = js["latitude"]
    lon = js["longitude"]
    return city,region,country,lat,lon

def main():  
    user_inp = str(raw_input('Enter the ip address :  ')).strip()
    ip = IPAddress(str(user_inp))

    if ip.version == 4:
        ipaddr = user_inp
    elif ip.version == 6:
        ipaddr = user_inp.replace(":","%3A")    
    else:
        print "Please enter only a valid ipv4 or ipv6 address"
        sys.exit()

    tor_result = torproject(ipaddr)
    city_r,region_r,country_r,lat_r,lon_r = geolookup(ipaddr)  
    black_result = blacklist(ipaddr)
    print '[ IPaddress : {} | TorNode : {} | Blacklist : {} | City : {} | Region : {} | Country : {} |\
 Latitude : {} | Longitude : {} ]'.format(ip,tor_result,black_result,city_r,region_r,
                                          country_r,lat_r,lon_r)

if __name__ == '__main__':  
    main()    
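The manual `":"` to `"%3A"` replacement in `main()` can be generalized with the standard library's `quote` function, which percent-encodes any character not marked as safe. A small sketch in Python 3 syntax (in the Python 2 used above, the same function lives at `urllib.quote`; `escape_ip_for_url` is just a name I made up):

```python
from urllib.parse import quote  # urllib.quote in Python 2


def escape_ip_for_url(ipaddr):
    """Percent-encode an IP address for use in a URL query string.

    Passing safe="" makes quote() escape ':' as well, which matches
    the manual replace(":", "%3A") done for IPv6 addresses above.
    """
    return quote(ipaddr, safe="")


print(escape_ip_for_url("2001:db8::1"))  # 2001%3Adb8%3A%3A1
print(escape_ip_for_url("8.8.8.8"))      # 8.8.8.8
```

This way both IPv4 and IPv6 inputs go through the same code path, so the version check only needs to validate the address, not rewrite it.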
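Incidentally, the DNSBL lookup I mentioned at the top doesn't even need scraping: a DNSBL is queried by reversing the IPv4 octets, appending the blacklist zone, and resolving the resulting name; if the name resolves, the IP is listed. A minimal sketch of the idea (zen.spamhaus.org is just an example zone, and the helper names are my own):

```python
import socket


def dnsbl_query_name(ipv4, zone="zen.spamhaus.org"):
    """Build the DNSBL query name: reversed octets + blacklist zone."""
    reversed_octets = ".".join(reversed(ipv4.split(".")))
    return reversed_octets + "." + zone


def is_listed(ipv4, zone="zen.spamhaus.org"):
    """An IP is listed if the query name resolves; NXDOMAIN means clean."""
    try:
        socket.gethostbyname(dnsbl_query_name(ipv4, zone))
        return True
    except socket.gaierror:
        return False


print(dnsbl_query_name("127.0.0.2"))  # 2.0.0.127.zen.spamhaus.org
```

No HTML parsing involved, and it works against any blacklist zone that follows the usual DNSBL convention.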

I started writing these silly scripts mainly because of an online class I am taking on Coursera called Python for Everybody, which takes an informatics approach to teaching Python. It's a really awesome course and I recommend it to anyone interested in learning Python. This course really opened me up to a lot of possibilities.

Since I had only written simple scientific computing programs before, it was a really good experience for me to play around with regular expressions and APIs. Kids these days pick up stuff like this when they are 11 or 12 years old, but I had not even seen a computer when I was that young; I was born into a very poor lower-middle-class family. Still, I have been enjoying this and educating myself from various online sources. I am planning to develop a very simple web application in Python very soon with the help of the Flask web application framework.

I will try to write a lot of silly scripts like this in the future. As for the image scraping script I mentioned earlier, it almost got me banned from the forum since the admins thought I was DoSing the server, so I will not be making that one public immediately. I hope you all love scraping too :)
