10 tips to avoid getting Blocked while Scraping Websites

Data scraping has to be done responsibly. You have to be careful with the website you are scraping, because aggressive scraping can degrade its service. There are free web scrapers on the market that can scrape many websites without getting blocked. Many websites have no anti-scraping mechanism at all, but some do block scrapers because they do not believe in open data access.

One thing to keep in mind: be nice and follow the scraping policies of the website. And if you are building web scrapers for your project or your company, follow these 10 tips before you even start scraping any website.

1. Robots.txt


First of all, you have to understand what a robots.txt file is and what it does. It tells search engine crawlers which pages or files a crawler can or cannot request from a site, and it exists mainly to keep crawlers from overloading the website with requests. The file provides the site's standard rules for crawling; many websites, for instance, explicitly allow Google to crawl them. You can find the file at the root of a domain, e.g. http://example.com/robots.txt. Sometimes a site's robots.txt contains User-agent: * together with Disallow: /, which means the owners do not want anyone to scrape their site. (A short sketch of checking robots.txt from Python follows at the end of this section.)

An anti-scraping mechanism essentially works on one fundamental question: is this a bot or a human? To answer it, it looks at criteria such as the following:

  1. You are scraping pages faster than a human possibly could, so you get classified as a "bot".
  2. You follow the same pattern while scraping — for example, walking through every page of the target domain just to collect images or links.
  3. You keep scraping from the same IP address over a long period of time.
  4. The User-Agent header is missing — for example, because requests come from a bare HTTP client or a headless browser without one set.

If you keep these points in mind while scraping a website, I am pretty sure you will be able to scrape any website on the web.
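
For the robots.txt check described above, Python's standard library already ships a parser. Below is a minimal sketch, assuming a placeholder site and a hypothetical bot name — swap in your real target URL and user agent.

# Minimal sketch: honour robots.txt before fetching a page.
# example.com and "MyScraperBot" are placeholders, not values from this article.
import urllib.robotparser

robots = urllib.robotparser.RobotFileParser()
robots.set_url("http://example.com/robots.txt")
robots.read()

user_agent = "MyScraperBot"                    # hypothetical bot name
page = "http://example.com/some-page"

if robots.can_fetch(user_agent, page):
    delay = robots.crawl_delay(user_agent)     # None if no Crawl-delay rule is set
    print("Allowed to fetch", page, "- crawl delay:", delay)
else:
    print("robots.txt disallows", page)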

2. IP Rotation


Reusing the same IP is the easiest way for an anti-scraping mechanism to catch you red-handed: if every request comes from the same IP, you will be blocked. So use a new IP for each request, and have a pool of at least 10 IPs before making HTTP requests. To avoid getting blocked you can use a proxy rotation service like Scrapingdog or any other proxy service. Below is a small Python snippet that builds a pool of fresh IP addresses to use before making requests.

# Build a pool of free proxies by scraping the ProxyNova list.
# country_code must be a valid ProxyNova country code, e.g. "us".
from bs4 import BeautifulSoup
import requests

country_code = "us"
url = "https://www.proxynova.com/proxy-server-list/country-" + country_code + "/"

respo = requests.get(url).text
soup = BeautifulSoup(respo, "html.parser")

u = []  # resulting pool of proxies
for proxy in soup.find_all("tr"):
    foo = proxy.find_all("td")
    l = {}
    try:
        # The IP cell is wrapped in a document.write(...) call; strip it out.
        l["ip"] = (foo[0].text.replace("\n", "").replace("document.write(", "")
                   .replace(")", "").replace("'", "").replace(";", ""))
    except IndexError:
        l["ip"] = None
    try:
        l["port"] = foo[1].text.replace("\n", "").replace(" ", "")
    except IndexError:
        l["port"] = None
    try:
        l["country"] = foo[5].text.replace("\n", "").replace(" ", "")
    except IndexError:
        l["country"] = None
    if l["port"] is not None:
        u.append(l)

print(u)

This gives you a list of proxies, each with an ip, port, and country. The proxies are pulled for a given country code (you can look up the available country codes on that site). For websites with more advanced bot detection you will have to use mobile or residential proxies; you can again use Scrapingdog for such services. The number of IPs in the world is fixed, but through these services you get access to millions of IPs that can be used to scrape millions of pages. This is the best thing you can do to scrape successfully for a longer period of time.
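
Once you have a pool like the one above, you can route each request through a different proxy. Here is a minimal sketch, assuming u is the proxy list built by the previous snippet and the target URL is a placeholder.

# Sketch: pick a random proxy from the pool for each request.
import random
import requests

def fetch_with_rotation(url, proxy_pool):
    proxy = random.choice(proxy_pool)                         # different proxy per call
    proxy_url = "http://{}:{}".format(proxy["ip"], proxy["port"])
    proxies = {"http": proxy_url, "https": proxy_url}
    return requests.get(url, proxies=proxies, timeout=10)

# Example (placeholder URL; free proxies are often dead, so expect failures):
# response = fetch_with_rotation("http://example.com", u)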

3. User-Agent


The User-Agent request header is a string that lets servers and network peers identify the application, operating system, vendor, and/or version of the requesting user agent. Some websites block requests whose User-Agent does not belong to a major browser, and many will not serve their content at all if no User-Agent is set. You can find your own user agent by searching "what is my user agent" on Google, or by using any online user-agent checker. A sketch of setting (and rotating) the header in Python follows below.
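
Here is a minimal sketch of setting the header with requests. The user-agent strings below are just illustrative examples of browser-like strings, not a maintained or current list.

# Sketch: send (and rotate) a browser-like User-Agent header.
import random
import requests

user_agents = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
]

headers = {"User-Agent": random.choice(user_agents)}          # rotate per request
response = requests.get("http://example.com", headers=headers)
print(response.status_code)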

Additional resources:

  1. List of web scraping proxy services
  2. List of handy web scraping tools
  3. Hotel Aggregator API
  4. Flight API
  5. Guide to web scraping
