
How to get around anti-bot protection and obtain an HTTP response

This article is for educational purposes only.

What is anti-bot protection?

In the context of web surfing, anti-bot protection is a set of methods that identify requests sent by robots and either block the response or alter it. These methods can be implemented in both server-side and client-side code. They prevent a website from being scraped and spammed by automated systems. In rare cases bot developers program their robots to obey what is written in robots.txt; Googlebot, for instance, does.

What is the difference between robot and human requests?

Almost everyone uses a web browser to surf the net. When a user opens a website from organic search results or directly, by typing the URL into the address bar, the browser automatically adds a number of HTTP headers to its GET requests. See the full list of common HTTP headers that can appear in browser requests. Real users also interact with a web page: they click on elements, scroll, resize the window and so on.
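For illustration, a browser request for a page typically carries headers like the ones below. The exact set and values vary by browser, version and site; these are representative values, not taken from a real capture:

GET / HTTP/1.1
Host: example.com
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-US,en;q=0.9
Accept-Encoding: gzip, deflate, br
Referer: https://www.google.com/
Connection: keep-alive

A bare script that fires a plain GET request sends few or none of these headers, and that is exactly what protection code looks for.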

The most popular way to identify a bot request is to check the HTTP headers: if, for instance, the User-Agent header is not set, the request was almost certainly not sent by a browser (Chrome, Firefox, et cetera). So the simplest method to block a non-human request is this: if one of the default headers is not found in the request header list, ignore the request on the backend.
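A server-side filter of that kind could look like the sketch below. This is a minimal illustration only and assumes an ASP.NET Core minimal-API application (the article's own examples use the .NET Framework, so the hosting model here is an assumption):

// Minimal sketch (assumes ASP.NET Core): ignore requests that carry no User-Agent header.
var builder = WebApplication.CreateBuilder(args);
var app = builder.Build();

app.Use(async (context, next) =>
{
    // A completely missing User-Agent almost always means a bare HTTP client, not a browser.
    if (string.IsNullOrEmpty(context.Request.Headers["User-Agent"]))
    {
        context.Response.StatusCode = StatusCodes.Status403Forbidden;
        return;
    }
    await next();
});

app.MapGet("/", () => "Hello, human!");
app.Run();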

Another way to outwit a robot is to track non-human behavior and hide part of the content. For example, delay loading the contact-form section until the user performs some action (a click or a scroll) after the main content has loaded. Behavior tracking of this kind is a job for client-side JavaScript code.

How to convince the content source that a request is performed by a human, not a robot?

Obviously, you can program an HTTP client to add the default headers to its requests. In most cases it is enough to add User-Agent, Accept-Language and Referer. Here is an example using the C# .NET Framework HttpWebRequest class:

using System.IO;
using System.Net;

public string ReadResp(string url)
{
    try
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
        request.Referer = "https://www.google.com/";
        ServicePoint sp = request.ServicePoint;
        ServicePointManager.DefaultConnectionLimit = 10;
        // Accept any server certificate and allow the common SSL/TLS versions.
        ServicePointManager.ServerCertificateValidationCallback = (s, cert, chain, ssl) => true;
        ServicePointManager.SecurityProtocol = SecurityProtocolType.Ssl3 | SecurityProtocolType.Tls | SecurityProtocolType.Tls11 | SecurityProtocolType.Tls12;
        request.AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate | DecompressionMethods.None;
        // Browser-like headers: User-Agent, Accept and Accept-Language.
        request.UserAgent = "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36";
        request.Accept = "text/html,application/xhtml+xml,application/xml;";
        request.Headers.Add(HttpRequestHeader.AcceptLanguage, "ru-RU,ru;q=0.5,en-US;q=0.9,en;q=0.8,uk;");
        sp.CloseConnectionGroup(request.ConnectionGroupName);
        sp.ConnectionLeaseTimeout = 60000;
        // Read the whole response body as a string.
        HttpWebResponse response = (HttpWebResponse)request.GetResponse();
        Stream stream = response.GetResponseStream();
        StreamReader reader = new StreamReader(stream);
        string htmlR = reader.ReadToEnd();
        reader.Close();
        stream.Close();
        response.Close();
        request.Abort();
        return htmlR;
    }
    catch
    {
        // Any network or protocol error simply yields null.
        return null;
    }
}

This can help to break through the backend protection and get a response. But if there is client-side protection too, the robot should be able to simulate human activity. It is no secret that a simple HTTP client cannot click on a web page or run JavaScript code, so we will need to use a browser. Firstly, a browser adds the headers automatically (no need to take care of User-Agent and the others). Secondly, it loads and executes JavaScript. Finally, if it is programmed to act like a human (perform some clicks, for example), it can obtain content that is hidden from robots.

There is also a disadvantage: this method works much more slowly, because it drives an entire browser which, unlike a plain HTTP client, loads all the external files and resources and runs JavaScript.

To control a browser programmatically, WebDriver is used. Complete the necessary steps to set up Selenium WebDriver in .NET; the usual package installation is shown below.
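For reference, installing the Selenium packages from the NuGet Package Manager Console usually looks like this (these are the standard NuGet package names; the ChromeDriver package version should match your installed Chrome):

PM> Install-Package Selenium.WebDriver
PM> Install-Package Selenium.Support
PM> Install-Package Selenium.WebDriver.ChromeDriver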

An example of a method that simulates human behavior and gets the HTML content using Selenium WebDriver in C#:

using System;
using System.Collections.Generic;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;
using OpenQA.Selenium.Interactions;

public string ReadRespWithWebDriver(string url)
{
    // Run Chrome headless, without a visible window or console.
    ChromeOptions options = new ChromeOptions();
    options.AddArguments(new List<string> { "headless", "disable-gpu" });
    ChromeDriverService chromeDriverService = ChromeDriverService.CreateDefaultService();
    chromeDriverService.HideCommandPromptWindow = true;

    // Dispose() quits the browser and the driver process when the method returns.
    using (IWebDriver driver = new ChromeDriver(chromeDriverService, options))
    {
        driver.Manage().Timeouts().ImplicitWait = TimeSpan.FromSeconds(10);

        // A few mouse movements to imitate human activity on the page.
        Actions action = new Actions(driver);
        action.MoveByOffset(5, 5).MoveByOffset(10, 15).MoveByOffset(20, 15);

        driver.Navigate().GoToUrl(url);
        action.Perform();
        return driver.PageSource;
    }
}

What should you use to look like a human on the web?

In web scraping you will get the best performance by using both methods, the plain HTTP client and WebDriver. First try the fastest one, the plain HTTP client; then switch to WebDriver to scrape whatever is left. A possible way to combine them is sketched below.
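A minimal sketch of that combination, reusing the two methods above (the GetHtml wrapper name and the null/empty check as the fallback trigger are illustrative assumptions, not part of the original article):

// Sketch: try the fast plain HTTP client first and fall back to the
// Selenium-based method only when it fails or returns nothing.
public string GetHtml(string url)
{
    string html = ReadResp(url);           // fast path: plain HttpWebRequest
    if (string.IsNullOrEmpty(html))
    {
        html = ReadRespWithWebDriver(url); // slow path: real headless browser
    }
    return html;
}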
