When a crawler accesses a target website, it consumes the target system's resources, degrading performance and, in severe cases, paralyzing the service. A crawler therefore needs a planned crawl strategy covering crawl frequency, crawl content, the target's load, and the target's restrictions. Under the Robots Exclusion Protocol, the site's administrator decides which pages a crawler may reach and collect data from: administrators of public websites who do not want crawler access can declare this in a robots.txt file.
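As a minimal sketch of how a polite crawler might honor both the Robots Exclusion Protocol and the target's load, the following Python snippet uses the standard-library urllib.robotparser to check whether fetching is allowed and to throttle requests. The target URL, the paths, and the 1-second fallback delay are illustrative assumptions, not values from the original text.

```python
import time
import urllib.robotparser

# Hypothetical target site; substitute the site you intend to crawl.
TARGET = "https://example.com"

# Fetch and parse the site's robots.txt (the Robots Exclusion Protocol file).
rp = urllib.robotparser.RobotFileParser()
rp.set_url(f"{TARGET}/robots.txt")
rp.read()

# Respect any Crawl-delay directive; fall back to a conservative default
# (1 second is an assumed value, not a protocol-mandated one).
delay = rp.crawl_delay("*") or 1.0

for path in ["/", "/public/page.html", "/private/data"]:
    url = f"{TARGET}{path}"
    if rp.can_fetch("*", url):
        # An actual fetch would go here; printing keeps the sketch self-contained.
        print(f"allowed: {url}")
        time.sleep(delay)  # throttle to limit load on the target system
    else:
        print(f"disallowed by robots.txt: {url}")
```

Checking can_fetch before every request and sleeping between requests addresses the two concerns the paragraph raises: the administrator's declared restrictions and the resource cost imposed on the target.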
