
# introduction

## section 1

- Web scraping is the automated extraction of data from public websites, which is then exported in a structured format such as a spreadsheet or a database <https://www.geeksforgeeks.org/what-is-web-scraping-and-how-to-use-it/>, at lower cost and higher speed than manual collection. <https://www.zyte.com/learn/what-is-web-scraping/>

#### Advantages:

- Supplies data for machine learning models, furthering the advancement of AI technology. <https://www.zyte.com/learn/what-is-web-scraping/>
- Cost-effective: saves money compared to manual collection
- Time-saving: automates the data collection process
- Supports improved decision making
- Offers customization and flexibility
- Enables scalability for projects of any size
- Provides a competitive advantage
- Facilitates integration with other systems

[https://scrape-it.cloud/blog/pros-and-cons-of-web-scraping]

#### When to use:

1. Lead Generation for Marketing
2. Price Comparison & Competition Monitoring
3. E-Commerce
4. Real Estate
5. Data Analysis
6. Academic Research
7. Training and Testing Data for Machine Learning Projects
8. Sports Betting Odds Analysis

<https://www.webharvy.com/articles/web-scraper-use-cases.html>

#### Types:

Self-built or pre-built web scrapers, browser-extension or software web scrapers, and cloud or local web scrapers. <https://www.geeksforgeeks.org/what-is-web-scraping-and-how-to-use-it/>

#### Why to use:

It is useful when access to the data is limited or when the website does not have an API. <https://www.zyte.com/learn/what-is-web-scraping/?_gl=1*ipwf7c*_up*MQ..*_ga*MTg0NzUxOTU3My4xNjk2MDk2MTE2*_ga_PC3KF6FQ4T*MTY5NjA5NjExNi4xLjAuMTY5NjA5NjExNi4wLjAuMA..#What-is-web-scraping-used-for?>

#### Most common techniques:

- Human copy-and-paste.
- Text pattern matching.
- HTTP programming.
- HTML parsing.
- DOM parsing.
- Vertical aggregation.
- Semantic annotation recognizing.
- Computer vision web-page analysis.
- XPath.

[https://www.imperva.com/learn/application-security/data-scraping/#:~:text=Data scraping%2C or web scraping,applications for automating data scraping.] <https://www.linkedin.com/pulse/what-different-types-web-scraping-approaches-rajat-thakur>

## section 2

- Competitor price monitoring: React quickly to changes, understand the market, and make informed pricing decisions by gathering up-to-date information on competitor prices.
- Monitoring product performance and informing product research and development: Track product performance based on pricing, inventory levels, and customer reviews and ratings. For instance, this knowledge may be needed to adjust prices to remain competitive, or to stop selling a product altogether because of changing customer preferences.
- Better advertisements: Improve advertising by learning about target consumers from other merchants, online communities, or social media platforms. When applied to already-running campaigns, web scraping can quickly compile A/B testing data from several paid platforms.
- Market analysis and machine learning: Use text, images, and other data gathered through website scraping in market research or machine learning.
- Future trend predictions: Identify new trends in what consumers want (such as the season's fashion hues) from news articles, blogs, social media, and the websites of rival companies.
- Competitor analysis: Keep track of your competitors' performance across a range of factors, such as products, product categories, price, product and brand ratings, sale frequency, assortment, and more. Examining a variety of competitors may reveal market gaps.
- Inventory management: Extract catalog information, such as product details, sizing, and color, to add to the website and help keep the inventory current and optimal.

## section 3:

- Bot access: Content owners may disallow scraping by using a “Disallow” directive in robots.txt.
- Captcha or Login: Some websites use a CAPTCHA or a login to gate certain types of content or to limit the number of requests that can be made.
- Web structures: Complicated or constantly shifting web structures may stop a web scraper from working. Further, the use of "+" and drop-downs (e.g. for sizing or color details) makes it difficult to capture all the data.
- Dynamic content: Many scraper tools cannot easily read dynamic pages that involve JavaScript or video content.
- IP blocking or limiting: Some websites may block an IP address or limit the number of actions per IP address to restrict scraping.
- Load speed: Depending on the content of the site or the number of requests, slow loading may cause the scraper to fail.

<https://www.netsolutions.com/insights/top-reasons-to-use-python-web-scraping-for-your-ecommerce-store/#challenges-and-risks-of-python-web-scraping-for-ecommerce-stores>

- Changing website formats and web page structures: Designers make scraping harder by creating more complex designs. A scraper built for an old structure will collect insufficient, irrelevant, or duplicate data. The scraper may also crash if it is not able to handle the workload.
- Large-scale extraction: When the data contains a large number of subcategories, filtering and refining take more work.
- Anti-scraping programs: Some website owners use anti-scraping tools to prevent scraping bots from accessing their data.

<https://blog.apify.com/4-common-e-commerce-web-scraping-challenges-tips-to-navigate/>

## section 4:

There has been a lot of work on solving these scraping problems: libraries and third-party services that route requests through proxies, and many tools that solve CAPTCHAs, either manually or automatically. In addition, other tools can interact with JavaScript-based elements on websites and handle AJAX requests.

## section 5:

One of the most significant challenges in web scraping is the dynamic nature of websites.
E-commerce platforms frequently update their site layouts, add new features, or change the HTML structure of their product pages. These changes can break existing scraping tools, requiring the algorithms to be updated constantly. Adapting to the new structures efficiently takes time and expertise. <https://multilogin.com/blog/large-scale-ecommerce-web-scraping/>

Web page structures are widely divergent, so when you need to scrape multiple websites, or even different pages on the same platform, you might need to build one scraper for each site. Websites also periodically update their content or add new features to improve the user experience and loading speed, which often leads to structural changes on the web pages. A previous scraper might not work for an updated page because web scrapers are set up according to the design of the page. Sometimes even a minor change in the website affects the accuracy of the scraped data and requires you to adjust the scraper. <https://www.octoparse.com/blog/9-web-scraping-challenges>

Until now, most existing software does not solve the generic scraping challenge.
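
To make the fragility described above concrete, here is a minimal sketch of one common mitigation: trying several extraction strategies in order, so that a layout change breaks one strategy without breaking the scraper. It is only an illustration, not a generic solution. The function name `extract_price`, the helper class `_PriceProbe`, the sample HTML snippets, and the choice of a `$` price pattern are all assumptions made for this example.

```python
import json
import re
from html.parser import HTMLParser


class _PriceProbe(HTMLParser):
    """Collects price candidates from two hypothetical page layouts."""

    def __init__(self):
        super().__init__()
        self.json_ld_blocks = []    # text of each <script type="application/ld+json">
        self.itemprop_price = None  # value of <... itemprop="price" content="...">
        self._in_json_ld = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "script" and attrs.get("type") == "application/ld+json":
            self._in_json_ld = True
            self.json_ld_blocks.append("")
        if attrs.get("itemprop") == "price" and attrs.get("content"):
            self.itemprop_price = attrs["content"]

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_json_ld = False

    def handle_data(self, data):
        if self._in_json_ld:
            self.json_ld_blocks[-1] += data


def extract_price(html):
    """Return (price, strategy_name), or (None, None) if every strategy fails."""
    probe = _PriceProbe()
    probe.feed(html)
    probe.close()

    # Strategy 1: structured data embedded in the page.
    for block in probe.json_ld_blocks:
        try:
            data = json.loads(block)
            offers = data.get("offers") if isinstance(data, dict) else None
            if isinstance(offers, dict) and "price" in offers:
                return float(offers["price"]), "json-ld"
        except (json.JSONDecodeError, TypeError, ValueError):
            continue

    # Strategy 2: a microdata attribute on any element.
    if probe.itemprop_price is not None:
        try:
            return float(probe.itemprop_price), "itemprop"
        except ValueError:
            pass

    # Strategy 3: last resort, a text pattern such as "$19.99".
    match = re.search(r"\$\s*(\d+(?:\.\d{1,2})?)", html)
    if match:
        return float(match.group(1)), "regex"

    return None, None


# Two made-up versions of the same product page.
old_layout = '<script type="application/ld+json">{"offers": {"price": "19.99"}}</script>'
new_layout = '<div class="p-2024"><span>Now only $ 17.50</span></div>'

print(extract_price(old_layout))  # (19.99, 'json-ld')
print(extract_price(new_layout))  # (17.5, 'regex')
```

Returning the name of the strategy that succeeded makes it possible to log when a site starts falling through to the weaker fallbacks, which is an early signal that the page structure has changed and the scraper needs adjusting.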