Getting Structured Data from the Internet: Running Web Crawlers/Scrapers on a Big Data Production Scale
Ebook · 539 pages · 3 hours

About this ebook

Utilize web scraping at scale to quickly get unlimited amounts of free data available on the web into a structured format. This book teaches you to use Python scripts to crawl through websites at scale and scrape data from HTML and JavaScript-enabled pages and convert it into structured data formats such as CSV, Excel, JSON, or load it into a SQL database of your choice.

This book goes beyond the basics of web scraping and covers advanced topics such as natural language processing (NLP) and text analytics to extract names of people, places, email addresses, contact details, etc., from a page at production scale, using distributed big data techniques on an Amazon Web Services (AWS)-based cloud infrastructure. It also covers developing a robust data processing and ingestion pipeline on the Common Crawl corpus, a publicly available web crawl dataset containing petabytes of data and hosted on AWS's Registry of Open Data.

Getting Structured Data from the Internet also includes a step-by-step tutorial on deploying your own crawlers using a production web scraping framework (such as Scrapy) and dealing with real-world issues (such as CAPTCHA solving, proxy IP rotation, and more). Code used in the book is provided to help you understand the concepts in practice and write your own web crawler to power your business ideas.
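To give a flavor of the basic workflow the book builds on, here is a minimal sketch (not taken from the book's code repository) that fetches a page with requests, parses it with Beautiful Soup, and writes the extracted fields to CSV. The URL and CSS selectors are placeholders you would adapt to a real site you are allowed to scrape.

```python
import csv

import requests
from bs4 import BeautifulSoup

# Placeholder URL; swap in a real target site (and respect its robots.txt).
URL = "https://example.com/articles"

response = requests.get(URL, headers={"User-Agent": "my-crawler/0.1"}, timeout=30)
response.raise_for_status()

soup = BeautifulSoup(response.text, "lxml")

rows = []
for item in soup.select("article"):  # hypothetical listing markup
    title = item.select_one("h2")
    link = item.select_one("a")
    rows.append({
        "title": title.get_text(strip=True) if title else "",
        "url": link["href"] if link and link.has_attr("href") else "",
    })

# Persist the scraped records as structured data (CSV here; JSON or SQL work too).
with open("articles.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "url"])
    writer.writeheader()
    writer.writerows(rows)
```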


What You Will Learn

  • Understand web scraping, its applications/uses, and how to avoid web scraping altogether by hitting publicly available REST API endpoints to get data directly
  • Develop a web scraper and crawler from scratch using the lxml and Beautiful Soup libraries, and learn about scraping from JavaScript-enabled pages using Selenium
  • Use AWS-based cloud computing with EC2, S3, Athena, SQS, and SNS to analyze, extract, and store useful insights from crawled pages
  • Use SQL on PostgreSQL running on Amazon Relational Database Service (RDS) and on SQLite using SQLAlchemy (see the sketch after this list)
  • Review scikit-learn, Gensim, and spaCy to perform NLP tasks on scraped web pages such as named entity recognition, topic clustering (K-means, agglomerative clustering), topic modeling (LDA, NMF, LSI), topic classification (naive Bayes, gradient boosting classifier), and text similarity (cosine distance-based nearest neighbors)
  • Handle web archival file formats and explore Common Crawl open data on AWS
  • Illustrate practical applications for web crawl data by building a similar-websites tool and a technology profiler similar to builtwith.com
  • Write scripts to create a backlinks database on a web scale similar to Ahrefs.com, Moz.com, Majestic.com, etc., for search engine optimization (SEO), competitor research, and determining website domain authority and ranking
  • Use web crawl data to build a news sentiment analysis system or alternative financial analysis covering stock market trading signals
  • Write a production-ready crawler in Python using the Scrapy framework and deal with practical workarounds for CAPTCHAs, IP rotation, and more
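To illustrate the database-loading step from the SQLAlchemy bullet above, here is a minimal sketch, assuming SQLAlchemy 1.4+ and a hypothetical pages table; the same code targets PostgreSQL on RDS simply by changing the connection URL.

```python
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()

class Page(Base):
    # Hypothetical table for scraped pages.
    __tablename__ = "pages"
    id = Column(Integer, primary_key=True)
    url = Column(String, nullable=False)
    title = Column(String)

# SQLite for local testing; for PostgreSQL on RDS use something like
# "postgresql+psycopg2://user:password@host:5432/dbname".
engine = create_engine("sqlite:///scraped.db")
Base.metadata.create_all(engine)

with Session(engine) as session:
    session.add(Page(url="https://example.com", title="Example Domain"))
    session.commit()
```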


Who This Book Is For

The primary audience is data analysts and scientists with little to no exposure to real-world data processing challenges. The secondary audience is experienced software developers doing web-heavy data processing who need a primer. The tertiary audience is business owners and startup founders who need to know more about implementation to better direct their technical team.

Language: English
Publisher: Apress
Release date: November 12, 2020
ISBN: 9781484265765


    Book preview

    Getting Structured Data from the Internet - Jay M. Patel

    © Jay M. Patel 2020

    J. M. Patel, Getting Structured Data from the Internet, https://doi.org/10.1007/978-1-4842-6576-5_1

    1. Introduction to Web Scraping

    Jay M. Patel, Specrom Analytics, Ahmedabad, India

    In this chapter, you will learn about the common use cases for web scraping. The overall goal of this book is to take raw web crawls and transform them into structured data that can be used to provide actionable insights. We will demonstrate applications of such structured data from a REST API endpoint by performing sentiment analysis on Reddit comments. Lastly, we will talk about the different steps of the web scraping pipeline and how we are going to explore them in this book.
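As a taste of that demonstration (this is a simplified sketch, not the book's own code), the snippet below pulls recent comments from Reddit's public JSON endpoint and scores them with a toy keyword-based sentiment heuristic; the subreddit and word lists are placeholders, and a real pipeline would use a proper sentiment model.

```python
import requests

# Placeholder subreddit and a toy sentiment lexicon for illustration only.
SUBREDDIT = "python"
POSITIVE = {"great", "good", "love", "excellent", "awesome"}
NEGATIVE = {"bad", "terrible", "hate", "awful", "broken"}

resp = requests.get(
    f"https://www.reddit.com/r/{SUBREDDIT}/comments.json?limit=25",
    headers={"User-Agent": "sentiment-demo/0.1"},
    timeout=30,
)
resp.raise_for_status()

for child in resp.json()["data"]["children"]:
    body = child["data"]["body"].lower()
    words = set(body.split())
    score = len(words & POSITIVE) - len(words & NEGATIVE)
    label = "positive" if score > 0 else "negative" if score < 0 else "neutral"
    print(label, body[:60].replace("\n", " "))
```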

    Who uses web scraping?

    Let’s go through examples and use cases for web scraping in different industry domains. This is by no means an exhaustive listing, but I have made an effort to provide examples ranging from those that crawl a handful of websites to those that require crawling a major portion of the visible Internet (web-scale crawls).

    Marketing and lead generation

    Companies like Hunter.io, Voila Norbert, and FindThatLead run crawlers that index a large portion of the visible Internet, and they extract email addresses, person names, and so on to populate an email marketing and lead generation database. They provide an email address lookup service where a user can enter a domain address and see the contacts listed in their database, for a lookup fee of $0.0098–$0.049 per contact. As an example, let us enter my personal website’s address (jaympatel.com) and see the emails it found on that domain address (see Figure 1-1).

    [Figure 1-1: Hunter.io screenshot]

    Hunter.io also provides an email finder service where a user can enter the first and last name of a person of interest at a particular domain address, and it can predict the email address for them based on pattern matching (see Figure 1-2).

    [Figure 1-2: Hunter.io screenshot]
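A toy version of the pattern-matching idea behind such an email finder, purely for illustration (this is not Hunter.io's actual algorithm): generate the common corporate email patterns for a name and domain, then check them against addresses already observed on that domain.

```python
def candidate_emails(first, last, domain):
    """Return common corporate email patterns for a person at a domain."""
    first, last = first.lower(), last.lower()
    patterns = [
        f"{first}.{last}",    # jay.patel
        f"{first}{last}",     # jaypatel
        f"{first[0]}{last}",  # jpatel
        f"{first}",           # jay
        f"{first}_{last}",    # jay_patel
    ]
    return [f"{p}@{domain}" for p in patterns]

# Addresses already scraped for a domain reveal which pattern the company uses.
print(candidate_emails("Jay", "Patel", "jaympatel.com"))
```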

    Search engines

    General-purpose search engines like Google, Bing, and so on run large-scale web scrapers called web crawlers, which go out and grab billions of web pages and index and rank them according to various natural language processing and web graph algorithms. These power not only their core search functionality but also products like Google Ads, Google Translate, and so on. You may be thinking that you have no plans to start another Google, and that’s probably a wise decision, but you should be interested in ranking your business’s website higher on Google. This need to rank high enough on search engines has spawned a lot of web scraping/crawling businesses, which I will discuss in the next couple of sections.

    On-site search and recommendation

    Many websites use third-party providers to power the search box on their website. This is called on-site search in our industry, and some of the SaaS providers are Algolia, Swiftype, and Specrom.

    The idea behind on-site search is simple: the provider runs web crawlers that target only one site and, using algorithms inspired by search engines, returns results pages based on search queries.

    Usually, there is also a JavaScript plugin so that the users can get autocomplete for their entered queries. Pricing is usually based on the number of queries sent as well as the size of the website with a range of $20 to as high as $70 a month for a typical site.

    Many websites and apps also perform on-site searching in house, and the typical technology stacks are based on Elasticsearch, Apache Solr, or Amazon CloudSearch.
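For teams building on-site search in house, the pattern is roughly: crawl your own pages, index each one as a document, and serve search-box queries against that index. Here is a minimal sketch with the official Python client, assuming elasticsearch-py 8.x and a node at localhost:9200; the index name and fields are hypothetical.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Index a crawled page as a document (normally done inside the crawler loop).
es.index(
    index="site-pages",
    id="https://example.com/pricing",
    document={
        "url": "https://example.com/pricing",
        "title": "Pricing",
        "body": "Plans start at $20 per month ...",
    },
)
es.indices.refresh(index="site-pages")

# Answer a search-box query against the index.
hits = es.search(
    index="site-pages",
    query={"multi_match": {"query": "pricing plans", "fields": ["title^2", "body"]}},
)
for hit in hits["hits"]["hits"]:
    print(hit["_score"], hit["_source"]["url"])
```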

    A slightly different product is content recommendation, where the same crawled information is used to power a widget that shows the content most similar to the current page.

    Google Ads and other pay-per-click (PPC) keyword research tools

    Google Ads is an online advertising platform that predominantly sells ads known in the digital marketing field as pay-per-click (PPC): the advertiser pays based on the number of clicks an ad receives rather than the number of times it is shown, which is known as impressions.

    Google, like most PPC advertising platforms, makes money every time a user clicks on one of their ads. Therefore, it’s in the best interest of Google to maximize the ratio of clicks per impressions or click-through rate (CTR).

    However, businesses make money every time one of those users takes an action such as converting into a lead by filling out a form, buying products from your ecommerce store, or personally visiting your brick-and-mortar store or restaurant. This is known as a conversion. A conversion value is the amount of revenue your business earns from a given conversion.

    The real metric advertisers care about is the return on ad spend, or ROAS, which can be defined as the total conversion value divided by your advertising costs. Google makes money based on the number of clicks or impressions, but an advertiser makes money based on conversions. Therefore, it’s in your best interest to write ads that aim not simply for a high click-through rate but for a high conversion rate and a high ROAS.
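The arithmetic behind these definitions is simple enough to sanity-check in a few lines; the campaign numbers below are made up purely for illustration.

```python
# Hypothetical campaign numbers for illustration only.
impressions = 100_000
clicks = 2_000
cost_per_click = 1.50          # dollars
conversions = 60
avg_conversion_value = 120.00  # dollars per conversion

ctr = clicks / impressions                # click-through rate
ad_spend = clicks * cost_per_click        # what you pay the ad platform
conversion_rate = conversions / clicks
total_conversion_value = conversions * avg_conversion_value
roas = total_conversion_value / ad_spend  # return on ad spend

print(f"CTR: {ctr:.2%}, spend: ${ad_spend:,.2f}, "
      f"conversion rate: {conversion_rate:.2%}, ROAS: {roas:.2f}x")
```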

    ROAS is completely dependent on keywords, which can be simply defined as the words or phrases entered in the search bar of a search engine like Google that trigger your ads. A keyword, or a search query as it is commonly known, will result in a results page consisting of Google Ads, followed by organic results. If we Google car insurance, we will see that the top two entries on the results page are Google Ads (see Figure 1-3).

    [Figure 1-3: Google Ads screenshot. Google and the Google logo are registered trademarks of Google LLC, used with permission]

    If your keywords are too broad, you’ll waste a bunch of money on irrelevant clicks. On the other hand, you can block unnecessary clicks by creating a negative keyword list that prevents your ad from being shown when a certain keyword is used in a search query.

    This may sound intuitive, but the cost of running an ad on a given keyword on a cost per click (CPC) basis is directly proportional to what other advertisers are bidding on that keyword. Generally speaking, for transactional keywords, the CPC is directly linked to how much traffic the keyword generates, which in turn drives up its value. If you take the example of transactional keywords for insurance such as car insurance, the high traffic and the buy intent make its CPC one of the highest in the industry at over $50 per click. There are certain keyword queries made of phrases with two or more words, known as long tail keywords, which may see lower search traffic but are still pretty competitive; the simple reason is that longer keywords with prepositions sometimes capture buyer intent better than one- or two-word search queries.

    To accurately calculate ROAS, you need a keyword research tool to get accurate data on (1) what others are bidding in your geographical area of interest on a particular keyword, (2) the search volume associated with a particular keyword, (3) keyword suggestions so that you can find additional long tail keywords, and (4) a negative keyword list of words that, when they appear in a search query, should not trigger your ad. As an example, if someone types free car insurance, that is a signal that they may not buy your car insurance product, and it would be insane to spend $50 on such a click. Hence, you can choose free as a negative keyword, and the ad won’t be shown to anyone who puts free in their search query.
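The negative keyword logic itself is just filtering; here is a toy sketch with made-up word lists.

```python
NEGATIVE_KEYWORDS = {"free", "cheap", "diy"}

queries = [
    "car insurance quotes",
    "free car insurance",
    "best car insurance for new drivers",
]

def triggers_ad(query):
    """The ad is eligible only if no negative keyword appears in the search query."""
    return not (set(query.lower().split()) & NEGATIVE_KEYWORDS)

for q in queries:
    print(q, "->", "show ad" if triggers_ad(q) else "blocked by negative keyword")
```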

    Google’s official keyword research tool, called Keyword Planner, included all of the data I listed here up until a few years ago, when they decided to change tactics and stopped showing exact search data in favor of insanely broad ranges like 10K–100K. You can get more accurate data if you spend more money on Google Ads; in fact, they don’t show any actionable data in the Keyword Planner for new accounts that haven’t spent anything on running ad campaigns.

    This led to more and more users relying on third-party keyword research providers such as Ahrefs’s Keywords Explorer (https://ahrefs.com/keywords-explorer), Ubersuggest (https://neilpatel.com/ubersuggest/), and Keyword Tool (https://keywordtool.io/) that provide in-depth keyword research metrics. Not all of them are upfront about their data sourcing methodologies, but an open secret in the industry is that the data comes from extensively scraping the official Keyword Planner and supplementing it with clickstream and search query data from a sample population across the world. These datasets are not cheap, with pricing going as high as $300/month based on how many keywords you search. However, they are still worth the price due to the unique challenges of scraping Google Keyword Planner and the methodological challenges of combining the data in a way that gives an accurate search volume snapshot.

    Search engine results page (SERP) scrapers

    Many businesses want to check whether their Google Ads are being correctly shown in a specific geographical area. Others want SERP rankings not only for their own pages but also for their competitors’ pages in different geographical areas. Both of these use cases can easily be served by an API service which takes as input a JSON payload with a search engine query and geographical area and returns the SERP page as JSON. There are many providers such as SerpApi, Zenserp, serpstack, and so on, and pricing is around $30 for 5,000 searches. From a technical standpoint, this is nothing but adding a proxy IP address, with CAPTCHA solving if required, to a traditional web scraping stack.
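Consuming such a service is typically a single HTTP call. The endpoint, parameters, and response fields below are hypothetical stand-ins rather than any specific provider's actual API.

```python
import requests

# Hypothetical SERP API; real providers (SerpApi, Zenserp, serpstack, ...) each
# have their own endpoint, parameter names, and authentication scheme.
API_URL = "https://api.example-serp-provider.com/v1/search"
params = {
    "q": "car insurance",
    "location": "Pittsburgh, Pennsylvania, United States",
    "api_key": "YOUR_API_KEY",
}

resp = requests.get(API_URL, params=params, timeout=30)
resp.raise_for_status()
serp = resp.json()

# Hypothetical response shape: a list of organic results with rank, title, and URL.
for result in serp.get("organic_results", []):
    print(result.get("position"), result.get("title"), result.get("url"))
```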

    Search engine optimization (SEO)

    This is a group of techniques whose sole aim is to improve organic rankings on the search engine results pages (SERPs).

    There are dozens of books on SEO and even more blog posts, all describing how to improve your SERP ranking; we’ll restrict our discussions on SEO here to only those factors which directly need web scraping.

    Each search engine uses its own proprietary algorithm to determine rankings, but essentially the main factors are relevance, trust, and authority. Let us go through them in greater detail.

    Relevance

    These are a group of factors that measure how relevant a particular page is for a given search query. You can influence the ranking for a set of keywords by including them on your page and within the meta tags on your page.

    Search engines rely on HTML tags called meta tags to enable sites such as Google, Facebook, and Twitter to easily find certain information not visible to normal web users. Webmasters are not required to insert these tags at all; however, doing so will not only help users on search engines and social media find information but will also increase your search rankings.

    You can see these tags by right-clicking any page in your browser and clicking view source. As an example, let us get the source from Quandl.com; you may not yet be familiar with this website, but the information in the meta tags (meta property="og:description" and meta name="twitter:description") tells you that it is a website for datasets in the financial domain (see Figure 1-4).

    [Figure 1-4: Meta tags]

    It’s pretty easy to create a crawler to scrape your own website’s pages and see how effective your on-page optimization is, so that search engines can find all the information and index it on their servers. Alternatively, it’s also a good idea to scrape your competitors’ pages and see what kind of text they have put in their meta tags. There are countless third-party providers offering a freemium audit report on your on-page optimization, such as https://seositecheckup.com, https://sitechecker.pro, and www.woorank.com.
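Pulling those meta tags out of a page takes only a few lines with Beautiful Soup. A minimal sketch follows; the URL is just an example, and the tag names mirror the og: and twitter: properties shown in Figure 1-4.

```python
import requests
from bs4 import BeautifulSoup

url = "https://www.quandl.com"  # any page you are allowed to fetch
resp = requests.get(url, headers={"User-Agent": "seo-audit-demo/0.1"}, timeout=30)
soup = BeautifulSoup(resp.text, "lxml")

title = soup.title.get_text(strip=True) if soup.title else None
tags = {
    "description": soup.find("meta", attrs={"name": "description"}),
    "og:description": soup.find("meta", attrs={"property": "og:description"}),
    "twitter:description": soup.find("meta", attrs={"name": "twitter:description"}),
}

print("title:", title)
for label, tag in tags.items():
    content = tag["content"] if tag and tag.has_attr("content") else None
    print(f"{label}: {content}")
```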

    Trust and authority

    Obtaining a high relevance score for a given search query is important, but it is not the only factor determining your SERP rankings. The other factor in determining the quality of your site is how many other high-quality pages link to your site’s pages (backlinks). The classic algorithm used at Google is called PageRank, and even though there are now a lot of other factors that go into determining SERP rankings, one of the best ways to rank higher is to get backlinks from other high-quality pages; you will hear a lot of SEO firms call this the link juice, which in simple terms means the benefit passed on to a site by a hyperlink.

    In the early days of SEO, people used to try black hat techniques of manipulating these rankings by leaving a lot of spam links to their website in comment boxes, forums, and other user-generated content on high-quality websites. This rampant gaming of the system was mitigated by something known as a nofollow backlink, which basically meant that a webmaster could mark certain outgoing links as nofollow so that no link juice would pass from the high-quality site to yours. Nowadays, all outgoing hyperlinks on popular user-generated content sites like Wikipedia are marked with nofollow, and thankfully this has stopped the spam deluge of the 2000s. We show an example in Figure 1-5 of an external nofollow hyperlink on the Wikipedia page on PageRank; don’t worry about all the HTML tags, just focus on the nofollow for now.

    [Figure 1-5: Nofollow HTML links]
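Checking whether outgoing links pass link juice is just a matter of reading each link's rel attribute. Here is a minimal sketch over the same Wikipedia page (the live markup may of course change over time).

```python
import requests
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/PageRank"
resp = requests.get(url, headers={"User-Agent": "backlink-demo/0.1"}, timeout=30)
soup = BeautifulSoup(resp.text, "lxml")

# External links are those pointing off-site; rel="nofollow" marks the ones
# that do not pass link juice.
external_links = [
    a for a in soup.find_all("a", href=True) if a["href"].startswith("http")
]
nofollow = [a for a in external_links if "nofollow" in (a.get("rel") or [])]

print(f"external links: {len(external_links)}, nofollow: {len(nofollow)}")
```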

    Building backlinks is a constant process, because if you aren’t ahead of your competitors, you can start losing your SERP ranking. Alternatively, if you know your competitors’ backlinks, then you can target those websites by writing compelling content and see if you can steal some of the backlinks to boost your SERP rankings. Indeed, all of the strategies I mention here are followed by top SEO agencies every day for their clients.

    Not all backlinks are gold. If your site gets a disproportionate amount of backlinks from low-quality sites or spam farms (or link farms as they are also known), your site will also be considered spammy, and search engines will penalize you by dropping your ranking on the SERP. There are some black hat SEOs out there that rapidly take down the rankings of their competitors’ sites by using this strategy. Thankfully, you can mitigate the damage if you identify this in time and disavow those backlinks through Google Search Console.

    Until now, I think I have made the case for why it’s useful to know your site’s backlinks and why people will be willing to pay if you can give them a database where they can simply enter either their site’s URL or their competitors’ and get all the backlinks.

    Unfortunately, the only way to get all the backlinks is by crawling large portions of the Internet, just like search engines do, and that’s cost prohibitive for most businesses or SEO agencies to do themselves. However, there are a handful of companies such as Ahrefs and Moz that operate in this area. The database size for Ahrefs is about 10 PB (= 10,000 TB) according to their information page (https://ahrefs.com/big-data); the storage cost alone for this on Amazon Web Services (AWS) S3 would come out to over $200,000/month, so it’s no surprise that subscribing to this database is pricey, with the cheapest licenses starting at hundreds of dollars a month.
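That storage figure is easy to sanity-check with back-of-the-envelope arithmetic, assuming roughly $0.023 per GB-month for S3 Standard (actual AWS pricing varies by region and storage tier).

```python
# Rough S3 Standard price per GB-month; actual pricing varies by region and tier.
price_per_gb_month = 0.023

petabytes = 10
gigabytes = petabytes * 1024 * 1024        # 10 PB ≈ 10,485,760 GB

monthly_cost = gigabytes * price_per_gb_month
print(f"~${monthly_cost:,.0f} per month")  # on the order of $240,000
```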

    There is a free trial to the backlinks database which can be accessed here (https://ahrefs.com/backlink-checker); let us run an analysis on apress.com.

    We see that Apress has over 1,500,000 pages linking back to it from about 9,500 domains, and the majority of these backlinks are dofollow links that pass on the link juice to Apress. The other metric of interest is the domain rating (DR), which normalizes a given website’s backlink performance on a 1–100 scale; the higher the DR score, the more link juice passed from the target site with each backlink. If you look at Figure 1-6, the top backlink is from www.oracle.com, with a DR of 92. This indicates that the page is of the highest quality, and getting such a top backlink helped Apress’s own DR immensely, which drove traffic to its pages and increased its SERP rankings.

    [Figure 1-6: Ahrefs screenshot]

    Estimating traffic to a site

    Every website owner can install analytics tools such as Google Analytics and find out what kind of traffic their site gets, but you can also estimate traffic by getting a domain ranking based on backlinks and performing some clever algorithmic tricks. This is indeed what Alexa does, and apart from offering backlink and keyword research ideas, they also give pretty accurate site traffic estimates for almost all websites. Their service is pretty pricey too, with individual licenses starting at $149/month, but the underlying value of their data makes this price tag reasonable for a lot of folks. Let us query Alexa for apress.com and see what kind of information it has collected for it (see Figure 1-7).

    [Figure 1-7: Alexa screenshot]

    Their web-crawled database also provides a list of similar sites by audience overlap which seems pretty accurate since it mentions manning.com (another tech publisher) with a strong overlap score (see Figure 1-8).

    [Figure 1-8: Alexa screenshot]

    It also provides data on the number of backlinks from different domain names and the percentage of traffic received via search engines. One thing to note is that the number of backlinks reported by Alexa is 1,600 (see Figure 1-9), whereas the Ahrefs database mentioned about 9,000. Such discrepancies are common among different providers, and they simply reflect the completeness of the web crawls each of these companies is undertaking. If you have a paid subscription to them, then you can get the entire list and check for omissions yourself.

    [Figure 1-9: Alexa screenshot showing the number of backlinks]

    Vertical search engines for recruitment, real estate, and travel

    Websites such as indeed.com, Expedia, and Kayak all run web scrapers/crawlers to gather data focusing on a specific segment of online content, which they process further to extract more relevant information, such as company name, city, state, and job title in the case of indeed.com, that can be used for filtering through the search results. The same is true of all search engines where web scraping is at the core of the product; the only differentiation between them is the segment they operate in and the algorithms they use to process the HTML content and extract the fields that power the search filters.

    Brand, competitor, and price monitoring

    Web scraping is used by companies to monitor prices of various products on ecommerce sites as well as customer reviews, social media posts, and news articles for not just their own brands but also for their competitors. This data helps companies understand how effective their current marketing funnel has been and also lets them get ahead of any negative reviews before they cause a noticeable impact on sales. There are far too many examples in this category, but Jungle Scout, AMZAlert, AMZFinder, camelcamelcamel, and Keepa all serve a segment of this market.

    Social listening, public relations (PR) tools, and media contacts database

    Businesses are very interested in what their existing and potential customers are saying about them on social media websites such as Twitter, Facebook, and Reddit as well as personal blogs and niche web forums for specialized products. This data helps businesses understand how effective their current marketing funnel has been and also lets them get ahead of any negative reviews before they cause a noticeable impact on sales. Small businesses can usually get away with manually searching through these sites; however, that becomes pretty difficult for businesses with thousands of products on ecommerce sites. In such cases, they use professional tools such as Mention, Hootsuite, and Specrom, which can allow them to do bulk monitoring. Almost all of these get some fraction of data through web crawling.

    In a slightly different use case, businesses also want to guide their PR efforts by querying for contact details of a small number of relevant journalists and influencers who have a good following and readership in a particular niche. The raw database remains the same as previously discussed, but in this case, the content is segmented by topics such as apparel, fashion accessories, electronics, restaurants, and so on, and the results are combined with a contacts database. A user should be able to query something like "find email addresses and phone numbers for the top ten journalists/influencers active in the food, beverage, and restaurant market in the Pittsburgh, PA area." There are too many products out there to list, but some of them include Muck Rack, Specrom, Meltwater, and Cision.

    Historical news databases

    There is a huge demand out there for searching historical news articles by keyword and returning news titles, content bodies, author names, and so on in bulk to be used for competitor, regulatory, and brand monitoring. Google News allows a user to do this to some extent, but it still doesn’t quite meet the needs of this market. Aylien, Specrom Analytics, and Bing News all provide an API to programmatically access news databases, which index 10,000–30,000 sources in all major languages in near real time, with archives going back at least five years. For some use cases, consumers want these APIs coupled to an alert system where they get automatically notified when a certain keyword is found in the news, and in those cases these products do cross over into the social listening tools described earlier.

    Web technology database

    Businesses want to know about all the individual tools, plugins, and software libraries powering individual websites. Of particular interest is knowing what percentage of major sites run a particular plugin and whether that number is stable, increasing, or decreasing.

    Once you know this, there are many ways to benefit from it. For example, if you are selling a web plugin, you can identify your competitors and their market penetration and use their customers as potential leads for your business.

    All of the data I mentioned here can be aggregated by web crawling through millions of websites and aggregating the data in headers and response by a plugin type or displaying all
