Four Programming Languages Creating a Complete Website Scraper Application
About this ebook
After finishing these pages you will have a complete application that works on either the console or the desktop platform. You will use three languages, C#, VB.Net and Java, to create this application. Each chapter covers a single language and either the desktop or console application coded in that language (the Java chapter covers only the desktop version). For automating the console program, we will use an Excel sheet and VBA code. The desktop application allows more flexibility in web page processing, with entry fields for beginning and ending text along with DIVs and other processing options. Enjoy this learning experience.
This list includes some of the types/commands and the languages that use them:
WebResponse, WebRequest, HttpWebRequest, StreamReader (C#/VB)
GetResponse, Regex.Replace, String.Replace, IndexOf (C#/VB)
Substring, ReadLine, Trim, WriteLine (C#/VB)
EndsWith, AddRange, ReadToEnd, Count (C#/VB)
GetCommandLineArgs, GetResponseStream (VB)
getText, endsWith, split, length, openConnection (Java)
toString, BufferedReader, getSelectedIndex, replaceAll (Java)
isEmpty, substring, indexOf, readLine, PrintWriter, write (Java)
ActiveCell, Value, ChDir, Shell, Activate (VBA)
Why would you want to work with the same program in multiple languages? A simple answer is "versatility." You may come across a need for Java where a .Net-based language just won't work. A perfect example is Windows versus Linux web hosting. If you have designed a .Net program and placed it on a Windows-hosted site, it will work beautifully. If you then change the hosting plan to Linux, the .Net program will not work without some tweaking or a compatibility runtime such as Mono. Had it been written in Java, however, it would have moved over fine.
Why would you want a web site text extraction program? Well, if you only need to capture the main text from a few web pages, this program would be more trouble than it is worth. If you are migrating a web site designed in ASP.NET into another format, maybe a CMS, this approach can be quite useful. If you have 1,000 pages in the site and all are similarly structured, it may take a week for a single person to manually copy and paste the body text from these pages. Using the automated approach, with a pause between pages for accuracy, approximately 700 pages per hour can be processed. That equates to a tremendous labor savings.
Stephen J Link
Stephen J. Link is a “computer guy” by profession, an author by hobby, and a Layman in the study of God’s Word. He has a computer support book entitled “Link Em Up On Outlook” that was published in 2004 as a paperback. He also has over 125 articles covering various topics published on independent sites. Some of those articles were originally published by the Free Will Baptist Press in Ayden, NC as high school age Sunday school lessons.
Four Programming Languages Creating a Complete Website Scraper Application
Written by Stephen J. Link
Smashwords Edition 2014
copyright © 2014 Stephen J Link
All Rights Reserved. The code created for this EBook is to be used for your learning experience. It is not to be repackaged and sold without modification and customization from the coding style and format displayed in these pages.
I would like to give thanks to my God. Without Him, life would be unbearable. Thanks also to my wife who has supported me through the research and study time required to complete this book.
Table of Contents
Introducing Your Author
About This Book
The C# Command Line Project
The C# Desktop Project
The VB.Net Command Line Project
The VB.Net Desktop Project
The Java Desktop Project
Excel VBA Automation
Other Resources of Interest
Other Items from LinkEmUp Publishing
Download the Source Code
Introducing Your Author
Stephen J. Link is a "computer guy" by profession, an author by hobby, and a Layman in the study of God’s Word. He has a computer support book entitled "Link Em Up On Outlook" that was published in 2004 as a paperback (renamed to "Power Outlook" in reprint). He also has over 125 articles covering various topics published on his own blog and independent sites. Various books have been published covering a number of topics. As a programmer, he has a unique approach to help you master the ability to create code that automates processes and adds efficiency for your client or employer. Along this journey, you will have an opportunity to "dabble" in four different languages: C#, VB.Net, Java and Visual Basic for Applications.
Why use the word "dabble"? Because there is no way, in a single program, to squeeze in all of the power that any programming language offers. Yes, you will see four languages and two platforms, but they all create the same functioning program. As with all software, web sites and anything else you may develop, it is never totally complete. As you are working through the programs, or after finishing them, you are likely to see improvements that could have been made. That is the beauty of computer programming: the only limits are imposed by your imagination and amount of funding.
About This Book
After finishing these pages you will have a complete application that works on either the console or the desktop platform. You will use three languages, C#, VB.Net and Java, to create this application. Each chapter covers a single language and either the desktop or console application coded in that language. For automating the console program, we will use an Excel sheet and VBA code. The desktop application allows more flexibility in web page processing, with entry fields for beginning and ending text along with DIVs and other processing options. Enjoy this learning experience.
Let's discuss the makeup of a web page. What is displayed in your chosen web browser results from the interpretation of HTML code, CSS formatting, and scripting languages. All of this can be viewed using the "view source" option in your browser. Because of this, it is also possible to "scrape" the desired body text from the page. Since the "body" tag of the page may contain menus and pictures, we start capturing after a specific text string and stop before another entered string. You will also be able to exclude specific DIVs within the text being captured.
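The start-string and stop-string idea can be sketched in a few lines of C#. This is a minimal illustration rather than the book's own code: the class name MarkerExtractor, the helper names ExtractBetween and RemoveDivById, and the sample markers are all assumptions made for this example. The DIV removal uses a simple regular expression that only handles DIVs with no other DIV nested inside them.

```csharp
using System;
using System.Text.RegularExpressions;

public static class MarkerExtractor
{
    // Returns the text found between startMarker and endMarker,
    // or an empty string when either marker is missing.
    public static string ExtractBetween(string html, string startMarker, string endMarker)
    {
        int start = html.IndexOf(startMarker, StringComparison.Ordinal);
        if (start < 0) return "";
        start += startMarker.Length;   // begin just after the start marker

        int end = html.IndexOf(endMarker, start, StringComparison.Ordinal);
        if (end < 0) return "";

        return html.Substring(start, end - start).Trim();
    }

    // Removes a DIV with the given id, including its contents.
    // Assumption: the DIV contains no nested DIVs.
    public static string RemoveDivById(string html, string id)
    {
        string pattern = "<div[^>]*\\bid=\"" + Regex.Escape(id) + "\"[^>]*>.*?</div>";
        return Regex.Replace(html, pattern, "",
            RegexOptions.Singleline | RegexOptions.IgnoreCase);
    }

    public static void Main()
    {
        // Hypothetical page content and markers, for illustration only.
        string page = "<body><ul>menu</ul><!--start-->Hello, reader.<!--stop--><p>footer</p></body>";
        Console.WriteLine(ExtractBetween(page, "<!--start-->", "<!--stop-->"));
        // prints: Hello, reader.
    }
}
```

Returning an empty string when a marker is missing is one possible design choice; a real run over hundreds of pages might prefer to log the page name so that unprocessed pages can be reviewed by hand.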
Why would you want such a program? Well, if you only need to capture the main text from a few web pages, this program would be more trouble than it is worth. If you are migrating a web site designed in ASP.NET into another format, maybe a CMS, this approach can be quite useful. If you have 1,000 pages in the site and all are similarly structured, it may take a week for a single person to manually copy and paste the body text from these pages. Using the automated approach, with a pause between pages for accuracy, approximately 700 pages per hour can be processed. That equates to a tremendous labor savings.
This book does not include a console application written in Java. Java itself can run console programs, but the Java project here was designed as a desktop application using the open-source NetBeans IDE. Older versions of NetBeans "rigged" a Java Swing UI and called it a console application, an approach that has been dropped in newer releases of the software. If desired, you can design and use the console applications in either C# or VB.Net.
Throughout this book, we will take little "side trips", which will be noted by the Side Trip label. These are not directly tied to the code and design, but they can provide a useful learning experience.
What's next? That depends on your approach.
If you want to learn a specific language, turn to that chapter and start learning and practicing your new skills.
If you want to learn all three languages along with both the console and desktop platforms, start with Chapter 1 and work your way through the entire book.
Chapter 1
The C# command line project
The code shown below is used for the command line version of the screen scraper program. The resulting console EXE file can be automated with a batch file, an Excel file or any other inventive method a programmer may come up with. It can also be executed manually by typing each page name at the command prompt. We will start with the first snippet of code.
// These are the required using statements which provide external functionality
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Net;
using System.IO;
using System.Text.RegularExpressions;
The code can be typed into your C# command-line project (named web_scraper_console, in this example) if desired. Let's go through this code and explain the processing thoroughly. Starting at the top we have the "using" statements. As the code comment says, these are required to provide access to some of the external functions used in the program. Next, you will see the namespace, class and Main function. All of these are created as the skeleton program when you create your project in Visual Studio.
namespace Web_Scraper_Console_
{
class Program
{
// Although the input arguments can be multiple, we are only processing one
static void Main(string[] args)
{
One thing that may require a little explanation is the "string[] args" parameter of the Main function. Using this approach with a console program allows it to accept an array of parameters from the command line, such as "web_scraper_console index.php page1.aspx page2.html". This command feeds in an array of three pages to be processed, and you could modify the displayed code to use that functionality if desired. The program, as designed, only processes a single page; we will address a separate automation approach using Excel and VBA in a later chapter.
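For readers who want to try the multi-page variation mentioned above, here is a rough sketch of looping over every entry in args. It is not the program built in this chapter: the class name MultiPageArgs, the helper BuildUrls and the base address http://www.example.com/ are placeholders invented for this example.

```csharp
using System;
using System.Collections.Generic;

public static class MultiPageArgs
{
    // Turns each command-line argument into a full URL under baseUrl.
    public static List<string> BuildUrls(string baseUrl, string[] pages)
    {
        var urls = new List<string>();
        foreach (string page in pages)
        {
            urls.Add(baseUrl + page);
        }
        return urls;
    }

    public static void Main(string[] args)
    {
        if (args.Length == 0)
        {
            Console.WriteLine("Usage: web_scraper_console page1 [page2 ...]");
            return;
        }

        // Placeholder base address; substitute the site being scraped.
        foreach (string url in BuildUrls("http://www.example.com/", args))
        {
            Console.WriteLine("Would process: " + url);
        }
    }
}
```

Each page name keeps whatever extension was typed on the command line, so index.php, page1.aspx and page2.html can be mixed in a single run.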
try
{
We use a try/catch block to catch any errors, and then move into the "meat" of the program.
string url = "http://www.linkemup.us/" + args[0] + ".aspx";
string strResult = "";
string progresult = "";
string outresult = "";
int charloc = 0;
int endloc = 0;
string delstr = "";
The first command in the program reads the only parameter allowed, args[0], and uses it to create the "url" string, which will contain the full Uniform Resource Locator of the web page to be processed. In this case, we are using my web site, www.linkemup.us, and automatically appending ".aspx" to the page name. A more functional approach is to allow the page extension to be entered on the command line, which allows the most flexibility. This code, as displayed, will not process completely because my web site is not based on ASP.Net pages. Also note that my web site is moving to a more descriptive domain name, www.ncwebdesignprogramming.com, so the other site will probably not be available after October of 2014.
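One way to allow the extension on the command line, while still falling back to ".aspx" when none is typed, is sketched below. This is only an illustration under assumptions: the class name UrlBuilder and the helper BuildUrl do not appear in the book's code, and Path.HasExtension is used here simply as a convenient check for a file extension.

```csharp
using System;
using System.IO;

public static class UrlBuilder
{
    // Builds the full URL for a page name. If the page name already carries
    // an extension (for example "page2.html") it is used as typed; otherwise
    // defaultExtension is appended.
    public static string BuildUrl(string baseUrl, string page, string defaultExtension)
    {
        if (!Path.HasExtension(page))
        {
            page += defaultExtension;
        }
        return baseUrl + page;
    }

    public static void Main(string[] args)
    {
        // Falls back to a sample page name when no argument is supplied.
        string page = args.Length > 0 ? args[0] : "index";
        Console.WriteLine(BuildUrl("http://www.linkemup.us/", page, ".aspx"));
    }
}
```

With this approach, typing "index" still produces the .aspx address used in the chapter, while "index.php" is passed through unchanged.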
Side Trip: Let's examine the general functionality of your favorite web browser, whether Internet Explorer, Firefox, Chrome or one of the many others available. When you type a web site URL into the address line, the browser creates a web request, which is sent to the target web server. A web response is also created to receive the reply from the web server. Your browser then processes the response and displays the page that you asked for.
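The same request and response exchange can be reproduced in code with the WebRequest, WebResponse and StreamReader types listed at the front of this book. The sketch below is a general illustration and may differ from the code the chapter goes on to build: the class name PageFetcher, the helper FetchPage and the fallback address http://www.example.com/ are assumptions. Note that newer .NET releases mark WebRequest as obsolete in favor of HttpClient, although it is still available.

```csharp
using System;
using System.IO;
using System.Net;

public static class PageFetcher
{
    // Sends a request to the given URL and returns the response body as text.
    public static string FetchPage(string url)
    {
        WebRequest request = WebRequest.Create(url);

        // The using blocks close the response, stream and reader when done.
        using (WebResponse response = request.GetResponse())
        using (Stream stream = response.GetResponseStream())
        using (StreamReader reader = new StreamReader(stream))
        {
            return reader.ReadToEnd();
        }
    }

    public static void Main(string[] args)
    {
        // Placeholder address used when no URL is given on the command line.
        string url = args.Length > 0 ? args[0] : "http://www.example.com/";
        try
        {
            string html = FetchPage(url);
            Console.WriteLine("Received " + html.Length + " characters");
        }
        catch (WebException ex)
        {
            Console.WriteLine("Request failed: " + ex.Message);
        }
    }
}
```

The text returned by FetchPage is the same markup you would see with the browser's "view source" option, which is what the scraper goes on to trim down.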
The next few lines set up some "processor" string variables and a couple of "location"