Four Programming Languages Creating a Complete Website Scraper Application
Ebook · 145 pages · 1 hour


About this ebook

After finishing these pages you will have a complete application that works as either a console or a desktop program. You will be using three languages - C#, VB.Net and Java - to create this application. Each chapter covers a single language and either the desktop or console application coded in that language (the Java project is provided as a desktop application only). For console program automation, we will use an Excel sheet and VBA code. The desktop application allows more flexibility in web page processing, with entry fields for beginning and ending text along with DIVs and other processing options. Enjoy this learning experience.
This list includes some of the types/commands used in the book and the languages that use them:

WebResponse, WebRequest, HttpWebRequest, StreamReader (C#/VB)
GetResponse, Regex.Replace, String.Replace, IndexOf (C#/VB)
Substring, ReadLine, Trim, WriteLine (C#/VB)
EndsWith, AddRange, ReadToEnd, Count (C#/VB)
GetCommandLineArgs, GetResponseStream (VB)
getText, endsWith, split, length, openConnection (Java)
toString, BufferedReader, getSelectedIndex, replaceAll (Java)
isEmpty, substring, indexOf, readLine, PrintWriter, write (Java)
ActiveCell, Value, ChDir, Shell, Activate (VBA)

Why would you want to work with the same program in multiple languages? A simple answer is "versatility." You may come across a need for Java where a .Net-based language just won't work. A perfect example is Windows versus Linux web hosting. If you have designed a .Net program and deployed it on a Windows-based host, it will work beautifully. If you then change the hosting plan to Linux, the .Net program will not work without some tweaking or an interpreter. If the same program were written in Java, however, it would move over fine.
Why would you want a web site text extraction program? If you only need to capture the main text from a few web pages, a program like this would be more trouble than it is worth. If, however, you are migrating a web site designed in ASP.NET into another format, maybe a CMS, this approach can be quite useful. If the site has 1,000 similarly structured pages, it may take a week for a single person to manually copy and paste the body text from these pages. Using the automated approach, with a pause between each page for accuracy purposes, approximately 700 pages per hour can be processed - a tremendous labor savings.

Language: English
Release date: Sep 6, 2014
ISBN: 9781311735225
Author

Stephen J Link

Stephen J. Link is a “computer guy” by profession, an author by hobby, and a layman in the study of God’s Word. He has a computer support book entitled “Link Em Up On Outlook” that was published in 2004 as a paperback. He also has over 125 articles covering various topics published on independent sites. Some of those articles were originally published by the Free Will Baptist Press in Ayden, NC as high school age Sunday school lessons.


    Book preview

    Four Programming Languages Creating a Complete Website Scraper Application - Stephen J Link

    Four Programming Languages Creating a Complete Website Scraper Application

    Written by Stephen J. Link

    Smashwords Edition 2014

    copyright © 2014 Stephen J Link

    All Rights Reserved. The code created for this ebook is to be used for your learning experience. It is not to be repackaged and sold without modification and customization of the coding style and format displayed in these pages.

    I would like to give thanks to my God. Without Him, life would be unbearable. Thanks also to my wife who has supported me through the research and study time required to complete this book.

    Table of Contents

    Introducing Your Author

    About This Book

    The C# Command Line Project

    The C# Desktop Project

    The VB.Net Command Line Project

    The VB.Net Desktop Project

    The Java Desktop Project

    Excel VBA Automation

    Other Resources of Interest

    Other Items from LinkEmUp Publishing

    Download the Source Code

    Introducing Your Author

    Stephen J. Link is a “computer guy” by profession, an author by hobby, and a layman in the study of God’s Word. He has a computer support book entitled “Link Em Up On Outlook” that was published in 2004 as a paperback (renamed to “Power Outlook” in reprint). He also has over 125 articles covering various topics published on his own blog and independent sites, and has published various books on a number of topics. As a programmer, he has a unique approach to helping you master the ability to create code that automates processes and adds efficiency for your clients or employer. Along this journey, you will have an opportunity to dabble in four different languages - C#, VB.Net, Java and Visual Basic for Applications.

    Why use the word dabble? Because there is no way, in a single program, to showcase all of the power available in any one of these programming languages. Yes, you will see four languages and two platforms, but they all create the same functioning program. As with all software, web sites, and anything else you may develop, no program is ever totally complete. As you work through the programs, or after finishing them, you are likely to see improvements that could have been made. That is the beauty of computer programming - the only limits are imposed by your imagination and amount of funding.

    About This Book

    After finishing these pages you will have a complete application that works as either a console or a desktop program. You will be using three languages - C#, VB.Net and Java - to create this application. Each chapter covers a single language and either the desktop or console application coded in that language. For console program automation, we will use an Excel sheet and VBA code. The desktop application allows more flexibility in web page processing, with entry fields for beginning and ending text along with DIVs and other processing options. Enjoy this learning experience.

    Let's discuss the makeup of a web page. What is displayed in your chosen web browser results from the interpretation of HTML markup, CSS formatting, and script code. All of this can be viewed using the view source option in your browser. Because of this, it is also possible to scrape the desired body text from the page. Since the body tag of the page may contain menus and pictures, we start after a specific text string and stop prior to an entered ending string. You will also be able to exclude specific DIVs within the text being captured.
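The marker-based idea above can be sketched in a few lines of Java, one of the book's four languages. The marker strings and the sample HTML below are hypothetical placeholders for illustration, not the book's actual code:

```java
// Keep only the text between a begin marker and an end marker, then strip
// an unwanted DIV with replaceAll. Markers and sample page are made up.
public class MarkerScrape {
    static String between(String html, String start, String end) {
        int s = html.indexOf(start);
        if (s < 0) return "";                 // begin marker missing
        s += start.length();
        int e = html.indexOf(end, s);
        if (e < 0) return "";                 // end marker missing
        return html.substring(s, e).trim();
    }

    public static void main(String[] args) {
        String page = "<body><!--BEGIN-->Hello, <div class=\"ad\">skip me</div>world!<!--END--></body>";
        String text = between(page, "<!--BEGIN-->", "<!--END-->");
        // Drop the unwanted DIV block (non-greedy match within one div)
        text = text.replaceAll("<div[^>]*>.*?</div>", "");
        System.out.println(text);   // Hello, world!
    }
}
```

The non-greedy `.*?` keeps the match from swallowing everything up to the last closing div, which matters when a page has several DIVs to exclude.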

    Why would you want such a program? If you only need to capture the main text from a few web pages, this program would be more trouble than it is worth. If, however, you are migrating a web site designed in ASP.NET into another format, maybe a CMS, this approach can be quite useful. If the site has 1,000 similarly structured pages, it may take a week for a single person to manually copy and paste the body text from these pages. Using the automated approach, with a pause between each page for accuracy purposes, approximately 700 pages per hour can be processed - a tremendous labor savings.

    Because of the cross-platform design of Java programming, you will not be receiving a console application written in Java code. The Java project was designed using the open-source NetBeans IDE. Older versions of NetBeans rigged a Java Swing UI and called it a console application; this approach has been dropped in newer releases of the software. If desired, you can design and use the console applications in either C# or VB.Net.

    Throughout this book, we will take little side trips which will be noted by the Side Trip label. These do not have a direct link to the code and design, although they can provide a useful learning experience.

    What's next? That depends on your approach:

    If you want to learn a specific language, turn to that chapter and start learning and practicing your new skills.

    If you want to learn all three languages coupled with the console and desktop programming platforms, start with chapter 1 and work your way through the entire book.

    Chapter 1

    The C# command line project

    The code shown below is used for the command line version of the screen scraper program. The DOS-based EXE file can be automated with a batch file, Excel file or many other inventive methods that programmers may come up with. It could also be executed manually by typing each page into the command line prompt. We will start with showing you the first snippet of code.

    // These are the required using statements which provide external functionality

    using System;

    using System.Collections.Generic;

    using System.Linq;

    using System.Text;

    using System.Net;

    using System.IO;

    using System.Text.RegularExpressions;

    The code can be typed into your C# command-line project (named web_scraper_console, in this example) if desired. Let's go through this code and explain the processing thoroughly. Starting at the top we have the using statements. As the code comment says, these are required to provide access to some of the external functions used in the program. Next, you will see the namespace, class and Main function. All of these will be created as the skeleton program when creating your project in Visual Studio.

    namespace Web_Scraper_Console_

    {

    class Program

    {

    // Although the input arguments can be multiple, we are only processing one

    static void Main(string[] args)

    {

    One thing that may require a little explanation is the string[] args which is a parameter in the Main function. Using this approach with a console program allows it to accept an array of parameters from the command line, such as web_scraper_console index.php page1.aspx page2.html. This command will feed in an array of three pages to be processed, and you could modify the displayed code to utilize that functionality if desired. The program, as designed, only processes a single page; we will address a separate automation approach using Excel and VBA in a later chapter.
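As a rough illustration of that multi-page idea, here is how the argument loop might look in Java. The URL pattern mirrors the book's example site; the loop itself is an illustrative addition, not the book's code:

```java
// Treat every command-line entry as a page to process, not only args[0].
public class MultiPage {
    // Build the page URL the same way the book's C# snippet does
    static String urlFor(String page) {
        return "http://www.linkemup.us/" + page + ".aspx";
    }

    public static void main(String[] args) {
        // e.g. java MultiPage index page1 page2
        for (String page : args) {
            // The real program would fetch and scrape each page here
            System.out.println(urlFor(page));
        }
    }
}
```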

    try

    {

    We use a try/catch block to catch any errors, and then move into the meat of the program.
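The same guard pattern can be sketched in Java: any failure while handling the URL lands in the catch block instead of crashing the program. The inputs below are made-up examples, not from the book:

```java
// Wrap URL handling in try/catch so a bad input produces a message, not a crash.
public class CatchDemo {
    static String safeHost(String url) {
        try {
            return new java.net.URL(url).getHost();
        } catch (java.net.MalformedURLException e) {
            // The catch block reports the problem and lets the program continue
            return "error: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(safeHost("http://www.linkemup.us/index.aspx"));
        System.out.println(safeHost("not a url"));
    }
}
```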

    string url = "http://www.linkemup.us/" + args[0] + ".aspx";

    string strResult = "";

    string progresult = "";

    string outresult = "";

    int charloc = 0;

    int endloc = 0;

    string delstr = "";

    The first command in the program reads the only parameter allowed, args[0], and uses it to create the url string, which holds the full Uniform Resource Locator of the web page to be processed. In this case, we are using my web site - www.linkemup.us - and automatically appending .aspx onto the page name. A more functional approach is to allow the page extension to be entered on the command line, which allows the most flexibility. This code, as displayed, will not process completely because my web site is not based on ASP.Net pages. Also note that my web site is moving to a more descriptive domain name - www.ncwebdesignprogramming.com - so the old site will probably not be available after October of 2014.
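One possible form of that more flexible approach, sketched in Java with a hypothetical helper: keep any extension the user typed, and fall back to .aspx only when none was given.

```java
// Accept "index.php" or plain "home" on the command line; only the bare
// name gets the .aspx default appended. Illustrative helper, not the book's code.
public class UrlBuilder {
    static String buildUrl(String page) {
        String name = page.contains(".") ? page : page + ".aspx";
        return "http://www.linkemup.us/" + name;
    }

    public static void main(String[] args) {
        System.out.println(buildUrl("index.php"));  // extension kept as typed
        System.out.println(buildUrl("home"));       // .aspx appended by default
    }
}
```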

    Side Trip: Let's examine the general functionality of your favorite web browser - Internet Explorer, Firefox, Chrome or one of the many others available. When you type a web site URL into the address line, the browser creates a web request which is sent to the target web server. A web response is also created to receive the reply from the web server. Your browser then processes the response and displays the page you asked for.
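The request/response round trip described in the Side Trip can be sketched in Java using openConnection and a BufferedReader, both of which appear in the book's Java command list. The fetch helper is illustrative and is not invoked below, so the sketch runs without network access:

```java
import java.io.*;
import java.net.*;

// openConnection plays the browser's part of sending the request; a
// BufferedReader collects the response body line by line.
public class BrowserRoundTrip {
    // Gather every line from a reader into one string
    static String readAll(BufferedReader in) {
        StringBuilder sb = new StringBuilder();
        String line;
        try {
            while ((line = in.readLine()) != null) sb.append(line).append('\n');
        } catch (IOException e) {
            // return whatever was read before the failure
        }
        return sb.toString();
    }

    // The request/response pair; not called in main to avoid a network dependency
    static String fetch(String address) throws IOException {
        URLConnection conn = new URL(address).openConnection();   // the "web request"
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream()))) {  // the "web response"
            return readAll(in);
        }
    }

    public static void main(String[] args) {
        // Demonstrate readAll on an in-memory reader instead of a live page
        BufferedReader demo = new BufferedReader(new StringReader("<html>\n</html>"));
        System.out.println(readAll(demo));
    }
}
```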

    The next few lines set up some processing string variables and a couple of location counters.
