One: Scrape the Initial/Infant HTML by Code
using System.IO;
using System.Net;
using System.Text;
using System.Web.Mvc;

namespace Scraper.Controllers
{
    public class HomeController : Controller
    {
        public ActionResult Index()
        {
            string url = "https://twitter.com/";
            HttpWebRequest request = (HttpWebRequest) WebRequest.Create(url);
            string scrapping;
            // Decode as UTF-8 (assumed here); ASCII would turn every non-ASCII
            // character into "?", and decoding raw chunks can split characters.
            using (HttpWebResponse response = (HttpWebResponse) request.GetResponse())
            using (Stream stream = response.GetResponseStream())
            using (StreamReader reader = new StreamReader(stream, Encoding.UTF8))
            {
                scrapping = reader.ReadToEnd();
            }
            // Cast to object so MVC treats the string as the model, not as a view name.
            return View((object) scrapping);
        }
    }
}
Here the "scrapping" variable ends up holding the initial HTML that https://twitter.com/ serves up when one simply visits the page. The C# above is a spruced-up version of something I've had in my notes for a while. This raises the questions: "What if I want to log in at the Twitter site and make some content appear which only appears after the page loads by way of AJAX?" and "What if I want to scrape after that?" Well, I'm getting to that next.
Two: Scrape the Matured HTML by Firebug
- Get Firefox and install it.
- Install the Firebug plugin for Firefox.
- Restart Firefox and then visit https://twitter.com/.
- Log in.
- Scroll down on the page you land on after logging in, forcing new HTML content for older tweets to load by way of AJAX.
- Click on the Firebug icon at the upper right of Firefox. It will open a pane for Firebug.
- Hover over the icon that looks like a rectangle with a pointer over it, second from the left at the upper left of the Firebug pane. The tooltip "Click an element in the page to inspect." should appear. Click this icon.
- Move the mouse about the browser window. Try to highlight the div holding all of the tweets and then click on it.
- The appropriate line of code will be highlighted in the Firebug pane. Right-click on it and pick "Copy innerHTML."
- Paste into Notepad!
Three: Scrape the Matured HTML by Code
PhantomJS should be the key to the best of both worlds above. Have I used it yet? No, I haven't. :(
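For what it's worth, here is a minimal, untested sketch of how the C# side of that might look. It assumes a phantomjs executable is on the PATH and that a script (the hypothetical scrape.js) opens the URL it is given, waits for the AJAX content, and prints the page's HTML to standard output. The names MaturedScraper, BuildArguments, and Run are made up for illustration.

```csharp
using System.Diagnostics;

public static class MaturedScraper
{
    // Quotes the script path and URL so spaces survive the command line.
    public static string BuildArguments(string scriptPath, string url)
    {
        return "\"" + scriptPath + "\" \"" + url + "\"";
    }

    // Runs PhantomJS and returns whatever the script writes to standard output.
    public static string Run(string scriptPath, string url)
    {
        ProcessStartInfo info = new ProcessStartInfo
        {
            FileName = "phantomjs", // assumption: phantomjs is on the PATH
            Arguments = BuildArguments(scriptPath, url),
            RedirectStandardOutput = true,
            UseShellExecute = false, // required when redirecting output
            CreateNoWindow = true
        };
        using (Process process = Process.Start(info))
        {
            // Read before waiting so a full output buffer cannot stall the child.
            string html = process.StandardOutput.ReadToEnd();
            process.WaitForExit();
            return html;
        }
    }
}
```

Something like string scrapping = MaturedScraper.Run("scrape.js", "https://twitter.com/"); could then stand in for the HttpWebRequest code in part One, though logging in would still have to be handled inside the script.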