This tutorial demonstrates how to use the New York Times Article Search API with Python. From the API's documentation:
With the Article Search API, you can search New York Times articles from Sept. 18, 1851 to today, retrieving headlines, abstracts, lead paragraphs, links to associated multimedia and other article metadata.
The API will not return the full text of articles, but it does return helpful metadata such as subject terms, abstracts, and dates, as well as URLs, which one could conceivably use to scrape the full text of articles.
To begin, you first need to obtain an API key from the New York Times, which is fast and easy to do. See here for more information.
You also need to install the nytimesarticle package, which is a Python wrapper for the New York Times Article Search API. This allows you to query the API through Python.
To get started, let's fire up our favorite Python environment (I'm a big fan of ipython notebook):
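```python
# A minimal setup sketch: import the wrapper and initialize it with
# the API key you obtained above ('YOUR_API_KEY' is a placeholder).
from nytimesarticle import articleAPI

api = articleAPI('YOUR_API_KEY')
```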
Now we can use the search function with our desired search parameters/values:
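```python
# A sketch: search for 'Obama', filtered by headline and source,
# with results no earlier than Dec. 31, 2011 (parameters explained below).
articles = api.search(q='Obama',
                      fq={'headline': 'Obama',
                          'source': ['Reuters', 'AP', 'The New York Times']},
                      begin_date=20111231)
```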
The q (for query) parameter searches the article's body, headline, and byline for a particular term. In this case, we are looking for the search term 'Obama'. The fq (for filter query) parameter filters search results by various dimensions. For instance, 'headline':'Obama' will filter search results to those with 'Obama' in the headline. 'source':['Reuters','The New York Times'] will filter by source (Reuters, The New York Times, and AP are available through the API). The begin_date parameter (in YYYYMMDD format) limits the date range of the search.
As you can see, we can specify multiple filters by using a Python dictionary, and multiple values by using a list:
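```python
# Multiple filter fields in one dictionary; multiple values as a list.
fq = {'headline': 'Obama', 'source': ['Reuters', 'AP', 'The New York Times']}
articles = api.search(q='Obama', fq=fq)
```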
There are many other parameters and filters we can use to refine our search. Get a full list here.
The search function returns a dictionary of the first 10 results. To get the next 10, we have to use the page parameter. page = 2 returns the second 10 results, page = 3 the third 10 and so on.
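For instance (a sketch, reusing the query from above):

```python
# Fetch the second page of results (articles 11-20).
articles = api.search(q='Obama', page=2)
```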
If you run the code, you'll see that the returned dictionary is pretty messy. What we'd really like to have is a list of dictionaries, with each dictionary representing an article and each key-value pair representing a field of metadata from that article (e.g. headline, date, etc.). We can do this with a custom function (the sketch below assumes the API's standard JSON response layout, with articles nested under 'response' and 'docs'; adjust the field names if your results differ):
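```python
def parse_articles(articles):
    '''
    Take a raw API response and return a list of dictionaries,
    one per article, keeping only the metadata fields we care about.
    '''
    news = []
    for doc in articles['response']['docs']:
        article = {}
        article['headline'] = doc['headline']['main']
        article['date'] = doc['pub_date'][0:10]  # drop the time of day
        article['source'] = doc['source']
        article['abstract'] = doc['abstract']
        article['word_count'] = doc['word_count']
        article['url'] = doc['web_url']
        # Keywords come typed; pull out subjects and locations separately.
        article['subjects'] = [k['value'] for k in doc['keywords']
                               if k['name'] == 'subject']
        article['locations'] = [k['value'] for k in doc['keywords']
                                if k['name'] == 'glocations']
        news.append(article)
    return news
```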
I’ve only included the fields that I find most relevant, but you can easily add any field that I missed.
Now that we have a function to parse results into a clean list, we can easily write another function that collects all articles for a search query in a given year. In this example, I want to find all the articles in Reuters, AP, and The New York Times with the search query ‘Amnesty International’:
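```python
import time

def get_articles(year, query):
    '''
    A sketch: given a year as a string (e.g. '1980') and a search
    query, return a list of parsed article dictionaries for that
    year. Assumes the parse_articles() helper above and the API's
    cap of 100 pages of results per query.
    '''
    all_articles = []
    for page in range(0, 100):
        articles = api.search(q=query,
                              fq={'source': ['Reuters', 'AP', 'The New York Times']},
                              begin_date=year + '0101',
                              end_date=year + '1231',
                              page=page)
        if not articles['response']['docs']:
            break  # no more results for this year
        all_articles += parse_articles(articles)
        time.sleep(1)  # stay politely under the API's rate limit
    return all_articles
```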
This function takes a year and a search query as input, and returns a list of all articles fitting those parameters, parsed into a clean list of dictionaries. With this, we can scale up and loop over as many years as we want:
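```python
# Collect all matching articles from 1980 through 2014 (an illustrative range).
Amnesty_all = []
for year in range(1980, 2015):
    print('Processing ' + str(year) + '...')
    Amnesty_all += get_articles(str(year), 'Amnesty International')
```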
Now we have an object called Amnesty_all containing a dictionary for each article, each with fields like headline, date, locations, subjects, abstract, word count, URL, etc.
Pretty neat! We can then export the dataset into a CSV (with each row as an article, and columns for metadata) and analyze it to explore interesting questions.
To export to a CSV, I like to use the csv module:
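```python
import csv

# A sketch: write one row per article, with a column for each
# metadata field (assumes every dictionary has the same keys).
keys = Amnesty_all[0].keys()
with open('amnesty_mentions.csv', 'w', newline='') as output_file:
    writer = csv.DictWriter(output_file, fieldnames=keys)
    writer.writeheader()
    writer.writerows(Amnesty_all)
```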
And there you have it! You just learned how to collect years' worth of articles from the New York Times, parse them, and download the resulting database as a CSV.
Rochelle Terman
Rochelle Terman received her Ph.D. in Political Science at UC Berkeley in 2016, and is now a post-doctoral fellow at Stanford University. She studies international norms, gender, and identity using computational and data-intensive methods. At the D-Lab, she gives training on Python, R, Git, web scraping, computational text analysis, web development, and basic programming skills.