Advanced Web Scraping with Python: Building a Flask API for Data Retrieval



Web scraping is the process of extracting data from websites. It is an important technique for businesses and researchers who need to gather data from multiple sources quickly and efficiently. In this blog, we will explore how to perform advanced web scraping and build a Flask API to serve the scraped data.

Python script to perform the web scraping

The first step in our process is to create a Python script to perform the web scraping. We will be using the requests and lxml modules to download and parse the HTML of the website, respectively. We will also be using the sqlite3 module to store the scraped data in a SQLite database. Let's take a look at the code:

First, install requests and lxml using pip:

!pip install requests
!pip install lxml

First, we define a function called scrape that takes a URL as an argument. This function uses the requests module to download the HTML of the website and the lxml module to parse it.

We then use XPath to select all the <div> elements on the page that have a class of "quote". For each of these elements, we extract the text of the quote and the name of the author.

import requests
from lxml import html

def scrape(link):
    # Download the page and parse the HTML into an element tree
    r = requests.get(link)
    response = html.fromstring(r.content)
    # Select every quote container on the page
    containers = response.xpath('//div[@class="quote"]')
    for container in containers:
        quote = container.xpath('.//span[@class="text"]/text()')
        quote = quote[0] if quote else ''
        author = container.xpath('.//small[@class="author"]/text()')
        author = author[0] if author else ''
        print(quote, author)

link = 'http://quotes.toscrape.com/'
scrape(link)
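One optional hardening step, not part of the original script: requests accepts a timeout argument, and raise_for_status() turns HTTP error codes into exceptions, so the download step can fail fast instead of hanging or silently parsing an error page. A minimal self-contained sketch:

import requests
from lxml import html

# Hardened download step: fail fast on slow servers and HTTP error codes
link = 'http://quotes.toscrape.com/'
r = requests.get(link, timeout=10)   # give up if the server does not respond within 10 seconds
r.raise_for_status()                 # raise an exception for 4xx/5xx responses
response = html.fromstring(r.content)
print(len(response.xpath('//div[@class="quote"]')), 'quote blocks found')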

Let's establish a connection to a SQLite database

Now that we are able to scrape the data, let's establish a connection to a SQLite database using the sqlite3 module. We then create a table called flaskapi_table to store the scraped data.

import sqlite3

# Open (or create) the database file and create the table if it does not exist
connection = sqlite3.connect('db.sqlite3')
cursor = connection.cursor()
cursor.execute('''CREATE TABLE IF NOT EXISTS flaskapi_table
                  (id INTEGER PRIMARY KEY AUTOINCREMENT, quotes TEXT, author TEXT)''')
connection.commit()
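To confirm that the table was actually created, you can query SQLite's built-in sqlite_master catalog. This quick check is a side sketch, not part of the original script:

import sqlite3

connection = sqlite3.connect('db.sqlite3')
cursor = connection.cursor()
# sqlite_master lists every table defined in the database file
cursor.execute("SELECT name FROM sqlite_master WHERE type='table'")
print(cursor.fetchall())   # should include ('flaskapi_table',) once the table exists
connection.close()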

Joining scraping and database integration

In this step we will combine the scraping and the database integration to get a better picture of the whole process and how it works.

# scraper.py
import requests
from lxml import html
import sqlite3

# Create the database connection and the table that will hold the scraped quotes
connection = sqlite3.connect('db.sqlite3')
cursor = connection.cursor()
cursor.execute('''CREATE TABLE IF NOT EXISTS flaskapi_table
                  (id INTEGER PRIMARY KEY AUTOINCREMENT, quotes TEXT, author TEXT)''')
connection.commit()

def scrape(link):
    r = requests.get(link)
    response = html.fromstring(r.content)
    containers = response.xpath('//div[@class="quote"]')
    for container in containers:
        quote = container.xpath('.//span[@class="text"]/text()')
        if quote:
            # Double single quotes so the text can be embedded in the SQL string
            quote = quote[0].replace("'", "''")
        else:
            quote = ''
        author = container.xpath('.//small[@class="author"]/text()')
        if author:
            author = author[0].replace("'", "''")
        else:
            author = ''
        cursor.execute(f'''INSERT INTO flaskapi_table(quotes, author) VALUES('{quote}', '{author}')''')
        connection.commit()
        print(quote, author)

link = 'http://quotes.toscrape.com/'
scrape(link)
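The script above escapes single quotes by hand and interpolates values into the SQL with an f-string, which works for this site but is fragile in general. As an alternative sketch (the function name scrape_parameterized is made up for illustration, and it reuses the same connection and cursor as scraper.py), sqlite3's ? placeholders let the driver handle quoting:

# Sketch: same scraping loop, but with a parameterized INSERT instead of manual escaping
def scrape_parameterized(link):
    r = requests.get(link)
    response = html.fromstring(r.content)
    for container in response.xpath('//div[@class="quote"]'):
        quote = container.xpath('.//span[@class="text"]/text()')
        author = container.xpath('.//small[@class="author"]/text()')
        # The ? placeholders let sqlite3 quote the values safely
        cursor.execute(
            "INSERT INTO flaskapi_table(quotes, author) VALUES (?, ?)",
            (quote[0] if quote else '', author[0] if author else ''),
        )
    connection.commit()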

Now we can build a Flask API to serve the data

Finally, we call the scrape function with the URL of the website we want to scrape. This will extract all the quotes and authors from the website and store them in our database.

Now that we have our scraped data, we can build a Flask API to serve it. Flask is a lightweight web framework for Python that makes it easy to build RESTful APIs.
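If Flask is not installed yet, it can be added with pip just like requests and lxml:

!pip install flask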

In this script, we first import the Flask class and the jsonify function from the Flask framework, as well as the sqlite3 module.

from flask import Flask, jsonify
import sqlite3

Next, we define the app object, which is our Flask application instance. We pass __name__ to the Flask constructor, which is a convenient way for Flask to know the root directory of our application.


from flask import Flask, jsonify
import sqlite3
app = Flask(__name__)

After that, we define two routes for our application using the @app.route decorator. The first route /data will return all the data from our SQLite database. We create a connection object using sqlite3.connect() and a cursor object to execute the SQL query. The cursor.execute() method is used to execute the query "SELECT * FROM flaskapi_table" to fetch all the data from the flaskapi_table table.

The data variable then stores the results of the query, and we return the data as JSON using the jsonify() function. The jsonify() function converts the results of the query into a JSON object that can be sent as a response to an HTTP request.

from flask import Flask, jsonify
import sqlite3

app = Flask(__name__)

@app.route('/data', methods=['GET'])
def get_data():
    connection = sqlite3.connect('db.sqlite3')
    cursor = connection.cursor()
    data = cursor.execute("SELECT * FROM flaskapi_table")
    return jsonify({'data': data.fetchall()})
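For reference, fetchall() returns each row as a tuple of (id, quote, author), so the JSON response will look roughly like this (the quote text shown here is just a placeholder for whatever the scraper stored):

{
  "data": [
    [1, "First scraped quote...", "Author One"],
    [2, "Second scraped quote...", "Author Two"]
  ]
}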

The second route /data/<int:id> will return a specific quote based on the id parameter passed in the URL. The id parameter is extracted from the URL using the <int:id> syntax. This syntax tells Flask to convert the id parameter to an integer.

We then use the cursor.execute() method to execute the query "SELECT * FROM flaskapi_table WHERE id=<id>" to fetch the data for the specific quote with the given id. The id value is inserted into the query using string interpolation (an f-string).

Finally, we return the data for the specific quote as JSON using the jsonify() function.

from flask import Flask, jsonify
import sqlite3

app = Flask(__name__)

@app.route('/data', methods=['GET'])
def get_data():
    connection = sqlite3.connect('db.sqlite3')
    cursor = connection.cursor()
    data = cursor.execute("SELECT * FROM flaskapi_table")
    return jsonify({'data': data.fetchall()})

@app.route('/data/<int:id>', methods=['GET'])
def get_id(id):
    connection = sqlite3.connect('db.sqlite3')
    cursor = connection.cursor()
    data = cursor.execute(f"SELECT * FROM flaskapi_table WHERE id={id}")
    return jsonify({'data': data.fetchall()})
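Because Flask's int converter guarantees that id is an integer, the f-string above is not exploitable here, but the lookup can also be written with a ? placeholder if you prefer to keep string interpolation out of SQL entirely. A sketch of what the two query lines inside get_id() would become:

# Parameterized version of the by-id lookup
data = cursor.execute("SELECT * FROM flaskapi_table WHERE id=?", (id,))
return jsonify({'data': data.fetchall()})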

At the end of the script, we use the if __name__ == '__main__': condition to ensure that the Flask application runs only if the script is run directly, not when it is imported as a module. We set the debug parameter to True to enable debug mode, which will display detailed error messages in case an error occurs while running the application.

# flaskapi.py
from flask import Flask, jsonify
import sqlite3

app = Flask(__name__)

@app.route('/data', methods=['GET'])
def get_data():
    # Return every row stored in flaskapi_table as JSON
    connection = sqlite3.connect('db.sqlite3')
    cursor = connection.cursor()
    data = cursor.execute("SELECT * FROM flaskapi_table")
    return jsonify({'data': data.fetchall()})

@app.route('/data/<int:id>', methods=['GET'])
def get_id(id):
    # Return the single row whose primary key matches the id in the URL
    connection = sqlite3.connect('db.sqlite3')
    cursor = connection.cursor()
    data = cursor.execute(f"SELECT * FROM flaskapi_table WHERE id={id}")
    return jsonify({'data': data.fetchall()})

if __name__ == '__main__':
    app.run(debug=True)
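To try it out, run the scraper once to populate the database and then start the API. Assuming the Flask development server's defaults (127.0.0.1, port 5000), the two endpoints can be tested with curl from a second terminal; the port may differ on your setup:

python scraper.py
python flaskapi.py

# In another terminal: fetch all quotes, then a single quote by id
curl http://127.0.0.1:5000/data
curl http://127.0.0.1:5000/data/1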

Conclusion

In conclusion, the scraper.py script scrapes data from a web page and stores it in an SQLite database, while the flaskapi.py script defines a Flask API that serves the data from the SQLite database as JSON. These two scripts demonstrate how to scrape data from a website and serve it as a REST API using Flask. This approach can be useful for various web scraping applications that require data to be served via an API.