Fetching Singapore flat resale data

6 minute read

Motivation

In this exercise, I will create a script to download information from www.data.gov.sg, the Singapore government’s one-stop portal to its publicly-available datasets. Specifically, this script will fetch information about the resale of HDB flats from one of data.gov.sg’s publicly available datasets – Resale Flat Prices. The original provider of this dataset is the Housing and Development Board (HDB). This is part of my bigger project to create a machine learning product that can forecast house prices in Singapore.

The dataset contains the following features:

  1. month - the month in which the sale of the flat took place
  2. town - the town (region) in which the flat is located
  3. flat_type - the flat’s type
  4. block - the block in which the flat is located
  5. street_name - the street on which the flat is located
  6. storey_range - the storey range in which the flat unit is located
  7. floor_area_sqm - the flat’s floor area in square meters
  8. flat_model - the model of the flat.
  9. lease_commence_date - the year in which the flat’s lease commenced
  10. remaining lease - the remainig lease of the flat in years
  11. resale price - the resale price of the flat in Singapore dollars

The dataset was last updated on 9 January 2019 and covers HDB resale transactions from 1 January 2015 to 31 December 2018.

Engineering considerations

In choosing this dataset and in designing this script, I took into account several concerns that I believe are critical to the success of the project. These concerns, as well as my thought process as to how my script addresses these concerns, are elaborated below.

  • Usefulness. I believe the data fetched using this script contains several features that are useful in building machine learning models that will help Real Real Estate Agency Pte Ltd achieve its objective – to provide interested homebuyers with realistic house price forecasts.

  • Trustworthiness. Accuracy and trustworthiness of the source data are of paramount importance in data analysis. I have chosen a dataset provided by HDB, as HDB is a Singapore government agency and one can be certain of the accuracy and trustworthiness of the housing data provided by HDB.

  • Up-to-dateness. The dataset is updated monthly; as such, users can be certain that the data obtained using this script is always up to date to the latest month.

  • Ease of retrieval. In addition, the platform on which the data is hosted, www.data.gov.sg, has made available a well-documented API using which this information can be conveniently fetched.

  • Flexibility I have designed this script in a way that provides flexibility to users. This script consists of a single Python function, download_data, with which users can export the data in one of the three specified formats: json object, pandas dataframe or csv files. These are the three data formats that are among the commonly used formats in data science. In addition, I have also included some data cleaning steps prior to exporting to this information to pandas dataframe or csv files. These cleaning steps will faciliate downstream data analysis.

  • Robustness I have included in my script several assertions, which will produce unambiguous error messages when an error occurs.

# Importing dependencies
import requests
import pandas as pd
import datetime
import json
# Main function

def download_data(selection = "json", month = "2018-12", limit = 100, offset = 0):
    """
    Arguments:
        - selection: string, default "json"
              Format of the data returned by the function. User to choose from "json", "pandas" and "csv".
        - month: string with format('YYYY-MM'), default "2018-12"
              The month in which the sale of the flat took place.
        - limit: int, default 100
              The number of rows to be fetched.
        - offset: int, default 0
              The number of rows to be skipped.

    Returns:
        - Data containing HDB resale prices on a specific month between 2015 and 2018 in user's specified format.
          The data are arranged from the most recent to the oldest
    """
    assert type(selection) is str, "'selection' has to be a string"
    assert selection in ["json", "pandas", "csv"], "wrong format chosen"
    assert type(month) is str, "'month' has to be a string with the format 'YYYY-MM'"
    assert len(month) is 7, "'month' has to be a string with the format 'YYYY-MM'"
    assert month[:4] in ["2015", "2016","2017","2018"], "Wrong year; this dataset only contains data from 2015 to 2018"
    assert int(month[-2:]) in list(range(1,13)), "Wrong month; the 'month' portion of your 'month' variable should be a two-digit number between 01 and 12"
    assert type(limit) is int, "'limit' has to be an integer"
    assert type(offset) is int, "'offset' has to be an integer"
    assert limit >= 0, "'limit' has to be equal to or greater than zero"
    assert offset >= 0, "'offset' has to be equal to or greater than zero"

    # Constructing URL that takes into account user's preferences (month, limit and offset)
    url = "https://data.gov.sg/api/action/datastore_search?resource_id=1b702208-44bf-4829-b620-4615ee19b57c&limit=" + \
          str(limit) + \
          "&offset=" + \
          str(offset) + \
          "&q=%7B%22month%22%3A%20%22" + \
          month + \
          "%22%7D&sort=%22_id%22%20desc"

    # Printing disclaimer about the source and the licensing information about the data
    now = datetime.datetime.today().strftime('%d %B %Y')
    disclaimer = "This code contains information from the dataset 'Resale Flat Prices' accessed on {} from www.data.gov.sg. This information is made available under the terms of the Singapore Open Data Licence version 1.0."
    print(disclaimer.format(now))

    # Creating a connection object
    connection = requests.get(url)
    print("-----------------------------------------------------------------------------")
    if connection.status_code == 200:
        print("Status code of {} was obtained; Connection was successfully established.".format(connection.status_code))
    else:
        print("Status code of {} was obtained; Connection was NOT successfully established.".format(connection.status_code))
    print("-----------------------------------------------------------------------------")

    # Downloading the information into a json object
    data_json = connection.json()
    data_json = data_json["result"]["records"]

    # Returning the information in a format specified by the user
    # If the selection is "json", the function directly returns the json object
    if selection == "json":
        return(data_json)

    # If the selection is "pandas", the function returns a pandas dataframe.
    # Several cleaning steps are included. Some of these steps convert the `month` and `lease_commencement_date` to pandas' datetime object for easier analysis.
    # There are also steps to drop the columns that are of little use for data analysis, full count and ID.
    elif selection == "pandas":
        data_json_str = json.dumps(data_json)
        df = pd.read_json(data_json_str)
        df["month"] = pd.to_datetime(df["month"], format = "%Y-%m")
        df["lease_commence_date"] = pd.to_datetime(df["lease_commence_date"], format = "%Y")
        df.drop(["_full_count", "_id"], axis = 1, inplace = True)
        return df

    # If the selection is "csv", the function returns a csv file, 'result.csv'. The aforementioned cleaning steps are included.
    elif selection == "csv":
        data_json_str = json.dumps(data_json)
        df = pd.read_json(data_json_str)
        df["month"] = pd.to_datetime(df["month"], format = "%Y-%m")
        df["lease_commence_date"] = pd.to_datetime(df["lease_commence_date"], format = "%Y")
        df.drop(["_full_count", "_id"], axis = 1, inplace = True)
        print("Data has been downloaded into the file 'result.csv'")
        return df.to_csv("result.csv")


Demonstration

Below, we will see how the script works in action. I will use the script to create a pandas dataframe containing 200 HDB flat resale transactions that took place in March 2016.

df = download_data(selection = "pandas", month = "2016-03", limit = 200, offset = 0)

print("\n ---dataframe preview---\n")
display(df.head(3))

print("\n ---dataframe shape---\n")
display(df.shape)

print("\n ---dataframe information---\n")
display(df.info())
This code contains information from the dataset 'Resale Flat Prices' accessed on 20 January 2019 from www.data.gov.sg. This information is made available under the terms of the Singapore Open Data Licence version 1.0.
-----------------------------------------------------------------------------
Status code of 200 was obtained; Connection was successfully established.
-----------------------------------------------------------------------------

 ---dataframe preview---
block flat_model flat_type floor_area_sqm lease_commence_date month rank month remaining_lease resale_price storey_range street_name town
0 605 Multi Generation MULTI-GENERATION 163 1988-01-01 2016-03-01 0.090167 70 688000 01 TO 03 YISHUN ST 61 YISHUN
1 724 Apartment EXECUTIVE 145 1986-01-01 2016-03-01 0.090167 69 650000 10 TO 12 YISHUN ST 71 YISHUN
2 877 Maisonette EXECUTIVE 145 1987-01-01 2016-03-01 0.090167 70 648000 04 TO 06 YISHUN ST 81 YISHUN
 ---dataframe shape---




(200, 12)



 ---dataframe information---

<class 'pandas.core.frame.DataFrame'>
Int64Index: 200 entries, 0 to 199
Data columns (total 12 columns):
block                  200 non-null object
flat_model             200 non-null object
flat_type              200 non-null object
floor_area_sqm         200 non-null int64
lease_commence_date    200 non-null datetime64[ns]
month                  200 non-null datetime64[ns]
rank month             200 non-null float64
remaining_lease        200 non-null int64
resale_price           200 non-null int64
storey_range           200 non-null object
street_name            200 non-null object
town                   200 non-null object
dtypes: datetime64[ns](2), float64(1), int64(3), object(6)
memory usage: 20.3+ KB



None

This script can be conveniently used in a broader data science project. Below, I have provided an example of a downstream exploratory step that can be undertaken after fetching this data. In this step, I create a pairplot that showcases relationships among the various features of the data.

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

sns.pairplot(df)
<seaborn.axisgrid.PairGrid at 0x107ffa550>

png

Leave a comment