Data Wrangling Project

29 minute read

Created with: Git, Github, Atom, Anaconda, Jupyter Notebook and Python programming Language

What is Data Wrangling?

Data Wrangling is the process of converting and mapping data from its raw form to another format with the purpose of making it more valuable and appropriate for advance tasks such as Data Analytics and Machine Learning. The goals of data wrangling:

  • Reveal a “deeper intelligence” within your data, by gathering data from multiple sources
  • Provide accurate, actionable data in the hands of business analysts in a timely matter
  • Reduce the time spent collecting and organizing unruly data before it can be utilized
  • Enable data scientists and analysts to focus on the analysis of data, rather than the wrangling
  • Drive better decision-making skills by senior leaders in an organization
  • Data Scientist’s use most of their time (up to 80%) in Data Wrangling to prepare data for Machine learning.

Introduction

The purpose of this project was to clean a messy dataset for a marketing department at an ecommerce company. The marketing company analyzes the data in detail to provide business decisions. My task was to clean the data for them so that they would have a reliable dataset to analyze. The final deliverable to the marketing group was a clean dataset in tabular format that they could easily work with.

I worked with various data formats, including JSON and XML. To perform the data cleaning tasks I utilized Python, along with useful data cleaning libraries such as Pandas. All of my work was neatly documented and organized using Jupyter Notebooks.

Data Overview

In this project I worked with a JSON dataset and an XML dataset. The JSON dataset contained customer information with various fields. These fields included the following:

  • customer_id
  • date
  • purchase
  • category
  • amount
  • related_items
  • frequently_bought_together
  • city
  • state
  • zip_code
  • lat_lon

The JSON dataset with the customer information is the messy dataset that needs to be cleaned.

An XML file with location data was provided by a different part of the data cleaning team. This file contains the following fields:

  • City
  • Zipcode
  • Latitude_Longitude

The other members of the data cleaning team have assured me that this dataset is accurate. They also said that the “city”, “state”, and “lat_lon” fields are accurate in the JSON file, but the “zip_code” field is not. The “lat_lon” data in the JSON file will be cross-referenced with the location data in the XML file to correct innacurate zipcode data.

Data Cleaning Plan

To clean the data, I used a three step data cleaning plan that consisted of the following:

  1. Importing the data
  2. Cleaning the data
  3. Reshaping the data

To clean the data, I dealt with the following seven aspects of data wrangling:

  1. Consistent Data
  2. Missing Values
  3. Constant Strings
  4. Correct Types
  5. Importig XML file
  6. Pivot Table
  7. Accurate Data

Importing the Data

The first step was to import the data. The customer information data was in JSON format. I utilized the JSON library to take a look at the data.

import json

with open("customer_data_example.json") as f_in:
    data = json.load(f_in)

print(json.dumps(data, indent=2))
[
  {
    "customer_id": 100191,
    "date": "1-Jan-14",
    "purchase": "soap",
    "category": "household",
    "amount": "24.64",
    "related_items": "towels",
    "frequently_bought_together": "towels",
    "city": "Chicago",
    "state": "IL",
    "zip_code": 60605,
    "lat_lon": "41.86,-87.619"
  },
  {
    "customer_id": 100199,
    "date": "2-Jan-14",
    "purchase": "shorts",
    "category": "clothing",
    "amount": "35",
    "related_items": "belts",
    "frequently_bought_together": "sandals",
    "city": "Dallas",
    "state": "TX",
    "zip_code": 75089,
    "lat_lon": "32.924,-96.547"
  },
  {
    "customer_id": 100170,
    "date": "3-Jan-14",
    "purchase": "lawn_mower",
    "category": "outdoor",
    "amount": "89.72",
    "related_items": "shovels",
    "frequently_bought_together": "lawn bags",
    "city": "Philadelphia",
    "state": "PA",
    "zip_code": 19019,
    "lat_lon": "40.002,-75.118"
  },
  {
    "customer_id": 100124,
    "date": "4-Jan-14",
    "purchase": "laptop",
    "category": "electronics",
    "amount": "51.32",
    "related_items": "headphones",
    "frequently_bought_together": "headphones",
    "city": "Chicago",
    "state": "IL",
    "zip_code": 60603,
    "lat_lon": "41.88,-87.63"
  },
  {
    "customer_id": 100173,
    "date": "5-Jan-14",
    "purchase": "car wash",
    "category": "outdoor",
    "amount": "81.75",
    "related_items": "sponge",
    "frequently_bought_together": "sponge",
    "city": "Philadelphia",
    "state": "PA",
    "zip_code": 19102,
    "lat_lon": "39.953,-75.166"
  },
  {
    "customer_id": 100116,
    "date": "6-Jan-14",
    "purchase": "lawn mower",
    "category": "outdoor",
    "amount": "29.16",
    "related_items": "rakes",
    "frequently_bought_together": "fertilizer",
    "city": "San Diego",
    "state": "CA",
    "zip_code": 92027,
    "lat_lon": "33.143,-117.03"
  },
  {
    "customer_id": 100105,
    "date": "7-Jan-14",
    "purchase": "grill",
    "category": "outdoor",
    "amount": "50.71",
    "related_items": "grill cleaner",
    "frequently_bought_together": "bbq sauce",
    "city": "Dallas",
    "state": "TX",
    "zip_code": 75126,
    "lat_lon": "32.745,-96.46"
  },
  {
    "customer_id": 100148,
    "date": "8-Jan-14",
    "purchase": "household cleaner",
    "category": "household",
    "amount": "35.03",
    "related_items": "spray bottles",
    "frequently_bought_together": "spray bottles",
    "city": "San Antonio",
    "state": "TX",
    "zip_code": 78109,
    "lat_lon": "29.502,-98.306"
  },
  {
    "customer_id": 100118,
    "date": "9-Jan-14",
    "purchase": "slow cooker",
    "category": "appliances",
    "amount": "30.55",
    "related_items": "tupperware",
    "frequently_bought_together": "pot holders",
    "city": "Philadelphia",
    "state": "PA",
    "zip_code": 19102,
    "lat_lon": "39.953,-75.166"
  },
  {
    "customer_id": 100106,
    "date": "10-Jan-14",
    "purchase": "camera",
    "category": "electronics",
    "amount": "92.01",
    "related_items": "lens cleaner",
    "frequently_bought_together": "camera lens",
    "city": "Dallas",
    "state": "TX",
    "zip_code": 75126,
    "lat_lon": "32.917,-96.973"
  },
  {
    "customer_id": 100109,
    "date": "11-Jan-14",
    "purchase": "snow shovel",
    "category": "outdoor",
    "amount": "31.79",
    "related_items": "boots",
    "frequently_bought_together": "gloves",
    "city": "New York City",
    "state": "NY",
    "zip_code": 10004,
    "lat_lon": "40.699,-74.041"
  },
  {
    "customer_id": 100153,
    "date": "12-Jan-14",
    "purchase": "shoes",
    "category": "clothing",
    "amount": "96.87",
    "related_items": "dress shoes",
    "frequently_bought_together": "dress shoes",
    "city": "Chicago",
    "state": "IL",
    "zip_code": 60605,
    "lat_lon": "41.86,-87.619"
  },
  {
    "customer_id": 100116,
    "date": "13-Jan-14",
    "purchase": "grill",
    "category": "outdoor",
    "amount": "41.91",
    "related_items": "propane tank",
    "frequently_bought_together": "bbq sauce",
    "city": "San Diego",
    "state": "CA",
    "zip_code": 92027,
    "lat_lon": "33.143,-117.03"
  },
  {
    "customer_id": 100151,
    "date": "14-Jan-14",
    "purchase": "blender",
    "category": "appliances",
    "amount": "61.99",
    "related_items": "pitcher",
    "frequently_bought_together": "fruit",
    "city": "San Antonio",
    "state": "TX",
    "zip_code": 78214,
    "lat_lon": "29.363,-98.49"
  },
  {
    "customer_id": 100191,
    "date": "15-Jan-14",
    "purchase": "shirts",
    "category": "clothing",
    "amount": "36.11",
    "related_items": "t-shirt",
    "frequently_bought_together": "t shirt",
    "city": "Chicago",
    "state": "IL",
    "zip_code": 60605,
    "lat_lon": "41.86,-87.619"
  },
  {
    "customer_id": 100185,
    "date": "16-Jan-14",
    "purchase": "toaster",
    "category": "appliances",
    "amount": "75",
    "related_items": "cream cheese",
    "frequently_bought_together": "bagels",
    "city": "San Antonio",
    "state": "TX",
    "zip_code": 78214,
    "lat_lon": "29.363,-98.49"
  },
  {
    "customer_id": 100153,
    "date": "17-Jan-14",
    "purchase": "laptop",
    "category": "electronics",
    "amount": "87.08",
    "related_items": "charger",
    "frequently_bought_together": "headphones",
    "city": "Chicago",
    "state": "IL",
    "zip_code": 60605,
    "lat_lon": "41.86,-87.619"
  },
  {
    "customer_id": 100151,
    "date": "18-Jan-14",
    "purchase": "detergent",
    "category": "household",
    "amount": "41.34",
    "related_items": "fabric softener",
    "frequently_bought_together": "fabric softener",
    "city": "San Antonio",
    "state": "TX",
    "zip_code": 78214,
    "lat_lon": "29.363,-98.49"
  },
  {
    "customer_id": 100196,
    "date": "19-Jan-14",
    "purchase": "shoes",
    "category": "clothing",
    "amount": "27.37",
    "related_items": "dress shoes",
    "frequently_bought_together": "shoe laces",
    "city": "New York City",
    "state": "NY",
    "zip_code": 10004,
    "lat_lon": "40.699,-74.041"
  },
  {
    "customer_id": 100124,
    "date": "20-Jan-14",
    "purchase": "tv",
    "category": "electronics",
    "amount": "38.61",
    "related_items": "surround sound",
    "frequently_bought_together": "surround sound",
    "city": "Chicago",
    "state": "IL",
    "zip_code": 60603,
    "lat_lon": "41.88,-87.63"
  },
  {
    "customer_id": 100188,
    "date": "21-Jan-14",
    "purchase": "paper products",
    "category": "household",
    "amount": "53.8",
    "related_items": "toilet paper",
    "frequently_bought_together": "paper towels",
    "city": "New York City",
    "state": "NY",
    "zip_code": 10002,
    "lat_lon": "40.717,-73.987"
  },
  {
    "customer_id": 100192,
    "date": "22-Jan-14",
    "purchase": "toaster",
    "category": "appliances",
    "amount": "85",
    "related_items": "bread",
    "frequently_bought_together": "butter",
    "city": "San Jose",
    "state": "CA",
    "zip_code": 94560,
    "lat_lon": "37.536,-122.034"
  },
  {
    "customer_id": 100140,
    "date": "23-Jan-14",
    "purchase": "tools",
    "category": "house",
    "amount": "85.92",
    "related_items": "screws",
    "frequently_bought_together": "batteries",
    "city": "SD",
    "state": "CA",
    "zip_code": 92037,
    "lat_lon": "32.839,-117.262"
  },
  {
    "customer_id": 100159,
    "date": "24-Jan-14",
    "purchase": "shoes",
    "category": "clothing",
    "amount": "91.98",
    "related_items": "dress shoes",
    "frequently_bought_together": "sneakers",
    "city": "San Jose",
    "state": "CA",
    "zip_code": 95115,
    "lat_lon": "37.189,-121.705"
  },
  {
    "customer_id": 100182,
    "date": "25-Jan-14",
    "purchase": "slow cooker",
    "category": "appliances",
    "amount": "64.86",
    "related_items": "tongs",
    "frequently_bought_together": "cookbook",
    "city": "San Antonio",
    "state": "TX",
    "zip_code": 78006,
    "lat_lon": "29.852,-98.729"
  },
  {
    "customer_id": 100158,
    "date": "26-Jan-14",
    "purchase": "lawn mower",
    "category": "outdoor",
    "amount": "86.05",
    "related_items": "gardening gloves",
    "frequently_bought_together": "rakes",
    "city": "Hou",
    "state": "TX",
    "zip_code": 77005,
    "lat_lon": "29.718,-95.428"
  },
  {
    "customer_id": 100185,
    "date": "27-Jan-14",
    "purchase": "pants",
    "category": "clothing",
    "amount": "78.81",
    "related_items": "wallet",
    "frequently_bought_together": "wallet",
    "city": "San Antonio",
    "state": "TX",
    "zip_code": 78214,
    "lat_lon": "29.363,-98.49"
  },
  {
    "customer_id": 100103,
    "date": "28-Jan-14",
    "purchase": "audio",
    "category": "electronics",
    "amount": "78.61",
    "related_items": "headphones",
    "frequently_bought_together": "headphones",
    "city": "Houston",
    "state": "TX",
    "zip_code": 77012,
    "lat_lon": "29.72,-95.279"
  },
  {
    "customer_id": 100150,
    "date": "29-Jan-14",
    "purchase": "shorts",
    "category": "clothing",
    "amount": "36.21",
    "related_items": "t-shirts",
    "frequently_bought_together": "sunglasses",
    "city": "Philadelphia",
    "state": "PA",
    "zip_code": 19019,
    "lat_lon": "40.002,-75.118"
  },
  {
    "customer_id": 100183,
    "date": "30-Jan-14",
    "purchase": "shirts",
    "category": "clothing",
    "amount": "66.56",
    "related_items": "shortsleeve",
    "frequently_bought_together": "collared shirt",
    "city": "Phoenix",
    "state": "AZ",
    "zip_code": 85004,
    "lat_lon": "33.451,-112.071"
  },
  {
    "customer_id": 100111,
    "date": "31-Jan-14",
    "purchase": "snow shovel",
    "category": "outdoor",
    "amount": "77.28",
    "related_items": "mittens",
    "frequently_bought_together": "gloves",
    "city": "Phoenix",
    "state": "AZ",
    "zip_code": 85019,
    "lat_lon": "33.512,-112.142"
  },
  {
    "customer_id": 100160,
    "date": "1-Feb-14",
    "purchase": "tv",
    "category": "electronics",
    "amount": "89.36",
    "related_items": "speakers",
    "frequently_bought_together": "surround sound",
    "city": "San Jose",
    "state": "CA",
    "zip_code": 94560,
    "lat_lon": "37.536,-122.034"
  },
  {
    "customer_id": 100133,
    "date": "2-Feb-14",
    "purchase": "shirts",
    "category": "clothing",
    "amount": "23.69",
    "related_items": "collared shirt",
    "frequently_bought_together": "dress shirt",
    "city": "Los Angeles",
    "state": "CA",
    "zip_code": 90020,
    "lat_lon": "34.066,-118.309"
  },
  {
    "customer_id": 100158,
    "date": "3-Feb-14",
    "purchase": "snow shovel",
    "category": "outdoor",
    "amount": "85.69",
    "related_items": "mittens",
    "frequently_bought_together": "sand",
    "city": "Houston",
    "state": "TX",
    "zip_code": 77005,
    "lat_lon": "29.718,-95.428"
  },
  {
    "customer_id": 100193,
    "date": "4-Feb-14",
    "purchase": "microwave",
    "category": "appliances",
    "amount": "97.59",
    "related_items": "popcorn",
    "frequently_bought_together": "egg cooker",
    "city": "San Jose",
    "state": "CA",
    "zip_code": 95115,
    "lat_lon": "37.189,-121.705"
  },
  {
    "customer_id": 100167,
    "date": "5-Feb-14",
    "purchase": "food processor",
    "category": "appliances",
    "amount": "32.89",
    "related_items": "vegetable peeler",
    "frequently_bought_together": "vegetable peeler",
    "city": "New York City",
    "state": "NY",
    "zip_code": 10004,
    "lat_lon": "40.699,-74.041"
  },
  {
    "customer_id": 100102,
    "date": "6-Feb-14",
    "purchase": "soap",
    "category": "house",
    "amount": "70.66",
    "related_items": "shampoo",
    "frequently_bought_together": "loofah",
    "city": "Houston",
    "state": "TX",
    "zip_code": 77009,
    "lat_lon": "29.793,-95.367"
  },
  {
    "customer_id": 100167,
    "date": "7-Feb-14",
    "purchase": "soap",
    "category": "house",
    "amount": "76.69",
    "related_items": "body wash",
    "frequently_bought_together": "loofah",
    "city": "New York City",
    "state": "NY",
    "zip_code": 10004,
    "lat_lon": "40.699,-74.041"
  },
  {
    "customer_id": 100185,
    "date": "8-Feb-14",
    "purchase": "camera",
    "category": "electronics",
    "amount": "54.64",
    "related_items": "editing software",
    "frequently_bought_together": "camera case",
    "city": "San Antonio",
    "state": "TX",
    "zip_code": 78214,
    "lat_lon": "29.363,-98.49"
  },
  {
    "customer_id": 100140,
    "date": "9-Feb-14",
    "purchase": "food processor",
    "category": "appliances",
    "amount": "85.23",
    "related_items": "vegetable peeler",
    "frequently_bought_together": "vegetables",
    "city": "San Diego",
    "state": "CA",
    "zip_code": 92037,
    "lat_lon": "32.839,-117.262"
  },
  {
    "customer_id": 100162,
    "date": "10-Feb-14",
    "purchase": "food processor",
    "category": "appliances",
    "amount": "70.18",
    "related_items": "vegetables",
    "frequently_bought_together": "tupperware",
    "city": "San Jose",
    "state": "CA",
    "zip_code": 94560,
    "lat_lon": "37.536,-122.034"
  },
  {
    "customer_id": 100123,
    "date": "11-Feb-14",
    "purchase": "slow cooker",
    "category": "appliances",
    "amount": "34.57",
    "related_items": "cookbook",
    "frequently_bought_together": "serving spoon",
    "city": "Los Angeles",
    "state": "CA",
    "zip_code": 90020,
    "lat_lon": "34.066,-118.309"
  },
  {
    "customer_id": 100186,
    "date": "12-Feb-14",
    "purchase": "slow cooker",
    "category": "appliances",
    "amount": "63.54",
    "related_items": "tongs",
    "frequently_bought_together": "tupperware",
    "city": "Houston",
    "state": "TX",
    "zip_code": 77009,
    "lat_lon": "29.793,-95.367"
  },
  {
    "customer_id": 100198,
    "date": "13-Feb-14",
    "purchase": "pants",
    "category": "clothing",
    "amount": "76.43",
    "related_items": "khakis ",
    "frequently_bought_together": "wallet",
    "city": "San Antonio",
    "state": "TX",
    "zip_code": 78206,
    "lat_lon": "29.438,-98.462"
  },
  {
    "customer_id": 100188,
    "date": "14-Feb-14",
    "purchase": "snow shovel",
    "category": "outdoor",
    "amount": "95.85",
    "related_items": "mittens",
    "frequently_bought_together": "sand",
    "city": "New York City",
    "state": "NY",
    "zip_code": 10002,
    "lat_lon": "40.717,-73.987"
  },
  {
    "customer_id": 100136,
    "date": "15-Feb-14",
    "purchase": "pants",
    "category": "clothing",
    "amount": "79.43",
    "related_items": "khakis ",
    "frequently_bought_together": "wallet",
    "city": "Chicago",
    "state": "IL",
    "zip_code": 60602,
    "lat_lon": "41.883,-87.629"
  },
  {
    "customer_id": 100162,
    "date": "16-Feb-14",
    "purchase": "shirts",
    "category": "clothing",
    "amount": "76.08",
    "related_items": "t-shirt",
    "frequently_bought_together": "dress shirt",
    "city": "San Jose",
    "state": "CA",
    "zip_code": 94560,
    "lat_lon": "37.536,-122.034"
  },
  {
    "customer_id": 100191,
    "date": "17-Feb-14",
    "purchase": "shirts",
    "category": "clothing",
    "amount": "57.33",
    "related_items": "shortsleeve",
    "frequently_bought_together": "button down shirt",
    "city": "Chicago",
    "state": "IL",
    "zip_code": 60605,
    "lat_lon": "41.86,-87.619"
  },
  {
    "customer_id": 100120,
    "date": "18-Feb-14",
    "purchase": "jackets",
    "category": "clothing",
    "amount": "86.29",
    "related_items": "scarfs",
    "frequently_bought_together": "winter gloves",
    "city": "Dallas",
    "state": "TX",
    "zip_code": 75001,
    "lat_lon": "32.961,-96.838"
  },
  {
    "customer_id": 100106,
    "date": "19-Feb-14",
    "purchase": "cell phone",
    "category": "electronics",
    "amount": "91.87",
    "related_items": "screen cleaner",
    "frequently_bought_together": "cell phone case",
    "city": "Dallas",
    "state": "TX",
    "zip_code": 75063,
    "lat_lon": "32.917,-96.973"
  },
  {
    "customer_id": 100192,
    "date": "20-Feb-14",
    "purchase": "pants",
    "category": "clothing",
    "amount": "78.11",
    "related_items": "wallet",
    "frequently_bought_together": "wallet",
    "city": "San Jose",
    "state": "CA",
    "zip_code": 94560,
    "lat_lon": "37.536,-122.034"
  },
  {
    "customer_id": 100107,
    "date": "21-Feb-14",
    "purchase": "paper products",
    "category": "household",
    "amount": "90.84",
    "related_items": "paper plates",
    "frequently_bought_together": "paper towels",
    "city": "San Diego",
    "state": "CA",
    "zip_code": 92102,
    "lat_lon": "32.715,-117.125"
  },
  {
    "customer_id": 100142,
    "date": "22-Feb-14",
    "purchase": "flower pot",
    "category": "outdoor",
    "amount": "59.02",
    "related_items": "plant food",
    "frequently_bought_together": "flower seeds",
    "city": "Phoenix",
    "state": "AZ",
    "zip_code": 85001,
    "lat_lon": "33.704,-112.352"
  },
  {
    "customer_id": 100131,
    "date": "23-Feb-14",
    "purchase": "grill",
    "category": "outdoor",
    "amount": "49.41",
    "related_items": "grill cleaner",
    "frequently_bought_together": "steak seasoning",
    "city": "San Antonio",
    "state": "TX",
    "zip_code": 78206,
    "lat_lon": "29.438,-98.462"
  },
  {
    "customer_id": 100115,
    "date": "24-Feb-14",
    "purchase": "pants",
    "category": "clothing",
    "amount": "99.06",
    "related_items": "slacks",
    "frequently_bought_together": "khakis ",
    "city": "New York City",
    "state": "NY",
    "zip_code": 10013,
    "lat_lon": "40.721,-74.005"
  },
  {
    "customer_id": 100194,
    "date": "25-Feb-14",
    "purchase": "shirts",
    "category": "clothing",
    "amount": "77.83",
    "related_items": "button down shirt",
    "frequently_bought_together": "dress shirt",
    "city": "Philadelphia",
    "state": "PA",
    "zip_code": 19115,
    "lat_lon": "40.093,-75.041"
  },
  {
    "customer_id": 100133,
    "date": "26-Feb-14",
    "purchase": "car wash",
    "category": "outdoor",
    "amount": "49.57",
    "related_items": "wax",
    "frequently_bought_together": "tire cleaner",
    "city": "Los Angeles",
    "state": "CA",
    "zip_code": 90020,
    "lat_lon": "34.066,-118.309"
  },
  {
    "customer_id": 100108,
    "date": "27-Feb-14",
    "purchase": "pants",
    "category": "clothing",
    "amount": "90.84",
    "related_items": "slacks",
    "frequently_bought_together": "wallet",
    "city": "San Diego",
    "state": "CA",
    "zip_code": 91911,
    "lat_lon": "32.609,-117.061"
  },
  {
    "customer_id": 100120,
    "date": "28-Feb-14",
    "purchase": "tv",
    "category": "electronics",
    "amount": "41.18",
    "related_items": "dvd player",
    "frequently_bought_together": "video game console",
    "city": "Dallas",
    "state": "TX",
    "zip_code": 75001,
    "lat_lon": "32.961,-96.838"
  },
  {
    "customer_id": 100191,
    "date": "1-Mar-14",
    "purchase": "jackets",
    "category": "clothing",
    "amount": "65.72",
    "related_items": "scarfs",
    "frequently_bought_together": "winter gloves",
    "city": "Chicago",
    "state": "IL",
    "zip_code": 60605,
    "lat_lon": "41.86,-87.619"
  },
  {
    "customer_id": 100189,
    "date": "2-Mar-14",
    "purchase": "slow cooker",
    "category": "appliances",
    "amount": "90.89",
    "related_items": "serving spoon",
    "frequently_bought_together": "tongs",
    "city": "Dallas",
    "state": "TX",
    "zip_code": 75001,
    "lat_lon": "32.961,-96.838"
  },
  {
    "customer_id": 100192,
    "date": "3-Mar-14",
    "purchase": "laptop",
    "category": "electronics",
    "amount": "46.34",
    "related_items": "charger",
    "frequently_bought_together": "wireless mouse",
    "city": "San Jose",
    "state": "CA",
    "zip_code": 94560,
    "lat_lon": "37.536,-122.034"
  },
  {
    "customer_id": 100132,
    "date": "4-Mar-14",
    "purchase": null,
    "category": "electronics",
    "amount": "97.54",
    "related_items": "subwoofer",
    "frequently_bought_together": "headphones",
    "city": "New York City",
    "state": "NY",
    "zip_code": 10013,
    "lat_lon": "40.721,-74.005"
  },
  {
    "customer_id": 100143,
    "date": "5-Mar-14",
    "purchase": null,
    "category": "clothing",
    "amount": "52.99",
    "related_items": "t-shirt",
    "frequently_bought_together": "shortsleeve",
    "city": "Phoenix",
    "state": "AZ",
    "zip_code": 85015,
    "lat_lon": "33.507,-112.103"
  },
  {
    "customer_id": 100101,
    "date": "6-Mar-14",
    "purchase": null,
    "category": "appliances",
    "amount": "51.46",
    "related_items": "mixing spoon",
    "frequently_bought_together": "vegetable peeler",
    "city": "Los Angeles",
    "state": "CA",
    "zip_code": 75001,
    "lat_lon": "34.09,-118.295"
  },
  {
    "customer_id": 100129,
    "date": "7-Mar-14",
    "purchase": null,
    "category": "clothing",
    "amount": "64.7",
    "related_items": "belts",
    "frequently_bought_together": "wallets",
    "city": "Dallas",
    "state": "TX",
    "zip_code": 75126,
    "lat_lon": "32.745,-96.46"
  },
  {
    "customer_id": 100100,
    "date": "8-Mar-14",
    "purchase": null,
    "category": "appliances",
    "amount": "98.25",
    "related_items": "bowls",
    "frequently_bought_together": "forks",
    "city": "Chicago",
    "state": "IL",
    "zip_code": 60610,
    "lat_lon": "41.899,-87.637"
  },
  {
    "customer_id": 100154,
    "date": "9-Mar-14",
    "purchase": "cell phone",
    "category": "electronics",
    "amount": "21.32",
    "related_items": "charger",
    "frequently_bought_together": "cell phone case",
    "city": "San Antonio",
    "state": "TX",
    "zip_code": 78206,
    "lat_lon": "29.438,-98.462"
  },
  {
    "customer_id": 100168,
    "date": "10-Mar-14",
    "purchase": "jackets",
    "category": "clothing",
    "amount": "58.43",
    "related_items": "winter gloves",
    "frequently_bought_together": "winter gloves",
    "city": "San Antonio",
    "state": "TX",
    "zip_code": 78109,
    "lat_lon": "29.502,-98.306"
  },
  {
    "customer_id": 100138,
    "date": "11-Mar-14",
    "purchase": "soap",
    "category": "household",
    "amount": "28.74",
    "related_items": "shampoo",
    "frequently_bought_together": "shampoo",
    "city": "San Diego",
    "state": "CA",
    "zip_code": 91911,
    "lat_lon": "32.609,-117.061"
  },
  {
    "customer_id": 100165,
    "date": "12-Mar-14",
    "purchase": "tools",
    "category": "household",
    "amount": "61.12",
    "related_items": "drills ",
    "frequently_bought_together": "drill bits",
    "city": "San Diego",
    "state": "CA",
    "zip_code": 91911,
    "lat_lon": "32.609,-117.061"
  },
  {
    "customer_id": 100111,
    "date": "13-Mar-14",
    "purchase": "slow cooker",
    "category": "appliances",
    "amount": "82.12",
    "related_items": "tupperware",
    "frequently_bought_together": "serving spoon",
    "city": "Phoenix",
    "state": "AZ",
    "zip_code": 85019,
    "lat_lon": "33.512,-112.142"
  },
  {
    "customer_id": 100124,
    "date": "14-Mar-14",
    "purchase": "soap",
    "category": "household",
    "amount": "58.73",
    "related_items": "body wash",
    "frequently_bought_together": "bar soap",
    "city": "Chicago",
    "state": "IL",
    "zip_code": 60603,
    "lat_lon": "41.88,-87.63"
  },
  {
    "customer_id": 100119,
    "date": "15-Mar-14",
    "purchase": "flower pot",
    "category": "outdoor",
    "amount": "26.66",
    "related_items": "plant food",
    "frequently_bought_together": "tomato stakes",
    "city": "San Antonio",
    "state": "TX",
    "zip_code": 78073,
    "lat_lon": "29.227,-98.609"
  },
  {
    "customer_id": 100116,
    "date": "16-Mar-14",
    "purchase": "car wash",
    "category": "outdoor",
    "amount": "72.4",
    "related_items": "sponge",
    "frequently_bought_together": "sponge",
    "city": "San Diego",
    "state": "CA",
    "zip_code": 92027,
    "lat_lon": "33.143,-117.03"
  },
  {
    "customer_id": 100132,
    "date": "17-Mar-14",
    "purchase": "cell phone",
    "category": "electronics",
    "amount": "45.11",
    "related_items": "cell phone case",
    "frequently_bought_together": "cell phone case",
    "city": "New York City",
    "state": "NY",
    "zip_code": 10013,
    "lat_lon": "40.721,-74.005"
  },
  {
    "customer_id": 100106,
    "date": "18-Mar-14",
    "purchase": "microwave",
    "category": "appliances",
    "amount": "26.48",
    "related_items": "popcorn",
    "frequently_bought_together": "forks",
    "city": "Dallas",
    "state": "TX",
    "zip_code": 75063,
    "lat_lon": "32.917,-96.973"
  },
  {
    "customer_id": 100103,
    "date": "19-Mar-14",
    "purchase": "jackets",
    "category": "clothing",
    "amount": "46.02",
    "related_items": "scarfs",
    "frequently_bought_together": "scarfs",
    "city": "Houston",
    "state": "TX",
    "zip_code": 77012,
    "lat_lon": "29.72,-95.279"
  },
  {
    "customer_id": 100192,
    "date": "20-Mar-14",
    "purchase": "laptop",
    "category": "elect^ronics",
    "amount": "33.93",
    "related_items": "wireless mouse",
    "frequently_bought_together": "headphones",
    "city": "San Jose",
    "state": "CA",
    "zip_code": 94560,
    "lat_lon": "37.536,-122.034"
  },
  {
    "customer_id": 100180,
    "date": "21-Mar-14",
    "purchase": "jackets",
    "category": "clothing",
    "amount": "48.03",
    "related_items": "mittens",
    "frequently_bought_together": "winter gloves",
    "city": "San Diego",
    "state": "CA",
    "zip_code": 92027,
    "lat_lon": "33.143,-117.03"
  },
  {
    "customer_id": 100165,
    "date": "22-Mar-14",
    "purchase": "detergent",
    "category": "household",
    "amount": "81.56",
    "related_items": "dryer sheets",
    "frequently_bought_together": "fabric softener",
    "city": "San Diego",
    "state": "CA",
    "zip_code": 91911,
    "lat_lon": "32.609,-117.061"
  },
  {
    "customer_id": 100116,
    "date": "23-Mar-14",
    "purchase": "car wash",
    "category": "outdoor",
    "amount": "23.34",
    "related_items": "towels",
    "frequently_bought_together": "wax",
    "city": "San Diego",
    "state": "CA",
    "zip_code": 92027,
    "lat_lon": "33.143,-117.03"
  },
  {
    "customer_id": 100149,
    "date": "24-Mar-14",
    "purchase": "cell phone",
    "category": "electronics",
    "amount": "70.36",
    "related_items": "charger",
    "frequently_bought_together": "cell phone case",
    "city": "Houston",
    "state": "TX",
    "zip_code": 77009,
    "lat_lon": "29.793,-95.367"
  },
  {
    "customer_id": 100118,
    "date": "25-Mar-14",
    "purchase": "shirts",
    "category": "clothing",
    "amount": "82.35",
    "related_items": "shortsleeve",
    "frequently_bought_together": "shortsleeve",
    "city": "Philadelphia",
    "state": "PA",
    "zip_code": 19102,
    "lat_lon": "39.953,-75.166"
  },
  {
    "customer_id": 100128,
    "date": "26-Mar-14",
    "purchase": "detergent",
    "category": "household",
    "amount": "94.16",
    "related_items": "dryer sheets",
    "frequently_bought_together": "bleach",
    "city": "Phoenix",
    "state": "AZ",
    "zip_code": 60603,
    "lat_lon": "33.507,-112.103"
  },
  {
    "customer_id": 100142,
    "date": "27-Mar-14",
    "purchase": "detergent",
    "category": "household",
    "amount": "91.23",
    "related_items": "bleach",
    "frequently_bought_together": "fabric softener",
    "city": "Phoenix",
    "state": "AZ",
    "zip_code": 85001,
    "lat_lon": "33.704,-112.352"
  },
  {
    "customer_id": 100195,
    "date": "28-Mar-14",
    "purchase": "car wash",
    "category": "outdoor",
    "amount": "81.17",
    "related_items": "buckets",
    "frequently_bought_together": "towels",
    "city": "Philadelphia",
    "state": "PA",
    "zip_code": 19102,
    "lat_lon": "39.953,-75.166"
  },
  {
    "customer_id": 100123,
    "date": "29-Mar-14",
    "purchase": "shirts",
    "category": "clothing",
    "amount": "89.68",
    "related_items": "collared shirt",
    "frequently_bought_together": "shortsleeve",
    "city": "Los Angeles",
    "state": "CA",
    "zip_code": 90020,
    "lat_lon": "34.066,-118.309"
  },
  {
    "customer_id": 100145,
    "date": "30-Mar-14",
    "purchase": "blender",
    "category": "appliances",
    "amount": "74.52",
    "related_items": "pitcher",
    "frequently_bought_together": "straws",
    "city": "Chicago",
    "state": "IL",
    "zip_code": 60603,
    "lat_lon": "41.88,-87.63"
  },
  {
    "customer_id": 100199,
    "date": "31-Mar-14",
    "purchase": null,
    "category": "clothing",
    "amount": "26.81",
    "related_items": "sunglasses",
    "frequently_bought_together": "sunglasses",
    "city": "Dallas",
    "state": "TX",
    "zip_code": 75089,
    "lat_lon": "32.924,-96.547"
  },
  {
    "customer_id": 100167,
    "date": "1-Apr-14",
    "purchase": null,
    "category": "appliances",
    "amount": "58.97",
    "related_items": "cookbook",
    "frequently_bought_together": "tupperware",
    "city": "New York City",
    "state": "NY",
    "zip_code": 10004,
    "lat_lon": "40.699,-74.041"
  },
  {
    "customer_id": 100187,
    "date": "2-Apr-14",
    "purchase": null,
    "category": "outdoor",
    "amount": "90.37",
    "related_items": "rakes",
    "frequently_bought_together": "gardening gloves",
    "city": "Chicago",
    "state": "IL",
    "zip_code": 91911,
    "lat_lon": "41.905,-87.625"
  },
  {
    "customer_id": 100156,
    "date": "3-Apr-14",
    "purchase": null,
    "category": "electronics",
    "amount": "37.32",
    "related_items": "charger",
    "frequently_bought_together": "headphones",
    "city": "San Diego",
    "state": "CA",
    "zip_code": 91911,
    "lat_lon": "32.609,-117.061"
  },
  {
    "customer_id": 100166,
    "date": "4-Apr-14",
    "purchase": null,
    "category": "electronics",
    "amount": "44.31",
    "related_items": "surround sound",
    "frequently_bought_together": "dvd player",
    "city": "Philadelphia",
    "state": "PA",
    "zip_code": 19115,
    "lat_lon": "40.093,-75.041"
  },
  {
    "customer_id": 100124,
    "date": "5-Apr-14",
    "purchase": null,
    "category": "appliances",
    "amount": "53.94",
    "related_items": "cookbook",
    "frequently_bought_together": "tongs",
    "city": "Chicago",
    "state": "IL",
    "zip_code": 60603,
    "lat_lon": "41.88,-87.63"
  },
  {
    "customer_id": 100162,
    "date": "6-Apr-14",
    "purchase": "paper products",
    "category": "household",
    "amount": "69.28",
    "related_items": "paper towels",
    "frequently_bought_together": "toilet paper",
    "city": "San Jose",
    "state": "CA",
    "zip_code": 94560,
    "lat_lon": "37.536,-122.034"
  },
  {
    "customer_id": 100187,
    "date": "7-Apr-14",
    "purchase": "audio",
    "category": "electronics",
    "amount": "65.79",
    "related_items": "earbuds",
    "frequently_bought_together": "headphones",
    "city": "Chicago",
    "state": "IL",
    "zip_code": 60611,
    "lat_lon": "41.905,-87.625"
  },
  {
    "customer_id": 100186,
    "date": "8-Apr-14",
    "purchase": "laptop",
    "category": "electronics",
    "amount": "97.6",
    "related_items": "charger",
    "frequently_bought_together": "charger",
    "city": "Houston",
    "state": "TX",
    "zip_code": 77009,
    "lat_lon": "29.793,-95.367"
  },
  {
    "customer_id": 100190,
    "date": "9-April-2014",
    "purchase": "tv",
    "category": "electronics",
    "amount": "61.33",
    "related_items": "speakers",
    "frequently_bought_together": "dvd player",
    "city": "Dallas",
    "state": "TX",
    "zip_code": 75089,
    "lat_lon": "32.924,-96.547"
  },
  {
    "customer_id": 100152,
    "date": "10-April-2014",
    "purchase": "tools",
    "category": "household",
    "amount": "90.48",
    "related_items": "hammers",
    "frequently_bought_together": "drills ",
    "city": "Dallas",
    "state": "TX",
    "zip_code": 75126,
    "lat_lon": "32.745,-96.46"
  },
  {
    "customer_id": 100103,
    "date": "11-April-2014",
    "purchase": "slow cooker",
    "category": "appliances",
    "amount": "86.52",
    "related_items": "pot holders",
    "frequently_bought_together": "cookbook",
    "city": "Houston",
    "state": "TX",
    "zip_code": 77012,
    "lat_lon": "29.72,-95.279"
  },
  {
    "customer_id": 100194,
    "date": "12-April-2014",
    "purchase": "camera",
    "category": "electronics",
    "amount": "67.2",
    "related_items": "editing software",
    "frequently_bought_together": "lens cleaner",
    "city": "Philadelphia",
    "state": "PA",
    "zip_code": 19115,
    "lat_lon": "40.093,-75.041"
  },
  {
    "customer_id": 100130,
    "date": "13-April-2014",
    "purchase": "blender",
    "category": "appliances",
    "amount": "25.83",
    "related_items": "pitcher",
    "frequently_bought_together": "fruit",
    "city": "San Diego",
    "state": "CA",
    "zip_code": 92107,
    "lat_lon": "32.741,-117.244"
  },
  {
    "customer_id": 100152,
    "date": "14-April-2014",
    "purchase": "car wash",
    "category": "outdoor",
    "amount": "50.12",
    "related_items": "tire cleaner",
    "frequently_bought_together": "wax",
    "city": "Dallas",
    "state": "TX",
    "zip_code": 75126,
    "lat_lon": "32.745,-96.46"
  },
  {
    "customer_id": 100140,
    "date": "15-April-2014",
    "purchase": "shirts",
    "category": "clothing",
    "amount": "53.09",
    "related_items": "t-shirt",
    "frequently_bought_together": "button down shirt",
    "city": "San Diego",
    "state": "CA",
    "zip_code": 92037,
    "lat_lon": "32.839,-117.262"
  }
]

After getting a feel for the data and the different fields, I utilized the Pandas library to import the data into a Pandas dataframe. This is a table-like format in Pandas that makes the data easier to work with for cleaning and reshaping purposes.

import pandas as pd
df = pd.read_json("customer_data_example.json")

df
amount category city customer_id date frequently_bought_together lat_lon purchase related_items state zip_code
0 24.64 household Chicago 100191 2014-01-01 towels 41.86,-87.619 None towels IL 60605
1 35.00 clothing Dallas 100199 2014-01-02 sandals 32.924,-96.547 shorts belts TX 75089
2 89.72 outdoor Philadelphia 100170 2014-01-03 lawn bags 40.002,-75.118 lawn_mower shovels PA 19019
3 51.32 electronics Chicago 100124 2014-01-04 headphones 41.88,-87.63 laptop headphones IL 60603
4 81.75 outdoor Philadelphia 100173 2014-01-05 sponge 39.953,-75.166 car wash sponge PA 19102
5 29.16 outdoor San Diego 100116 2014-01-06 fertilizer 33.143,-117.03 lawn mower rakes CA 92027
6 50.71 outdoor Dallas 100105 2014-01-07 bbq sauce 32.745,-96.46 grill grill cleaner TX 75126
7 35.03 household San Antonio 100148 2014-01-08 spray bottles 29.502,-98.306 household cleaner spray bottles TX 78109
8 30.55 appliances Philadelphia 100118 2014-01-09 pot holders 39.953,-75.166 None tupperware PA 19102
9 92.01 electronics Dallas 100106 2014-01-10 camera lens 32.917,-96.973 camera lens cleaner TX 75126
10 31.79 outdoor New York City 100109 2014-01-11 gloves 40.699,-74.041 snow shovel boots NY 10004
11 96.87 clothing Chicago 100153 2014-01-12 dress shoes 41.86,-87.619 shoes dress shoes IL 60605
12 41.91 outdoor San Diego 100116 2014-01-13 bbq sauce 33.143,-117.03 grill propane tank CA 92027
13 61.99 appliances San Antonio 100151 2014-01-14 fruit 29.363,-98.49 blender pitcher TX 78214
14 36.11 clothing Chicago 100191 2014-01-15 t shirt 41.86,-87.619 shirts t-shirt IL 60605
15 75.00 appliances San Antonio 100185 2014-01-16 bagels 29.363,-98.49 toaster cream cheese TX 78214
16 87.08 electronics Chicago 100153 2014-01-17 headphones 41.86,-87.619 laptop charger IL 60605
17 41.34 household San Antonio 100151 2014-01-18 fabric softener 29.363,-98.49 detergent fabric softener TX 78214
18 27.37 clothing New York City 100196 2014-01-19 shoe laces 40.699,-74.041 shoes dress shoes NY 10004
19 38.61 electronics Chicago 100124 2014-01-20 surround sound 41.88,-87.63 tv surround sound IL 60603
20 53.80 household New York City 100188 2014-01-21 paper towels 40.717,-73.987 n/a toilet paper NY 10002
21 85.00 appliances San Jose 100192 2014-01-22 butter 37.536,-122.034 toaster bread CA 94560
22 85.92 house SD 100140 2014-01-23 batteries 32.839,-117.262 tools screws CA 92037
23 91.98 clothing San Jose 100159 2014-01-24 sneakers 37.189,-121.705 shoes dress shoes CA 95115
24 64.86 appliances San Antonio 100182 2014-01-25 cookbook 29.852,-98.729 slow cooker tongs TX 78006
25 86.05 outdoor Hou 100158 2014-01-26 rakes 29.718,-95.428 lawn mower gardening gloves TX 77005
26 78.81 clothing San Antonio 100185 2014-01-27 wallet 29.363,-98.49 pants wallet TX 78214
27 78.61 electronics Houston 100103 2014-01-28 headphones 29.72,-95.279 audio headphones TX 77012
28 36.21 clothing Philadelphia 100150 2014-01-29 sunglasses 40.002,-75.118 shorts t-shirts PA 19019
29 66.56 clothing Phoenix 100183 2014-01-30 collared shirt 33.451,-112.071 shirts shortsleeve AZ 85004
... ... ... ... ... ... ... ... ... ... ... ...
75 45.11 electronics New York City 100132 2014-03-17 cell phone case 40.721,-74.005 cell phone cell phone case NY 10013
76 26.48 appliances Dallas 100106 2014-03-18 forks 32.917,-96.973 microwave popcorn TX 75063
77 46.02 clothing Houston 100103 2014-03-19 scarfs 29.72,-95.279 jackets scarfs TX 77012
78 33.93 elect^ronics San Jose 100192 2014-03-20 headphones 37.536,-122.034 laptop wireless mouse CA 94560
79 48.03 clothing San Diego 100180 2014-03-21 winter gloves 33.143,-117.03 jackets mittens CA 92027
80 81.56 household San Diego 100165 2014-03-22 fabric softener 32.609,-117.061 detergent dryer sheets CA 91911
81 23.34 outdoor San Diego 100116 2014-03-23 wax 33.143,-117.03 car wash towels CA 92027
82 70.36 electronics Houston 100149 2014-03-24 cell phone case 29.793,-95.367 cell phone charger TX 77009
83 82.35 clothing Philadelphia 100118 2014-03-25 shortsleeve 39.953,-75.166 shirts shortsleeve PA 19102
84 94.16 household Phoenix 100128 2014-03-26 bleach 33.507,-112.103 detergent dryer sheets AZ 60603
85 91.23 household Phoenix 100142 2014-03-27 fabric softener 33.704,-112.352 detergent bleach AZ 85001
86 81.17 outdoor Philadelphia 100195 2014-03-28 towels 39.953,-75.166 car wash buckets PA 19102
87 89.68 clothing Los Angeles 100123 2014-03-29 shortsleeve 34.066,-118.309 shirts collared shirt CA 90020
88 74.52 appliances Chicago 100145 2014-03-30 straws 41.88,-87.63 blender pitcher IL 60603
89 26.81 clothing Dallas 100199 2014-03-31 sunglasses 32.924,-96.547 None sunglasses TX 75089
90 58.97 appliances New York City 100167 2014-04-01 tupperware 40.699,-74.041 None cookbook NY 10004
91 90.37 outdoor Chicago 100187 2014-04-02 gardening gloves 41.905,-87.625 None rakes IL 91911
92 37.32 electronics San Diego 100156 2014-04-03 headphones 32.609,-117.061 None charger CA 91911
93 44.31 electronics Philadelphia 100166 2014-04-04 dvd player 40.093,-75.041 None surround sound PA 19115
94 53.94 appliances Chicago 100124 2014-04-05 tongs 41.88,-87.63 None cookbook IL 60603
95 69.28 household San Jose 100162 2014-04-06 toilet paper 37.536,-122.034 paper products paper towels CA 94560
96 65.79 electronics Chicago 100187 2014-04-07 headphones 41.905,-87.625 audio earbuds IL 60611
97 97.60 electronics Houston 100186 2014-04-08 charger 29.793,-95.367 laptop charger TX 77009
98 61.33 electronics Dallas 100190 2014-04-09 dvd player 32.924,-96.547 tv speakers TX 75089
99 90.48 household Dallas 100152 2014-04-10 drills 32.745,-96.46 tools hammers TX 75126
100 86.52 appliances Houston 100103 2014-04-11 cookbook 29.72,-95.279 slow cooker pot holders TX 77012
101 67.20 electronics Philadelphia 100194 2014-04-12 lens cleaner 40.093,-75.041 camera editing software PA 19115
102 25.83 appliances San Diego 100130 2014-04-13 fruit 32.741,-117.244 blender pitcher CA 92107
103 50.12 outdoor Dallas 100152 2014-04-14 wax 32.745,-96.46 car wash tire cleaner TX 75126
104 53.09 clothing San Diego 100140 2014-04-15 button down shirt 32.839,-117.262 shirts t-shirt CA 92037

105 rows × 11 columns

I also took a look at the data types for each field.

df.dtypes
amount                               float64
category                              object
city                                  object
customer_id                            int64
date                          datetime64[ns]
frequently_bought_together            object
lat_lon                               object
purchase                              object
related_items                         object
state                                 object
zip_code                               int64
dtype: object

After importing the JSON data, I imported the XML data. I did this by first parsing the XML data to collect the tags and elements to store in a dictionary. The dictionary was then converted to a dataframe.

import xml.etree.ElementTree as ET
tree = ET.parse('location_data.xml')
root = tree.getroot()

# root tag
print(root.tag)

# child tag
print(root[0].tag)

# number of children elements
num_children = len(root.getchildren())
print(num_children)

# number of subchildren elements
num_subchildren = len(root[0].getchildren())
print(num_subchildren)
    
    
# pulling out all of the subchildren tags
tags = []
for subchild in root[0]:
    tags.append(subchild.tag)
print(tags)
    

# creating an empty dictionary to store the data
d = {}
for tag in tags:
    d[tag] = []
print(d)


# pulling out all of the data
for i in range(0, num_children):
    for j in range(0, num_subchildren):
        value = root[i][j].text
        d[tags[j]].append(value)
        
#print(d)


# converting to a dataframe
df = pd.DataFrame(data=d)

print(df)
data-set
record
50
3
['City', 'Zipcode', 'Latitude_Longitude']
{'City': [], 'Zipcode': [], 'Latitude_Longitude': []}
             City Zipcode Latitude_Longitude
0   New York City   10012     40.726,-73.998
1   New York City   10013     40.721,-74.005
2   New York City   10004     40.699,-74.041
3   New York City   10128      40.782,-73.95
4   New York City   10002     40.717,-73.987
5     Los Angeles   90001    33.973,-118.249
6     Los Angeles   90016     34.03,-118.353
7     Los Angeles   90008     34.01,-118.337
8     Los Angeles   90020    34.066,-118.309
9     Los Angeles   90029     34.09,-118.295
10        Chicago   60610     41.899,-87.637
11        Chicago   60611     41.905,-87.625
12        Chicago   60605      41.86,-87.619
13        Chicago   60602     41.883,-87.629
14        Chicago   60603       41.88,-87.63
15        Houston   77001      29.813,-95.31
16        Houston   77005     29.718,-95.428
17        Houston   77009     29.793,-95.367
18        Houston   77004     29.729,-95.366
19        Houston   77012      29.72,-95.279
20   Philadelphia   19019     40.002,-75.118
21   Philadelphia   19102     39.953,-75.166
22   Philadelphia   19110      39.95,-75.164
23   Philadelphia   19115     40.093,-75.041
24   Philadelphia   19118     40.072,-75.208
25        Phoenix   85001    33.704,-112.352
26        Phoenix   85004    33.451,-112.071
27        Phoenix   85015    33.507,-112.103
28        Phoenix   85019    33.512,-112.142
29        Phoenix   85027    33.699,-112.114
30    San Antonio   78006     29.852,-98.729
31    San Antonio   78109     29.502,-98.306
32    San Antonio   78206     29.438,-98.462
33    San Antonio   78214      29.363,-98.49
34    San Antonio   78073     29.227,-98.609
35      San Diego   91911    32.609,-117.061
36      San Diego   92037    32.839,-117.262
37      San Diego   92102    32.715,-117.125
38      San Diego   92107    32.741,-117.244
39      San Diego   92027     33.143,-117.03
40         Dallas   75001     32.961,-96.838
41         Dallas   75043     32.855,-96.602
42         Dallas   75063     32.917,-96.973
43         Dallas   75089     32.924,-96.547
44         Dallas   75126      32.745,-96.46
45       San Jose   94088    37.189,-121.705
46       San Jose   95002    37.427,-121.975
47       San Jose   95103    37.189,-121.705
48       San Jose   95115    37.189,-121.705
49       San Jose   94560    37.536,-122.034

With the JSON and XML data in dataframe format, I could more easily clean and reshape the data.

2. Cleaning the Data

After importing the data I worked through the data cleaning plan that I outlined above. This four step process involved the following steps:

  1. Missing Values
  2. Correct Data Types
  3. Consistent Data
  4. Accurate Data

Each feature was investigated and cleaned for missing values, correct data types, and consistent data. The only data that needed to be checked for accuracy was the “zip_code” data. You can find the full data cleaning notebook for each feature on my github page *Accurate Data.

As an example of my data cleaning process I’ll go through detecting and cleaning missing values for the “purchase” column.

After importing the JSON data in to a pandas dataframe, I summed up the amount of missing values.

# summing the missing values
print(sum(df["purchase"].isna()))
14

Taking a look I could see that there was 14 missing values detected by Pandas. These are default missing values that Pandas detected right away. I was also curious in looking for other potential missing values, so I used the “unique()” method.

# looking for non-standard missing values
print(df["purchase"].unique())
[None 'shorts' 'lawn_mower' 'laptop' 'car wash' 'lawn mower' 'grill'
 'household cleaner' 'camera' 'snow shovel' 'shoes' 'blender' 'shirts'
 'toaster' 'detergent' 'tv' 'n/a' 'tools' 'slow cooker' 'pants' 'audio'
 'microwave' 'food processor' 'soap' '--' 'jackets' 'cell phone' 'N/A'
 'paper products' 'flower pot']

There’s three unique values that look like missing values. Pandas did not detect these. They are the following:

  • ‘n/a’
  • ’–’
  • ‘N/A’

To clean these values I put them in to an array called “missing_values”, and then looped through the “purchase” column to look for those values, and then change them to a new class called “unavailable”.

missing_values = ["n/a", "--", "N/A"]

cnt = 0
for i in df["purchase"]:
    if i in missing_values:
        df.loc[cnt, "purchase"] = "unavailable"
    cnt+=1

print(df["purchase"].unique())
[None 'shorts' 'lawn_mower' 'laptop' 'car wash' 'lawn mower' 'grill'
 'household cleaner' 'camera' 'snow shovel' 'shoes' 'blender' 'shirts'
 'toaster' 'detergent' 'tv' 'unavailable' 'tools' 'slow cooker' 'pants'
 'audio' 'microwave' 'food processor' 'soap' 'jackets' 'cell phone'
 'paper products' 'flower pot']

Taking a look, all three of the non-standard missing values were replaced with the new class label, “unavailable”.

Finally, I finished up by replacing the standard missing value types using the “fillna()” method.

df["purchase"].fillna("unavailable", inplace=True)
print(sum(df["purchase"].isna()))
print(df["purchase"].unique())
0
['unavailable' 'shorts' 'lawn_mower' 'laptop' 'car wash' 'lawn mower'
 'grill' 'household cleaner' 'camera' 'snow shovel' 'shoes' 'blender'
 'shirts' 'toaster' 'detergent' 'tv' 'tools' 'slow cooker' 'pants' 'audio'
 'microwave' 'food processor' 'soap' 'jackets' 'cell phone'
 'paper products' 'flower pot']

Taking a final look at the “purchase” column using the “sum()” and “unique()” methods confirms that all missing values have been detected and changed to “unavailable”.

Sometimes missing data can provide additional information, so rather than throwing it out I decided to label all of it and keep it. I would consult with my team and the marketing group before disposing of it. The missing data could convey valuable information about customer experience, so it’s better to keep that data for now.

For the full data cleaning notebook, please refer to my github page [provide link to your github page here].

3. Reshaping the Data

The final part is reshaping the data. Talk about how you used “pivot_table” to display the data in the correct format. This is the format requested by the hypothetical sales department in this project.

After importing and cleaning the data, the final step was to reshape the data.

During the cleaning portion the cleaned dataset was exported as a clean CSV, so I’m going to import and reshape that clean dataset.

I’ll start by importing the clean dataset, “customer_data_cleaned.csv”.

df = pd.read_csv("customer_data_cleaned.csv")
df.head()
Unnamed: 0 amount category city customer_id date frequently_bought_together lat_lon purchase related_items state zip_code
0 0 24.64 household Chicago 100191 2014-01-01 towels 41.86,-87.619 NaN towels IL 60605
1 1 35.00 clothing Dallas 100199 2014-01-02 sandals 32.924,-96.547 shorts belts TX 75089
2 2 89.72 outdoor Philadelphia 100170 2014-01-03 lawn bags 40.002,-75.118 lawn_mower shovels PA 19019
3 3 51.32 electronics Chicago 100124 2014-01-04 headphones 41.88,-87.63 laptop headphones IL 60603
4 4 81.75 outdoor Philadelphia 100173 2014-01-05 sponge 39.953,-75.166 car wash sponge PA 19102

The marketing department is interested in lookin at how much each customer spends in each category.

To make it easier for them to look at this data, I’ll subset my full dataframe, and reshape it using the “pivot_table” method. Since they’re probably interested in total spending in each category, I’ll use the “sum” method from Numpy as my aggregation function.

# pivot table; aggregation function "sum"
import numpy as np
df_subset = df[["customer_id", "category", "amount"]]
#print(df_subset)

df_pivot = df_subset.pivot_table(index="customer_id", columns="category", values="amount", aggfunc=np.sum)
print(df_pivot)
category     appliances  clothing  electronics  household  outdoor
customer_id                                                       
100100            98.25       NaN          NaN        NaN      NaN
100101            51.46       NaN          NaN        NaN      NaN
100102              NaN       NaN          NaN      70.66      NaN
100103            86.52     46.02        78.61        NaN      NaN
100105              NaN       NaN          NaN        NaN    50.71
100106            26.48       NaN       183.88        NaN      NaN
100107              NaN       NaN          NaN      90.84      NaN
100108              NaN     90.84          NaN        NaN      NaN
100109              NaN       NaN          NaN        NaN    31.79
100111            82.12       NaN          NaN        NaN    77.28
100115              NaN     99.06          NaN        NaN      NaN
100116              NaN       NaN          NaN        NaN   166.81
100118            30.55     82.35          NaN        NaN      NaN
100119              NaN       NaN          NaN        NaN    26.66
100120              NaN     86.29        41.18        NaN      NaN
100123            34.57     89.68          NaN        NaN      NaN
100124            53.94       NaN        89.93      58.73      NaN
100128              NaN       NaN          NaN      94.16      NaN
100129              NaN     64.70          NaN        NaN      NaN
100130            25.83       NaN          NaN        NaN      NaN
100131              NaN       NaN          NaN        NaN    49.41
100132              NaN       NaN       142.65        NaN      NaN
100133              NaN     23.69          NaN        NaN    49.57
100136              NaN     79.43          NaN        NaN      NaN
100138              NaN       NaN          NaN      28.74      NaN
100140            85.23     53.09          NaN      85.92      NaN
100142              NaN       NaN          NaN      91.23    59.02
100143              NaN     52.99          NaN        NaN      NaN
100145            74.52       NaN          NaN        NaN      NaN
100148              NaN       NaN          NaN      35.03      NaN
...                 ...       ...          ...        ...      ...
100153              NaN     96.87        87.08        NaN      NaN
100154              NaN       NaN        21.32        NaN      NaN
100156              NaN       NaN        37.32        NaN      NaN
100158              NaN       NaN          NaN        NaN   171.74
100159              NaN     91.98          NaN        NaN      NaN
100160              NaN       NaN        89.36        NaN      NaN
100162            70.18     76.08          NaN      69.28      NaN
100165              NaN       NaN          NaN     142.68      NaN
100166              NaN       NaN        44.31        NaN      NaN
100167            91.86       NaN          NaN      76.69      NaN
100168              NaN     58.43          NaN        NaN      NaN
100170              NaN       NaN          NaN        NaN    89.72
100173              NaN       NaN          NaN        NaN    81.75
100180              NaN     48.03          NaN        NaN      NaN
100182            64.86       NaN          NaN        NaN      NaN
100183              NaN     66.56          NaN        NaN      NaN
100185            75.00     78.81        54.64        NaN      NaN
100186            63.54       NaN        97.60        NaN      NaN
100187              NaN       NaN        65.79        NaN    90.37
100188              NaN       NaN          NaN      53.80    95.85
100189            90.89       NaN          NaN        NaN      NaN
100190              NaN       NaN        61.33        NaN      NaN
100191              NaN    159.16          NaN      24.64      NaN
100192            85.00     78.11        80.27        NaN      NaN
100193            97.59       NaN          NaN        NaN      NaN
100194              NaN     77.83        67.20        NaN      NaN
100195              NaN       NaN          NaN        NaN    81.17
100196              NaN     27.37          NaN        NaN      NaN
100198              NaN     76.43          NaN        NaN      NaN
100199              NaN     61.81          NaN        NaN      NaN

[64 rows x 5 columns]

Taking a look, we now have a more neatly shaped dataframe for the marketing department. The total amount that each customer has spent in each category is shown. A lot of customers didn’t spend any money in certain categories, which is represented by “NaN”.

Conclusion

In this project I cleaned a messy JSON dataset. I also worked with XML data, and delivered my cleaned dataset in CSV format.

I worked through a data cleaning plan that involved investigating and cleaning the dataset for missing values, incorrect data types, inconsistent data, and data inaccuracies.

Finally, the data was reshaped to more concisely organize the data in the format requested by the marketing department. This reshaped format would make it easier for them to investigate spending trends of each customer based on the various categories.

It’s important to investigate and clean data because dirty data can lead to errors during exploratory analysis and machine learning.

Used Tools:

  • Totorial to build a Python Data Wrangling Project and build a new website
  • Git
  • GitHub
  • Atom
  • Jupyter Notebook
  • Python Coding
  • Anaconda
  • 04-2020/ Nick Scherer.