Data Wrangling Project
Created with: Git, Github, Atom, Anaconda, Jupyter Notebook and Python programming Language
What is Data Wrangling?
Data Wrangling is the process of converting and mapping data from its raw form to another format with the purpose of making it more valuable and appropriate for advance tasks such as Data Analytics and Machine Learning. The goals of data wrangling:
- Reveal a “deeper intelligence” within your data, by gathering data from multiple sources
- Provide accurate, actionable data in the hands of business analysts in a timely matter
- Reduce the time spent collecting and organizing unruly data before it can be utilized
- Enable data scientists and analysts to focus on the analysis of data, rather than the wrangling
- Drive better decision-making skills by senior leaders in an organization
- Data Scientist’s use most of their time (up to 80%) in Data Wrangling to prepare data for Machine learning.
Introduction
The purpose of this project was to clean a messy dataset for a marketing department at an ecommerce company. The marketing company analyzes the data in detail to provide business decisions. My task was to clean the data for them so that they would have a reliable dataset to analyze. The final deliverable to the marketing group was a clean dataset in tabular format that they could easily work with.
I worked with various data formats, including JSON and XML. To perform the data cleaning tasks I utilized Python, along with useful data cleaning libraries such as Pandas. All of my work was neatly documented and organized using Jupyter Notebooks.
Data Overview
In this project I worked with a JSON dataset and an XML dataset. The JSON dataset contained customer information with various fields. These fields included the following:
- customer_id
- date
- purchase
- category
- amount
- related_items
- frequently_bought_together
- city
- state
- zip_code
- lat_lon
The JSON dataset with the customer information is the messy dataset that needs to be cleaned.
An XML file with location data was provided by a different part of the data cleaning team. This file contains the following fields:
- City
- Zipcode
- Latitude_Longitude
The other members of the data cleaning team have assured me that this dataset is accurate. They also said that the “city”, “state”, and “lat_lon” fields are accurate in the JSON file, but the “zip_code” field is not. The “lat_lon” data in the JSON file will be cross-referenced with the location data in the XML file to correct innacurate zipcode data.
Data Cleaning Plan
To clean the data, I used a three step data cleaning plan that consisted of the following:
- Importing the data
- Cleaning the data
- Reshaping the data
To clean the data, I dealt with the following seven aspects of data wrangling:
- Consistent Data
- Missing Values
- Constant Strings
- Correct Types
- Importig XML file
- Pivot Table
- Accurate Data
Importing the Data
The first step was to import the data. The customer information data was in JSON format. I utilized the JSON library to take a look at the data.
import json
with open("customer_data_example.json") as f_in:
data = json.load(f_in)
print(json.dumps(data, indent=2))
[
{
"customer_id": 100191,
"date": "1-Jan-14",
"purchase": "soap",
"category": "household",
"amount": "24.64",
"related_items": "towels",
"frequently_bought_together": "towels",
"city": "Chicago",
"state": "IL",
"zip_code": 60605,
"lat_lon": "41.86,-87.619"
},
{
"customer_id": 100199,
"date": "2-Jan-14",
"purchase": "shorts",
"category": "clothing",
"amount": "35",
"related_items": "belts",
"frequently_bought_together": "sandals",
"city": "Dallas",
"state": "TX",
"zip_code": 75089,
"lat_lon": "32.924,-96.547"
},
{
"customer_id": 100170,
"date": "3-Jan-14",
"purchase": "lawn_mower",
"category": "outdoor",
"amount": "89.72",
"related_items": "shovels",
"frequently_bought_together": "lawn bags",
"city": "Philadelphia",
"state": "PA",
"zip_code": 19019,
"lat_lon": "40.002,-75.118"
},
{
"customer_id": 100124,
"date": "4-Jan-14",
"purchase": "laptop",
"category": "electronics",
"amount": "51.32",
"related_items": "headphones",
"frequently_bought_together": "headphones",
"city": "Chicago",
"state": "IL",
"zip_code": 60603,
"lat_lon": "41.88,-87.63"
},
{
"customer_id": 100173,
"date": "5-Jan-14",
"purchase": "car wash",
"category": "outdoor",
"amount": "81.75",
"related_items": "sponge",
"frequently_bought_together": "sponge",
"city": "Philadelphia",
"state": "PA",
"zip_code": 19102,
"lat_lon": "39.953,-75.166"
},
{
"customer_id": 100116,
"date": "6-Jan-14",
"purchase": "lawn mower",
"category": "outdoor",
"amount": "29.16",
"related_items": "rakes",
"frequently_bought_together": "fertilizer",
"city": "San Diego",
"state": "CA",
"zip_code": 92027,
"lat_lon": "33.143,-117.03"
},
{
"customer_id": 100105,
"date": "7-Jan-14",
"purchase": "grill",
"category": "outdoor",
"amount": "50.71",
"related_items": "grill cleaner",
"frequently_bought_together": "bbq sauce",
"city": "Dallas",
"state": "TX",
"zip_code": 75126,
"lat_lon": "32.745,-96.46"
},
{
"customer_id": 100148,
"date": "8-Jan-14",
"purchase": "household cleaner",
"category": "household",
"amount": "35.03",
"related_items": "spray bottles",
"frequently_bought_together": "spray bottles",
"city": "San Antonio",
"state": "TX",
"zip_code": 78109,
"lat_lon": "29.502,-98.306"
},
{
"customer_id": 100118,
"date": "9-Jan-14",
"purchase": "slow cooker",
"category": "appliances",
"amount": "30.55",
"related_items": "tupperware",
"frequently_bought_together": "pot holders",
"city": "Philadelphia",
"state": "PA",
"zip_code": 19102,
"lat_lon": "39.953,-75.166"
},
{
"customer_id": 100106,
"date": "10-Jan-14",
"purchase": "camera",
"category": "electronics",
"amount": "92.01",
"related_items": "lens cleaner",
"frequently_bought_together": "camera lens",
"city": "Dallas",
"state": "TX",
"zip_code": 75126,
"lat_lon": "32.917,-96.973"
},
{
"customer_id": 100109,
"date": "11-Jan-14",
"purchase": "snow shovel",
"category": "outdoor",
"amount": "31.79",
"related_items": "boots",
"frequently_bought_together": "gloves",
"city": "New York City",
"state": "NY",
"zip_code": 10004,
"lat_lon": "40.699,-74.041"
},
{
"customer_id": 100153,
"date": "12-Jan-14",
"purchase": "shoes",
"category": "clothing",
"amount": "96.87",
"related_items": "dress shoes",
"frequently_bought_together": "dress shoes",
"city": "Chicago",
"state": "IL",
"zip_code": 60605,
"lat_lon": "41.86,-87.619"
},
{
"customer_id": 100116,
"date": "13-Jan-14",
"purchase": "grill",
"category": "outdoor",
"amount": "41.91",
"related_items": "propane tank",
"frequently_bought_together": "bbq sauce",
"city": "San Diego",
"state": "CA",
"zip_code": 92027,
"lat_lon": "33.143,-117.03"
},
{
"customer_id": 100151,
"date": "14-Jan-14",
"purchase": "blender",
"category": "appliances",
"amount": "61.99",
"related_items": "pitcher",
"frequently_bought_together": "fruit",
"city": "San Antonio",
"state": "TX",
"zip_code": 78214,
"lat_lon": "29.363,-98.49"
},
{
"customer_id": 100191,
"date": "15-Jan-14",
"purchase": "shirts",
"category": "clothing",
"amount": "36.11",
"related_items": "t-shirt",
"frequently_bought_together": "t shirt",
"city": "Chicago",
"state": "IL",
"zip_code": 60605,
"lat_lon": "41.86,-87.619"
},
{
"customer_id": 100185,
"date": "16-Jan-14",
"purchase": "toaster",
"category": "appliances",
"amount": "75",
"related_items": "cream cheese",
"frequently_bought_together": "bagels",
"city": "San Antonio",
"state": "TX",
"zip_code": 78214,
"lat_lon": "29.363,-98.49"
},
{
"customer_id": 100153,
"date": "17-Jan-14",
"purchase": "laptop",
"category": "electronics",
"amount": "87.08",
"related_items": "charger",
"frequently_bought_together": "headphones",
"city": "Chicago",
"state": "IL",
"zip_code": 60605,
"lat_lon": "41.86,-87.619"
},
{
"customer_id": 100151,
"date": "18-Jan-14",
"purchase": "detergent",
"category": "household",
"amount": "41.34",
"related_items": "fabric softener",
"frequently_bought_together": "fabric softener",
"city": "San Antonio",
"state": "TX",
"zip_code": 78214,
"lat_lon": "29.363,-98.49"
},
{
"customer_id": 100196,
"date": "19-Jan-14",
"purchase": "shoes",
"category": "clothing",
"amount": "27.37",
"related_items": "dress shoes",
"frequently_bought_together": "shoe laces",
"city": "New York City",
"state": "NY",
"zip_code": 10004,
"lat_lon": "40.699,-74.041"
},
{
"customer_id": 100124,
"date": "20-Jan-14",
"purchase": "tv",
"category": "electronics",
"amount": "38.61",
"related_items": "surround sound",
"frequently_bought_together": "surround sound",
"city": "Chicago",
"state": "IL",
"zip_code": 60603,
"lat_lon": "41.88,-87.63"
},
{
"customer_id": 100188,
"date": "21-Jan-14",
"purchase": "paper products",
"category": "household",
"amount": "53.8",
"related_items": "toilet paper",
"frequently_bought_together": "paper towels",
"city": "New York City",
"state": "NY",
"zip_code": 10002,
"lat_lon": "40.717,-73.987"
},
{
"customer_id": 100192,
"date": "22-Jan-14",
"purchase": "toaster",
"category": "appliances",
"amount": "85",
"related_items": "bread",
"frequently_bought_together": "butter",
"city": "San Jose",
"state": "CA",
"zip_code": 94560,
"lat_lon": "37.536,-122.034"
},
{
"customer_id": 100140,
"date": "23-Jan-14",
"purchase": "tools",
"category": "house",
"amount": "85.92",
"related_items": "screws",
"frequently_bought_together": "batteries",
"city": "SD",
"state": "CA",
"zip_code": 92037,
"lat_lon": "32.839,-117.262"
},
{
"customer_id": 100159,
"date": "24-Jan-14",
"purchase": "shoes",
"category": "clothing",
"amount": "91.98",
"related_items": "dress shoes",
"frequently_bought_together": "sneakers",
"city": "San Jose",
"state": "CA",
"zip_code": 95115,
"lat_lon": "37.189,-121.705"
},
{
"customer_id": 100182,
"date": "25-Jan-14",
"purchase": "slow cooker",
"category": "appliances",
"amount": "64.86",
"related_items": "tongs",
"frequently_bought_together": "cookbook",
"city": "San Antonio",
"state": "TX",
"zip_code": 78006,
"lat_lon": "29.852,-98.729"
},
{
"customer_id": 100158,
"date": "26-Jan-14",
"purchase": "lawn mower",
"category": "outdoor",
"amount": "86.05",
"related_items": "gardening gloves",
"frequently_bought_together": "rakes",
"city": "Hou",
"state": "TX",
"zip_code": 77005,
"lat_lon": "29.718,-95.428"
},
{
"customer_id": 100185,
"date": "27-Jan-14",
"purchase": "pants",
"category": "clothing",
"amount": "78.81",
"related_items": "wallet",
"frequently_bought_together": "wallet",
"city": "San Antonio",
"state": "TX",
"zip_code": 78214,
"lat_lon": "29.363,-98.49"
},
{
"customer_id": 100103,
"date": "28-Jan-14",
"purchase": "audio",
"category": "electronics",
"amount": "78.61",
"related_items": "headphones",
"frequently_bought_together": "headphones",
"city": "Houston",
"state": "TX",
"zip_code": 77012,
"lat_lon": "29.72,-95.279"
},
{
"customer_id": 100150,
"date": "29-Jan-14",
"purchase": "shorts",
"category": "clothing",
"amount": "36.21",
"related_items": "t-shirts",
"frequently_bought_together": "sunglasses",
"city": "Philadelphia",
"state": "PA",
"zip_code": 19019,
"lat_lon": "40.002,-75.118"
},
{
"customer_id": 100183,
"date": "30-Jan-14",
"purchase": "shirts",
"category": "clothing",
"amount": "66.56",
"related_items": "shortsleeve",
"frequently_bought_together": "collared shirt",
"city": "Phoenix",
"state": "AZ",
"zip_code": 85004,
"lat_lon": "33.451,-112.071"
},
{
"customer_id": 100111,
"date": "31-Jan-14",
"purchase": "snow shovel",
"category": "outdoor",
"amount": "77.28",
"related_items": "mittens",
"frequently_bought_together": "gloves",
"city": "Phoenix",
"state": "AZ",
"zip_code": 85019,
"lat_lon": "33.512,-112.142"
},
{
"customer_id": 100160,
"date": "1-Feb-14",
"purchase": "tv",
"category": "electronics",
"amount": "89.36",
"related_items": "speakers",
"frequently_bought_together": "surround sound",
"city": "San Jose",
"state": "CA",
"zip_code": 94560,
"lat_lon": "37.536,-122.034"
},
{
"customer_id": 100133,
"date": "2-Feb-14",
"purchase": "shirts",
"category": "clothing",
"amount": "23.69",
"related_items": "collared shirt",
"frequently_bought_together": "dress shirt",
"city": "Los Angeles",
"state": "CA",
"zip_code": 90020,
"lat_lon": "34.066,-118.309"
},
{
"customer_id": 100158,
"date": "3-Feb-14",
"purchase": "snow shovel",
"category": "outdoor",
"amount": "85.69",
"related_items": "mittens",
"frequently_bought_together": "sand",
"city": "Houston",
"state": "TX",
"zip_code": 77005,
"lat_lon": "29.718,-95.428"
},
{
"customer_id": 100193,
"date": "4-Feb-14",
"purchase": "microwave",
"category": "appliances",
"amount": "97.59",
"related_items": "popcorn",
"frequently_bought_together": "egg cooker",
"city": "San Jose",
"state": "CA",
"zip_code": 95115,
"lat_lon": "37.189,-121.705"
},
{
"customer_id": 100167,
"date": "5-Feb-14",
"purchase": "food processor",
"category": "appliances",
"amount": "32.89",
"related_items": "vegetable peeler",
"frequently_bought_together": "vegetable peeler",
"city": "New York City",
"state": "NY",
"zip_code": 10004,
"lat_lon": "40.699,-74.041"
},
{
"customer_id": 100102,
"date": "6-Feb-14",
"purchase": "soap",
"category": "house",
"amount": "70.66",
"related_items": "shampoo",
"frequently_bought_together": "loofah",
"city": "Houston",
"state": "TX",
"zip_code": 77009,
"lat_lon": "29.793,-95.367"
},
{
"customer_id": 100167,
"date": "7-Feb-14",
"purchase": "soap",
"category": "house",
"amount": "76.69",
"related_items": "body wash",
"frequently_bought_together": "loofah",
"city": "New York City",
"state": "NY",
"zip_code": 10004,
"lat_lon": "40.699,-74.041"
},
{
"customer_id": 100185,
"date": "8-Feb-14",
"purchase": "camera",
"category": "electronics",
"amount": "54.64",
"related_items": "editing software",
"frequently_bought_together": "camera case",
"city": "San Antonio",
"state": "TX",
"zip_code": 78214,
"lat_lon": "29.363,-98.49"
},
{
"customer_id": 100140,
"date": "9-Feb-14",
"purchase": "food processor",
"category": "appliances",
"amount": "85.23",
"related_items": "vegetable peeler",
"frequently_bought_together": "vegetables",
"city": "San Diego",
"state": "CA",
"zip_code": 92037,
"lat_lon": "32.839,-117.262"
},
{
"customer_id": 100162,
"date": "10-Feb-14",
"purchase": "food processor",
"category": "appliances",
"amount": "70.18",
"related_items": "vegetables",
"frequently_bought_together": "tupperware",
"city": "San Jose",
"state": "CA",
"zip_code": 94560,
"lat_lon": "37.536,-122.034"
},
{
"customer_id": 100123,
"date": "11-Feb-14",
"purchase": "slow cooker",
"category": "appliances",
"amount": "34.57",
"related_items": "cookbook",
"frequently_bought_together": "serving spoon",
"city": "Los Angeles",
"state": "CA",
"zip_code": 90020,
"lat_lon": "34.066,-118.309"
},
{
"customer_id": 100186,
"date": "12-Feb-14",
"purchase": "slow cooker",
"category": "appliances",
"amount": "63.54",
"related_items": "tongs",
"frequently_bought_together": "tupperware",
"city": "Houston",
"state": "TX",
"zip_code": 77009,
"lat_lon": "29.793,-95.367"
},
{
"customer_id": 100198,
"date": "13-Feb-14",
"purchase": "pants",
"category": "clothing",
"amount": "76.43",
"related_items": "khakis ",
"frequently_bought_together": "wallet",
"city": "San Antonio",
"state": "TX",
"zip_code": 78206,
"lat_lon": "29.438,-98.462"
},
{
"customer_id": 100188,
"date": "14-Feb-14",
"purchase": "snow shovel",
"category": "outdoor",
"amount": "95.85",
"related_items": "mittens",
"frequently_bought_together": "sand",
"city": "New York City",
"state": "NY",
"zip_code": 10002,
"lat_lon": "40.717,-73.987"
},
{
"customer_id": 100136,
"date": "15-Feb-14",
"purchase": "pants",
"category": "clothing",
"amount": "79.43",
"related_items": "khakis ",
"frequently_bought_together": "wallet",
"city": "Chicago",
"state": "IL",
"zip_code": 60602,
"lat_lon": "41.883,-87.629"
},
{
"customer_id": 100162,
"date": "16-Feb-14",
"purchase": "shirts",
"category": "clothing",
"amount": "76.08",
"related_items": "t-shirt",
"frequently_bought_together": "dress shirt",
"city": "San Jose",
"state": "CA",
"zip_code": 94560,
"lat_lon": "37.536,-122.034"
},
{
"customer_id": 100191,
"date": "17-Feb-14",
"purchase": "shirts",
"category": "clothing",
"amount": "57.33",
"related_items": "shortsleeve",
"frequently_bought_together": "button down shirt",
"city": "Chicago",
"state": "IL",
"zip_code": 60605,
"lat_lon": "41.86,-87.619"
},
{
"customer_id": 100120,
"date": "18-Feb-14",
"purchase": "jackets",
"category": "clothing",
"amount": "86.29",
"related_items": "scarfs",
"frequently_bought_together": "winter gloves",
"city": "Dallas",
"state": "TX",
"zip_code": 75001,
"lat_lon": "32.961,-96.838"
},
{
"customer_id": 100106,
"date": "19-Feb-14",
"purchase": "cell phone",
"category": "electronics",
"amount": "91.87",
"related_items": "screen cleaner",
"frequently_bought_together": "cell phone case",
"city": "Dallas",
"state": "TX",
"zip_code": 75063,
"lat_lon": "32.917,-96.973"
},
{
"customer_id": 100192,
"date": "20-Feb-14",
"purchase": "pants",
"category": "clothing",
"amount": "78.11",
"related_items": "wallet",
"frequently_bought_together": "wallet",
"city": "San Jose",
"state": "CA",
"zip_code": 94560,
"lat_lon": "37.536,-122.034"
},
{
"customer_id": 100107,
"date": "21-Feb-14",
"purchase": "paper products",
"category": "household",
"amount": "90.84",
"related_items": "paper plates",
"frequently_bought_together": "paper towels",
"city": "San Diego",
"state": "CA",
"zip_code": 92102,
"lat_lon": "32.715,-117.125"
},
{
"customer_id": 100142,
"date": "22-Feb-14",
"purchase": "flower pot",
"category": "outdoor",
"amount": "59.02",
"related_items": "plant food",
"frequently_bought_together": "flower seeds",
"city": "Phoenix",
"state": "AZ",
"zip_code": 85001,
"lat_lon": "33.704,-112.352"
},
{
"customer_id": 100131,
"date": "23-Feb-14",
"purchase": "grill",
"category": "outdoor",
"amount": "49.41",
"related_items": "grill cleaner",
"frequently_bought_together": "steak seasoning",
"city": "San Antonio",
"state": "TX",
"zip_code": 78206,
"lat_lon": "29.438,-98.462"
},
{
"customer_id": 100115,
"date": "24-Feb-14",
"purchase": "pants",
"category": "clothing",
"amount": "99.06",
"related_items": "slacks",
"frequently_bought_together": "khakis ",
"city": "New York City",
"state": "NY",
"zip_code": 10013,
"lat_lon": "40.721,-74.005"
},
{
"customer_id": 100194,
"date": "25-Feb-14",
"purchase": "shirts",
"category": "clothing",
"amount": "77.83",
"related_items": "button down shirt",
"frequently_bought_together": "dress shirt",
"city": "Philadelphia",
"state": "PA",
"zip_code": 19115,
"lat_lon": "40.093,-75.041"
},
{
"customer_id": 100133,
"date": "26-Feb-14",
"purchase": "car wash",
"category": "outdoor",
"amount": "49.57",
"related_items": "wax",
"frequently_bought_together": "tire cleaner",
"city": "Los Angeles",
"state": "CA",
"zip_code": 90020,
"lat_lon": "34.066,-118.309"
},
{
"customer_id": 100108,
"date": "27-Feb-14",
"purchase": "pants",
"category": "clothing",
"amount": "90.84",
"related_items": "slacks",
"frequently_bought_together": "wallet",
"city": "San Diego",
"state": "CA",
"zip_code": 91911,
"lat_lon": "32.609,-117.061"
},
{
"customer_id": 100120,
"date": "28-Feb-14",
"purchase": "tv",
"category": "electronics",
"amount": "41.18",
"related_items": "dvd player",
"frequently_bought_together": "video game console",
"city": "Dallas",
"state": "TX",
"zip_code": 75001,
"lat_lon": "32.961,-96.838"
},
{
"customer_id": 100191,
"date": "1-Mar-14",
"purchase": "jackets",
"category": "clothing",
"amount": "65.72",
"related_items": "scarfs",
"frequently_bought_together": "winter gloves",
"city": "Chicago",
"state": "IL",
"zip_code": 60605,
"lat_lon": "41.86,-87.619"
},
{
"customer_id": 100189,
"date": "2-Mar-14",
"purchase": "slow cooker",
"category": "appliances",
"amount": "90.89",
"related_items": "serving spoon",
"frequently_bought_together": "tongs",
"city": "Dallas",
"state": "TX",
"zip_code": 75001,
"lat_lon": "32.961,-96.838"
},
{
"customer_id": 100192,
"date": "3-Mar-14",
"purchase": "laptop",
"category": "electronics",
"amount": "46.34",
"related_items": "charger",
"frequently_bought_together": "wireless mouse",
"city": "San Jose",
"state": "CA",
"zip_code": 94560,
"lat_lon": "37.536,-122.034"
},
{
"customer_id": 100132,
"date": "4-Mar-14",
"purchase": null,
"category": "electronics",
"amount": "97.54",
"related_items": "subwoofer",
"frequently_bought_together": "headphones",
"city": "New York City",
"state": "NY",
"zip_code": 10013,
"lat_lon": "40.721,-74.005"
},
{
"customer_id": 100143,
"date": "5-Mar-14",
"purchase": null,
"category": "clothing",
"amount": "52.99",
"related_items": "t-shirt",
"frequently_bought_together": "shortsleeve",
"city": "Phoenix",
"state": "AZ",
"zip_code": 85015,
"lat_lon": "33.507,-112.103"
},
{
"customer_id": 100101,
"date": "6-Mar-14",
"purchase": null,
"category": "appliances",
"amount": "51.46",
"related_items": "mixing spoon",
"frequently_bought_together": "vegetable peeler",
"city": "Los Angeles",
"state": "CA",
"zip_code": 75001,
"lat_lon": "34.09,-118.295"
},
{
"customer_id": 100129,
"date": "7-Mar-14",
"purchase": null,
"category": "clothing",
"amount": "64.7",
"related_items": "belts",
"frequently_bought_together": "wallets",
"city": "Dallas",
"state": "TX",
"zip_code": 75126,
"lat_lon": "32.745,-96.46"
},
{
"customer_id": 100100,
"date": "8-Mar-14",
"purchase": null,
"category": "appliances",
"amount": "98.25",
"related_items": "bowls",
"frequently_bought_together": "forks",
"city": "Chicago",
"state": "IL",
"zip_code": 60610,
"lat_lon": "41.899,-87.637"
},
{
"customer_id": 100154,
"date": "9-Mar-14",
"purchase": "cell phone",
"category": "electronics",
"amount": "21.32",
"related_items": "charger",
"frequently_bought_together": "cell phone case",
"city": "San Antonio",
"state": "TX",
"zip_code": 78206,
"lat_lon": "29.438,-98.462"
},
{
"customer_id": 100168,
"date": "10-Mar-14",
"purchase": "jackets",
"category": "clothing",
"amount": "58.43",
"related_items": "winter gloves",
"frequently_bought_together": "winter gloves",
"city": "San Antonio",
"state": "TX",
"zip_code": 78109,
"lat_lon": "29.502,-98.306"
},
{
"customer_id": 100138,
"date": "11-Mar-14",
"purchase": "soap",
"category": "household",
"amount": "28.74",
"related_items": "shampoo",
"frequently_bought_together": "shampoo",
"city": "San Diego",
"state": "CA",
"zip_code": 91911,
"lat_lon": "32.609,-117.061"
},
{
"customer_id": 100165,
"date": "12-Mar-14",
"purchase": "tools",
"category": "household",
"amount": "61.12",
"related_items": "drills ",
"frequently_bought_together": "drill bits",
"city": "San Diego",
"state": "CA",
"zip_code": 91911,
"lat_lon": "32.609,-117.061"
},
{
"customer_id": 100111,
"date": "13-Mar-14",
"purchase": "slow cooker",
"category": "appliances",
"amount": "82.12",
"related_items": "tupperware",
"frequently_bought_together": "serving spoon",
"city": "Phoenix",
"state": "AZ",
"zip_code": 85019,
"lat_lon": "33.512,-112.142"
},
{
"customer_id": 100124,
"date": "14-Mar-14",
"purchase": "soap",
"category": "household",
"amount": "58.73",
"related_items": "body wash",
"frequently_bought_together": "bar soap",
"city": "Chicago",
"state": "IL",
"zip_code": 60603,
"lat_lon": "41.88,-87.63"
},
{
"customer_id": 100119,
"date": "15-Mar-14",
"purchase": "flower pot",
"category": "outdoor",
"amount": "26.66",
"related_items": "plant food",
"frequently_bought_together": "tomato stakes",
"city": "San Antonio",
"state": "TX",
"zip_code": 78073,
"lat_lon": "29.227,-98.609"
},
{
"customer_id": 100116,
"date": "16-Mar-14",
"purchase": "car wash",
"category": "outdoor",
"amount": "72.4",
"related_items": "sponge",
"frequently_bought_together": "sponge",
"city": "San Diego",
"state": "CA",
"zip_code": 92027,
"lat_lon": "33.143,-117.03"
},
{
"customer_id": 100132,
"date": "17-Mar-14",
"purchase": "cell phone",
"category": "electronics",
"amount": "45.11",
"related_items": "cell phone case",
"frequently_bought_together": "cell phone case",
"city": "New York City",
"state": "NY",
"zip_code": 10013,
"lat_lon": "40.721,-74.005"
},
{
"customer_id": 100106,
"date": "18-Mar-14",
"purchase": "microwave",
"category": "appliances",
"amount": "26.48",
"related_items": "popcorn",
"frequently_bought_together": "forks",
"city": "Dallas",
"state": "TX",
"zip_code": 75063,
"lat_lon": "32.917,-96.973"
},
{
"customer_id": 100103,
"date": "19-Mar-14",
"purchase": "jackets",
"category": "clothing",
"amount": "46.02",
"related_items": "scarfs",
"frequently_bought_together": "scarfs",
"city": "Houston",
"state": "TX",
"zip_code": 77012,
"lat_lon": "29.72,-95.279"
},
{
"customer_id": 100192,
"date": "20-Mar-14",
"purchase": "laptop",
"category": "elect^ronics",
"amount": "33.93",
"related_items": "wireless mouse",
"frequently_bought_together": "headphones",
"city": "San Jose",
"state": "CA",
"zip_code": 94560,
"lat_lon": "37.536,-122.034"
},
{
"customer_id": 100180,
"date": "21-Mar-14",
"purchase": "jackets",
"category": "clothing",
"amount": "48.03",
"related_items": "mittens",
"frequently_bought_together": "winter gloves",
"city": "San Diego",
"state": "CA",
"zip_code": 92027,
"lat_lon": "33.143,-117.03"
},
{
"customer_id": 100165,
"date": "22-Mar-14",
"purchase": "detergent",
"category": "household",
"amount": "81.56",
"related_items": "dryer sheets",
"frequently_bought_together": "fabric softener",
"city": "San Diego",
"state": "CA",
"zip_code": 91911,
"lat_lon": "32.609,-117.061"
},
{
"customer_id": 100116,
"date": "23-Mar-14",
"purchase": "car wash",
"category": "outdoor",
"amount": "23.34",
"related_items": "towels",
"frequently_bought_together": "wax",
"city": "San Diego",
"state": "CA",
"zip_code": 92027,
"lat_lon": "33.143,-117.03"
},
{
"customer_id": 100149,
"date": "24-Mar-14",
"purchase": "cell phone",
"category": "electronics",
"amount": "70.36",
"related_items": "charger",
"frequently_bought_together": "cell phone case",
"city": "Houston",
"state": "TX",
"zip_code": 77009,
"lat_lon": "29.793,-95.367"
},
{
"customer_id": 100118,
"date": "25-Mar-14",
"purchase": "shirts",
"category": "clothing",
"amount": "82.35",
"related_items": "shortsleeve",
"frequently_bought_together": "shortsleeve",
"city": "Philadelphia",
"state": "PA",
"zip_code": 19102,
"lat_lon": "39.953,-75.166"
},
{
"customer_id": 100128,
"date": "26-Mar-14",
"purchase": "detergent",
"category": "household",
"amount": "94.16",
"related_items": "dryer sheets",
"frequently_bought_together": "bleach",
"city": "Phoenix",
"state": "AZ",
"zip_code": 60603,
"lat_lon": "33.507,-112.103"
},
{
"customer_id": 100142,
"date": "27-Mar-14",
"purchase": "detergent",
"category": "household",
"amount": "91.23",
"related_items": "bleach",
"frequently_bought_together": "fabric softener",
"city": "Phoenix",
"state": "AZ",
"zip_code": 85001,
"lat_lon": "33.704,-112.352"
},
{
"customer_id": 100195,
"date": "28-Mar-14",
"purchase": "car wash",
"category": "outdoor",
"amount": "81.17",
"related_items": "buckets",
"frequently_bought_together": "towels",
"city": "Philadelphia",
"state": "PA",
"zip_code": 19102,
"lat_lon": "39.953,-75.166"
},
{
"customer_id": 100123,
"date": "29-Mar-14",
"purchase": "shirts",
"category": "clothing",
"amount": "89.68",
"related_items": "collared shirt",
"frequently_bought_together": "shortsleeve",
"city": "Los Angeles",
"state": "CA",
"zip_code": 90020,
"lat_lon": "34.066,-118.309"
},
{
"customer_id": 100145,
"date": "30-Mar-14",
"purchase": "blender",
"category": "appliances",
"amount": "74.52",
"related_items": "pitcher",
"frequently_bought_together": "straws",
"city": "Chicago",
"state": "IL",
"zip_code": 60603,
"lat_lon": "41.88,-87.63"
},
{
"customer_id": 100199,
"date": "31-Mar-14",
"purchase": null,
"category": "clothing",
"amount": "26.81",
"related_items": "sunglasses",
"frequently_bought_together": "sunglasses",
"city": "Dallas",
"state": "TX",
"zip_code": 75089,
"lat_lon": "32.924,-96.547"
},
{
"customer_id": 100167,
"date": "1-Apr-14",
"purchase": null,
"category": "appliances",
"amount": "58.97",
"related_items": "cookbook",
"frequently_bought_together": "tupperware",
"city": "New York City",
"state": "NY",
"zip_code": 10004,
"lat_lon": "40.699,-74.041"
},
{
"customer_id": 100187,
"date": "2-Apr-14",
"purchase": null,
"category": "outdoor",
"amount": "90.37",
"related_items": "rakes",
"frequently_bought_together": "gardening gloves",
"city": "Chicago",
"state": "IL",
"zip_code": 91911,
"lat_lon": "41.905,-87.625"
},
{
"customer_id": 100156,
"date": "3-Apr-14",
"purchase": null,
"category": "electronics",
"amount": "37.32",
"related_items": "charger",
"frequently_bought_together": "headphones",
"city": "San Diego",
"state": "CA",
"zip_code": 91911,
"lat_lon": "32.609,-117.061"
},
{
"customer_id": 100166,
"date": "4-Apr-14",
"purchase": null,
"category": "electronics",
"amount": "44.31",
"related_items": "surround sound",
"frequently_bought_together": "dvd player",
"city": "Philadelphia",
"state": "PA",
"zip_code": 19115,
"lat_lon": "40.093,-75.041"
},
{
"customer_id": 100124,
"date": "5-Apr-14",
"purchase": null,
"category": "appliances",
"amount": "53.94",
"related_items": "cookbook",
"frequently_bought_together": "tongs",
"city": "Chicago",
"state": "IL",
"zip_code": 60603,
"lat_lon": "41.88,-87.63"
},
{
"customer_id": 100162,
"date": "6-Apr-14",
"purchase": "paper products",
"category": "household",
"amount": "69.28",
"related_items": "paper towels",
"frequently_bought_together": "toilet paper",
"city": "San Jose",
"state": "CA",
"zip_code": 94560,
"lat_lon": "37.536,-122.034"
},
{
"customer_id": 100187,
"date": "7-Apr-14",
"purchase": "audio",
"category": "electronics",
"amount": "65.79",
"related_items": "earbuds",
"frequently_bought_together": "headphones",
"city": "Chicago",
"state": "IL",
"zip_code": 60611,
"lat_lon": "41.905,-87.625"
},
{
"customer_id": 100186,
"date": "8-Apr-14",
"purchase": "laptop",
"category": "electronics",
"amount": "97.6",
"related_items": "charger",
"frequently_bought_together": "charger",
"city": "Houston",
"state": "TX",
"zip_code": 77009,
"lat_lon": "29.793,-95.367"
},
{
"customer_id": 100190,
"date": "9-April-2014",
"purchase": "tv",
"category": "electronics",
"amount": "61.33",
"related_items": "speakers",
"frequently_bought_together": "dvd player",
"city": "Dallas",
"state": "TX",
"zip_code": 75089,
"lat_lon": "32.924,-96.547"
},
{
"customer_id": 100152,
"date": "10-April-2014",
"purchase": "tools",
"category": "household",
"amount": "90.48",
"related_items": "hammers",
"frequently_bought_together": "drills ",
"city": "Dallas",
"state": "TX",
"zip_code": 75126,
"lat_lon": "32.745,-96.46"
},
{
"customer_id": 100103,
"date": "11-April-2014",
"purchase": "slow cooker",
"category": "appliances",
"amount": "86.52",
"related_items": "pot holders",
"frequently_bought_together": "cookbook",
"city": "Houston",
"state": "TX",
"zip_code": 77012,
"lat_lon": "29.72,-95.279"
},
{
"customer_id": 100194,
"date": "12-April-2014",
"purchase": "camera",
"category": "electronics",
"amount": "67.2",
"related_items": "editing software",
"frequently_bought_together": "lens cleaner",
"city": "Philadelphia",
"state": "PA",
"zip_code": 19115,
"lat_lon": "40.093,-75.041"
},
{
"customer_id": 100130,
"date": "13-April-2014",
"purchase": "blender",
"category": "appliances",
"amount": "25.83",
"related_items": "pitcher",
"frequently_bought_together": "fruit",
"city": "San Diego",
"state": "CA",
"zip_code": 92107,
"lat_lon": "32.741,-117.244"
},
{
"customer_id": 100152,
"date": "14-April-2014",
"purchase": "car wash",
"category": "outdoor",
"amount": "50.12",
"related_items": "tire cleaner",
"frequently_bought_together": "wax",
"city": "Dallas",
"state": "TX",
"zip_code": 75126,
"lat_lon": "32.745,-96.46"
},
{
"customer_id": 100140,
"date": "15-April-2014",
"purchase": "shirts",
"category": "clothing",
"amount": "53.09",
"related_items": "t-shirt",
"frequently_bought_together": "button down shirt",
"city": "San Diego",
"state": "CA",
"zip_code": 92037,
"lat_lon": "32.839,-117.262"
}
]
After getting a feel for the data and the different fields, I utilized the Pandas library to import the data into a Pandas dataframe. This is a table-like format in Pandas that makes the data easier to work with for cleaning and reshaping purposes.
import pandas as pd
df = pd.read_json("customer_data_example.json")
df
amount | category | city | customer_id | date | frequently_bought_together | lat_lon | purchase | related_items | state | zip_code | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 24.64 | household | Chicago | 100191 | 2014-01-01 | towels | 41.86,-87.619 | None | towels | IL | 60605 |
1 | 35.00 | clothing | Dallas | 100199 | 2014-01-02 | sandals | 32.924,-96.547 | shorts | belts | TX | 75089 |
2 | 89.72 | outdoor | Philadelphia | 100170 | 2014-01-03 | lawn bags | 40.002,-75.118 | lawn_mower | shovels | PA | 19019 |
3 | 51.32 | electronics | Chicago | 100124 | 2014-01-04 | headphones | 41.88,-87.63 | laptop | headphones | IL | 60603 |
4 | 81.75 | outdoor | Philadelphia | 100173 | 2014-01-05 | sponge | 39.953,-75.166 | car wash | sponge | PA | 19102 |
5 | 29.16 | outdoor | San Diego | 100116 | 2014-01-06 | fertilizer | 33.143,-117.03 | lawn mower | rakes | CA | 92027 |
6 | 50.71 | outdoor | Dallas | 100105 | 2014-01-07 | bbq sauce | 32.745,-96.46 | grill | grill cleaner | TX | 75126 |
7 | 35.03 | household | San Antonio | 100148 | 2014-01-08 | spray bottles | 29.502,-98.306 | household cleaner | spray bottles | TX | 78109 |
8 | 30.55 | appliances | Philadelphia | 100118 | 2014-01-09 | pot holders | 39.953,-75.166 | None | tupperware | PA | 19102 |
9 | 92.01 | electronics | Dallas | 100106 | 2014-01-10 | camera lens | 32.917,-96.973 | camera | lens cleaner | TX | 75126 |
10 | 31.79 | outdoor | New York City | 100109 | 2014-01-11 | gloves | 40.699,-74.041 | snow shovel | boots | NY | 10004 |
11 | 96.87 | clothing | Chicago | 100153 | 2014-01-12 | dress shoes | 41.86,-87.619 | shoes | dress shoes | IL | 60605 |
12 | 41.91 | outdoor | San Diego | 100116 | 2014-01-13 | bbq sauce | 33.143,-117.03 | grill | propane tank | CA | 92027 |
13 | 61.99 | appliances | San Antonio | 100151 | 2014-01-14 | fruit | 29.363,-98.49 | blender | pitcher | TX | 78214 |
14 | 36.11 | clothing | Chicago | 100191 | 2014-01-15 | t shirt | 41.86,-87.619 | shirts | t-shirt | IL | 60605 |
15 | 75.00 | appliances | San Antonio | 100185 | 2014-01-16 | bagels | 29.363,-98.49 | toaster | cream cheese | TX | 78214 |
16 | 87.08 | electronics | Chicago | 100153 | 2014-01-17 | headphones | 41.86,-87.619 | laptop | charger | IL | 60605 |
17 | 41.34 | household | San Antonio | 100151 | 2014-01-18 | fabric softener | 29.363,-98.49 | detergent | fabric softener | TX | 78214 |
18 | 27.37 | clothing | New York City | 100196 | 2014-01-19 | shoe laces | 40.699,-74.041 | shoes | dress shoes | NY | 10004 |
19 | 38.61 | electronics | Chicago | 100124 | 2014-01-20 | surround sound | 41.88,-87.63 | tv | surround sound | IL | 60603 |
20 | 53.80 | household | New York City | 100188 | 2014-01-21 | paper towels | 40.717,-73.987 | n/a | toilet paper | NY | 10002 |
21 | 85.00 | appliances | San Jose | 100192 | 2014-01-22 | butter | 37.536,-122.034 | toaster | bread | CA | 94560 |
22 | 85.92 | house | SD | 100140 | 2014-01-23 | batteries | 32.839,-117.262 | tools | screws | CA | 92037 |
23 | 91.98 | clothing | San Jose | 100159 | 2014-01-24 | sneakers | 37.189,-121.705 | shoes | dress shoes | CA | 95115 |
24 | 64.86 | appliances | San Antonio | 100182 | 2014-01-25 | cookbook | 29.852,-98.729 | slow cooker | tongs | TX | 78006 |
25 | 86.05 | outdoor | Hou | 100158 | 2014-01-26 | rakes | 29.718,-95.428 | lawn mower | gardening gloves | TX | 77005 |
26 | 78.81 | clothing | San Antonio | 100185 | 2014-01-27 | wallet | 29.363,-98.49 | pants | wallet | TX | 78214 |
27 | 78.61 | electronics | Houston | 100103 | 2014-01-28 | headphones | 29.72,-95.279 | audio | headphones | TX | 77012 |
28 | 36.21 | clothing | Philadelphia | 100150 | 2014-01-29 | sunglasses | 40.002,-75.118 | shorts | t-shirts | PA | 19019 |
29 | 66.56 | clothing | Phoenix | 100183 | 2014-01-30 | collared shirt | 33.451,-112.071 | shirts | shortsleeve | AZ | 85004 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
75 | 45.11 | electronics | New York City | 100132 | 2014-03-17 | cell phone case | 40.721,-74.005 | cell phone | cell phone case | NY | 10013 |
76 | 26.48 | appliances | Dallas | 100106 | 2014-03-18 | forks | 32.917,-96.973 | microwave | popcorn | TX | 75063 |
77 | 46.02 | clothing | Houston | 100103 | 2014-03-19 | scarfs | 29.72,-95.279 | jackets | scarfs | TX | 77012 |
78 | 33.93 | elect^ronics | San Jose | 100192 | 2014-03-20 | headphones | 37.536,-122.034 | laptop | wireless mouse | CA | 94560 |
79 | 48.03 | clothing | San Diego | 100180 | 2014-03-21 | winter gloves | 33.143,-117.03 | jackets | mittens | CA | 92027 |
80 | 81.56 | household | San Diego | 100165 | 2014-03-22 | fabric softener | 32.609,-117.061 | detergent | dryer sheets | CA | 91911 |
81 | 23.34 | outdoor | San Diego | 100116 | 2014-03-23 | wax | 33.143,-117.03 | car wash | towels | CA | 92027 |
82 | 70.36 | electronics | Houston | 100149 | 2014-03-24 | cell phone case | 29.793,-95.367 | cell phone | charger | TX | 77009 |
83 | 82.35 | clothing | Philadelphia | 100118 | 2014-03-25 | shortsleeve | 39.953,-75.166 | shirts | shortsleeve | PA | 19102 |
84 | 94.16 | household | Phoenix | 100128 | 2014-03-26 | bleach | 33.507,-112.103 | detergent | dryer sheets | AZ | 60603 |
85 | 91.23 | household | Phoenix | 100142 | 2014-03-27 | fabric softener | 33.704,-112.352 | detergent | bleach | AZ | 85001 |
86 | 81.17 | outdoor | Philadelphia | 100195 | 2014-03-28 | towels | 39.953,-75.166 | car wash | buckets | PA | 19102 |
87 | 89.68 | clothing | Los Angeles | 100123 | 2014-03-29 | shortsleeve | 34.066,-118.309 | shirts | collared shirt | CA | 90020 |
88 | 74.52 | appliances | Chicago | 100145 | 2014-03-30 | straws | 41.88,-87.63 | blender | pitcher | IL | 60603 |
89 | 26.81 | clothing | Dallas | 100199 | 2014-03-31 | sunglasses | 32.924,-96.547 | None | sunglasses | TX | 75089 |
90 | 58.97 | appliances | New York City | 100167 | 2014-04-01 | tupperware | 40.699,-74.041 | None | cookbook | NY | 10004 |
91 | 90.37 | outdoor | Chicago | 100187 | 2014-04-02 | gardening gloves | 41.905,-87.625 | None | rakes | IL | 91911 |
92 | 37.32 | electronics | San Diego | 100156 | 2014-04-03 | headphones | 32.609,-117.061 | None | charger | CA | 91911 |
93 | 44.31 | electronics | Philadelphia | 100166 | 2014-04-04 | dvd player | 40.093,-75.041 | None | surround sound | PA | 19115 |
94 | 53.94 | appliances | Chicago | 100124 | 2014-04-05 | tongs | 41.88,-87.63 | None | cookbook | IL | 60603 |
95 | 69.28 | household | San Jose | 100162 | 2014-04-06 | toilet paper | 37.536,-122.034 | paper products | paper towels | CA | 94560 |
96 | 65.79 | electronics | Chicago | 100187 | 2014-04-07 | headphones | 41.905,-87.625 | audio | earbuds | IL | 60611 |
97 | 97.60 | electronics | Houston | 100186 | 2014-04-08 | charger | 29.793,-95.367 | laptop | charger | TX | 77009 |
98 | 61.33 | electronics | Dallas | 100190 | 2014-04-09 | dvd player | 32.924,-96.547 | tv | speakers | TX | 75089 |
99 | 90.48 | household | Dallas | 100152 | 2014-04-10 | drills | 32.745,-96.46 | tools | hammers | TX | 75126 |
100 | 86.52 | appliances | Houston | 100103 | 2014-04-11 | cookbook | 29.72,-95.279 | slow cooker | pot holders | TX | 77012 |
101 | 67.20 | electronics | Philadelphia | 100194 | 2014-04-12 | lens cleaner | 40.093,-75.041 | camera | editing software | PA | 19115 |
102 | 25.83 | appliances | San Diego | 100130 | 2014-04-13 | fruit | 32.741,-117.244 | blender | pitcher | CA | 92107 |
103 | 50.12 | outdoor | Dallas | 100152 | 2014-04-14 | wax | 32.745,-96.46 | car wash | tire cleaner | TX | 75126 |
104 | 53.09 | clothing | San Diego | 100140 | 2014-04-15 | button down shirt | 32.839,-117.262 | shirts | t-shirt | CA | 92037 |
105 rows × 11 columns
I also took a look at the data types for each field.
df.dtypes
amount float64
category object
city object
customer_id int64
date datetime64[ns]
frequently_bought_together object
lat_lon object
purchase object
related_items object
state object
zip_code int64
dtype: object
After importing the JSON data, I imported the XML data. I did this by first parsing the XML data to collect the tags and elements to store in a dictionary. The dictionary was then converted to a dataframe.
import xml.etree.ElementTree as ET
tree = ET.parse('location_data.xml')
root = tree.getroot()
# root tag
print(root.tag)
# child tag
print(root[0].tag)
# number of children elements
num_children = len(root.getchildren())
print(num_children)
# number of subchildren elements
num_subchildren = len(root[0].getchildren())
print(num_subchildren)
# pulling out all of the subchildren tags
tags = []
for subchild in root[0]:
tags.append(subchild.tag)
print(tags)
# creating an empty dictionary to store the data
d = {}
for tag in tags:
d[tag] = []
print(d)
# pulling out all of the data
for i in range(0, num_children):
for j in range(0, num_subchildren):
value = root[i][j].text
d[tags[j]].append(value)
#print(d)
# converting to a dataframe
df = pd.DataFrame(data=d)
print(df)
data-set
record
50
3
['City', 'Zipcode', 'Latitude_Longitude']
{'City': [], 'Zipcode': [], 'Latitude_Longitude': []}
City Zipcode Latitude_Longitude
0 New York City 10012 40.726,-73.998
1 New York City 10013 40.721,-74.005
2 New York City 10004 40.699,-74.041
3 New York City 10128 40.782,-73.95
4 New York City 10002 40.717,-73.987
5 Los Angeles 90001 33.973,-118.249
6 Los Angeles 90016 34.03,-118.353
7 Los Angeles 90008 34.01,-118.337
8 Los Angeles 90020 34.066,-118.309
9 Los Angeles 90029 34.09,-118.295
10 Chicago 60610 41.899,-87.637
11 Chicago 60611 41.905,-87.625
12 Chicago 60605 41.86,-87.619
13 Chicago 60602 41.883,-87.629
14 Chicago 60603 41.88,-87.63
15 Houston 77001 29.813,-95.31
16 Houston 77005 29.718,-95.428
17 Houston 77009 29.793,-95.367
18 Houston 77004 29.729,-95.366
19 Houston 77012 29.72,-95.279
20 Philadelphia 19019 40.002,-75.118
21 Philadelphia 19102 39.953,-75.166
22 Philadelphia 19110 39.95,-75.164
23 Philadelphia 19115 40.093,-75.041
24 Philadelphia 19118 40.072,-75.208
25 Phoenix 85001 33.704,-112.352
26 Phoenix 85004 33.451,-112.071
27 Phoenix 85015 33.507,-112.103
28 Phoenix 85019 33.512,-112.142
29 Phoenix 85027 33.699,-112.114
30 San Antonio 78006 29.852,-98.729
31 San Antonio 78109 29.502,-98.306
32 San Antonio 78206 29.438,-98.462
33 San Antonio 78214 29.363,-98.49
34 San Antonio 78073 29.227,-98.609
35 San Diego 91911 32.609,-117.061
36 San Diego 92037 32.839,-117.262
37 San Diego 92102 32.715,-117.125
38 San Diego 92107 32.741,-117.244
39 San Diego 92027 33.143,-117.03
40 Dallas 75001 32.961,-96.838
41 Dallas 75043 32.855,-96.602
42 Dallas 75063 32.917,-96.973
43 Dallas 75089 32.924,-96.547
44 Dallas 75126 32.745,-96.46
45 San Jose 94088 37.189,-121.705
46 San Jose 95002 37.427,-121.975
47 San Jose 95103 37.189,-121.705
48 San Jose 95115 37.189,-121.705
49 San Jose 94560 37.536,-122.034
With the JSON and XML data in dataframe format, I could more easily clean and reshape the data.
2. Cleaning the Data
After importing the data I worked through the data cleaning plan that I outlined above. This four step process involved the following steps:
- Missing Values
- Correct Data Types
- Consistent Data
- Accurate Data
Each feature was investigated and cleaned for missing values, correct data types, and consistent data. The only data that needed to be checked for accuracy was the “zip_code” data. You can find the full data cleaning notebook for each feature on my github page *Accurate Data.
As an example of my data cleaning process I’ll go through detecting and cleaning missing values for the “purchase” column.
After importing the JSON data in to a pandas dataframe, I summed up the amount of missing values.
# summing the missing values
print(sum(df["purchase"].isna()))
14
Taking a look I could see that there was 14 missing values detected by Pandas. These are default missing values that Pandas detected right away. I was also curious in looking for other potential missing values, so I used the “unique()” method.
# looking for non-standard missing values
print(df["purchase"].unique())
[None 'shorts' 'lawn_mower' 'laptop' 'car wash' 'lawn mower' 'grill'
'household cleaner' 'camera' 'snow shovel' 'shoes' 'blender' 'shirts'
'toaster' 'detergent' 'tv' 'n/a' 'tools' 'slow cooker' 'pants' 'audio'
'microwave' 'food processor' 'soap' '--' 'jackets' 'cell phone' 'N/A'
'paper products' 'flower pot']
There’s three unique values that look like missing values. Pandas did not detect these. They are the following:
- ‘n/a’
- ’–’
- ‘N/A’
To clean these values I put them in to an array called “missing_values”, and then looped through the “purchase” column to look for those values, and then change them to a new class called “unavailable”.
missing_values = ["n/a", "--", "N/A"]
cnt = 0
for i in df["purchase"]:
if i in missing_values:
df.loc[cnt, "purchase"] = "unavailable"
cnt+=1
print(df["purchase"].unique())
[None 'shorts' 'lawn_mower' 'laptop' 'car wash' 'lawn mower' 'grill'
'household cleaner' 'camera' 'snow shovel' 'shoes' 'blender' 'shirts'
'toaster' 'detergent' 'tv' 'unavailable' 'tools' 'slow cooker' 'pants'
'audio' 'microwave' 'food processor' 'soap' 'jackets' 'cell phone'
'paper products' 'flower pot']
Taking a look, all three of the non-standard missing values were replaced with the new class label, “unavailable”.
Finally, I finished up by replacing the standard missing value types using the “fillna()” method.
df["purchase"].fillna("unavailable", inplace=True)
print(sum(df["purchase"].isna()))
print(df["purchase"].unique())
0
['unavailable' 'shorts' 'lawn_mower' 'laptop' 'car wash' 'lawn mower'
'grill' 'household cleaner' 'camera' 'snow shovel' 'shoes' 'blender'
'shirts' 'toaster' 'detergent' 'tv' 'tools' 'slow cooker' 'pants' 'audio'
'microwave' 'food processor' 'soap' 'jackets' 'cell phone'
'paper products' 'flower pot']
Taking a final look at the “purchase” column using the “sum()” and “unique()” methods confirms that all missing values have been detected and changed to “unavailable”.
Sometimes missing data can provide additional information, so rather than throwing it out I decided to label all of it and keep it. I would consult with my team and the marketing group before disposing of it. The missing data could convey valuable information about customer experience, so it’s better to keep that data for now.
For the full data cleaning notebook, please refer to my github page [provide link to your github page here].
3. Reshaping the Data
The final part is reshaping the data. Talk about how you used “pivot_table” to display the data in the correct format. This is the format requested by the hypothetical sales department in this project.
After importing and cleaning the data, the final step was to reshape the data.
During the cleaning portion the cleaned dataset was exported as a clean CSV, so I’m going to import and reshape that clean dataset.
I’ll start by importing the clean dataset, “customer_data_cleaned.csv”.
df = pd.read_csv("customer_data_cleaned.csv")
df.head()
Unnamed: 0 | amount | category | city | customer_id | date | frequently_bought_together | lat_lon | purchase | related_items | state | zip_code | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 24.64 | household | Chicago | 100191 | 2014-01-01 | towels | 41.86,-87.619 | NaN | towels | IL | 60605 |
1 | 1 | 35.00 | clothing | Dallas | 100199 | 2014-01-02 | sandals | 32.924,-96.547 | shorts | belts | TX | 75089 |
2 | 2 | 89.72 | outdoor | Philadelphia | 100170 | 2014-01-03 | lawn bags | 40.002,-75.118 | lawn_mower | shovels | PA | 19019 |
3 | 3 | 51.32 | electronics | Chicago | 100124 | 2014-01-04 | headphones | 41.88,-87.63 | laptop | headphones | IL | 60603 |
4 | 4 | 81.75 | outdoor | Philadelphia | 100173 | 2014-01-05 | sponge | 39.953,-75.166 | car wash | sponge | PA | 19102 |
The marketing department is interested in lookin at how much each customer spends in each category.
To make it easier for them to look at this data, I’ll subset my full dataframe, and reshape it using the “pivot_table” method. Since they’re probably interested in total spending in each category, I’ll use the “sum” method from Numpy as my aggregation function.
# pivot table; aggregation function "sum"
import numpy as np
df_subset = df[["customer_id", "category", "amount"]]
#print(df_subset)
df_pivot = df_subset.pivot_table(index="customer_id", columns="category", values="amount", aggfunc=np.sum)
print(df_pivot)
category appliances clothing electronics household outdoor
customer_id
100100 98.25 NaN NaN NaN NaN
100101 51.46 NaN NaN NaN NaN
100102 NaN NaN NaN 70.66 NaN
100103 86.52 46.02 78.61 NaN NaN
100105 NaN NaN NaN NaN 50.71
100106 26.48 NaN 183.88 NaN NaN
100107 NaN NaN NaN 90.84 NaN
100108 NaN 90.84 NaN NaN NaN
100109 NaN NaN NaN NaN 31.79
100111 82.12 NaN NaN NaN 77.28
100115 NaN 99.06 NaN NaN NaN
100116 NaN NaN NaN NaN 166.81
100118 30.55 82.35 NaN NaN NaN
100119 NaN NaN NaN NaN 26.66
100120 NaN 86.29 41.18 NaN NaN
100123 34.57 89.68 NaN NaN NaN
100124 53.94 NaN 89.93 58.73 NaN
100128 NaN NaN NaN 94.16 NaN
100129 NaN 64.70 NaN NaN NaN
100130 25.83 NaN NaN NaN NaN
100131 NaN NaN NaN NaN 49.41
100132 NaN NaN 142.65 NaN NaN
100133 NaN 23.69 NaN NaN 49.57
100136 NaN 79.43 NaN NaN NaN
100138 NaN NaN NaN 28.74 NaN
100140 85.23 53.09 NaN 85.92 NaN
100142 NaN NaN NaN 91.23 59.02
100143 NaN 52.99 NaN NaN NaN
100145 74.52 NaN NaN NaN NaN
100148 NaN NaN NaN 35.03 NaN
... ... ... ... ... ...
100153 NaN 96.87 87.08 NaN NaN
100154 NaN NaN 21.32 NaN NaN
100156 NaN NaN 37.32 NaN NaN
100158 NaN NaN NaN NaN 171.74
100159 NaN 91.98 NaN NaN NaN
100160 NaN NaN 89.36 NaN NaN
100162 70.18 76.08 NaN 69.28 NaN
100165 NaN NaN NaN 142.68 NaN
100166 NaN NaN 44.31 NaN NaN
100167 91.86 NaN NaN 76.69 NaN
100168 NaN 58.43 NaN NaN NaN
100170 NaN NaN NaN NaN 89.72
100173 NaN NaN NaN NaN 81.75
100180 NaN 48.03 NaN NaN NaN
100182 64.86 NaN NaN NaN NaN
100183 NaN 66.56 NaN NaN NaN
100185 75.00 78.81 54.64 NaN NaN
100186 63.54 NaN 97.60 NaN NaN
100187 NaN NaN 65.79 NaN 90.37
100188 NaN NaN NaN 53.80 95.85
100189 90.89 NaN NaN NaN NaN
100190 NaN NaN 61.33 NaN NaN
100191 NaN 159.16 NaN 24.64 NaN
100192 85.00 78.11 80.27 NaN NaN
100193 97.59 NaN NaN NaN NaN
100194 NaN 77.83 67.20 NaN NaN
100195 NaN NaN NaN NaN 81.17
100196 NaN 27.37 NaN NaN NaN
100198 NaN 76.43 NaN NaN NaN
100199 NaN 61.81 NaN NaN NaN
[64 rows x 5 columns]
Taking a look, we now have a more neatly shaped dataframe for the marketing department. The total amount that each customer has spent in each category is shown. A lot of customers didn’t spend any money in certain categories, which is represented by “NaN”.
Conclusion
In this project I cleaned a messy JSON dataset. I also worked with XML data, and delivered my cleaned dataset in CSV format.
I worked through a data cleaning plan that involved investigating and cleaning the dataset for missing values, incorrect data types, inconsistent data, and data inaccuracies.
Finally, the data was reshaped to more concisely organize the data in the format requested by the marketing department. This reshaped format would make it easier for them to investigate spending trends of each customer based on the various categories.
It’s important to investigate and clean data because dirty data can lead to errors during exploratory analysis and machine learning.
Used Tools:
- Totorial to build a Python Data Wrangling Project and build a new website
- Git
- GitHub
- Atom
- Jupyter Notebook
- Python Coding
- Anaconda
- 04-2020/ Nick Scherer.