SF Fire Data Analysis With Databricks And Spark V2
Let's dive into analyzing San Francisco Fire Department (SF Fire) data using Databricks, Spark v2, and the sf-fire-calls.csv dataset. This walkthrough guides you through setting up your environment, loading the data, performing exploratory data analysis (EDA), and answering some practical questions about fire incidents in San Francisco.
Setting Up Your Databricks Environment
First things first, you need a Databricks environment. If you don't have one already, head over to the Databricks website and sign up for a community edition or a trial. Once you're in, create a new notebook. Make sure your cluster is up and running; you'll need it to execute your Spark code. Choose a cluster configuration that suits your needs. For smaller datasets like sf-fire-calls.csv, the default configuration should be sufficient. However, for larger datasets, you might want to consider increasing the memory and the number of cores to speed up processing. Always monitor your cluster's performance to optimize costs and efficiency.
Once your cluster is ready, you can start importing necessary libraries. For this analysis, you'll primarily use Spark SQL functions. These functions allow you to interact with your data using SQL-like syntax, which is super handy for data manipulation and querying. You might also want to import other libraries like matplotlib or seaborn for visualization, but let’s keep it simple for now. The key is to ensure that your environment is correctly set up and that your Spark cluster is ready to crunch some numbers. Remember to test your setup by running a simple Spark command, like reading a small CSV file, to confirm everything is working as expected. This initial step is crucial to avoid headaches down the line.
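As a quick sanity check, a minimal sketch like the following should run without errors in a Databricks notebook (it builds a tiny DataFrame in memory, so no file path is needed yet; reading a small CSV you already have in DBFS works just as well):
# Confirm the notebook is attached to a running cluster and Spark responds.
print(spark.version)      # version of the Spark runtime backing this notebook
spark.range(5).show()     # tiny in-memory DataFrame; no files required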
Loading the SF Fire Calls CSV Dataset
Now, let's load the sf-fire-calls.csv dataset into your Databricks notebook. You can upload the CSV file directly to the Databricks File System (DBFS) or access it from an external storage location like AWS S3 or Azure Blob Storage. For simplicity, let's assume you've uploaded the file to DBFS. To read the CSV file into a Spark DataFrame, use the following code:
from pyspark.sql.types import *
from pyspark.sql.functions import *
fire_schema = StructType([
StructField('CallNumber', IntegerType(), True),
StructField('UnitID', StringType(), True),
StructField('IncidentNumber', IntegerType(), True),
StructField('CallType', StringType(), True),
StructField('CallDate', StringType(), True),
StructField('WatchDate', StringType(), True),
StructField('CallFinalDisposition', StringType(), True),
StructField('AvailableDtTm', StringType(), True),
StructField('Address', StringType(), True),
StructField('City', StringType(), True),
StructField('Zipcode', IntegerType(), True),
StructField('Battalion', StringType(), True),
StructField('StationArea', StringType(), True),
StructField('Box', StringType(), True),
StructField('OriginalPriority', StringType(), True),
StructField('Priority', StringType(), True),
StructField('FinalPriority', IntegerType(), True),
StructField('ALSUnit', BooleanType(), True),
StructField('CallTypeGroup', StringType(), True),
StructField('NumAlarms', IntegerType(), True),
StructField('UnitType', StringType(), True),
StructField('FirstUnitOnScene', StringType(), True),
StructField('EstimatedOnSceneDtTm', StringType(), True),
StructField('Location', StringType(), True),
StructField('RowID', StringType(), True),
StructField('Delay', FloatType(), True)
])
sf_fire_file = "/FileStore/tables/sf_fire_calls.csv" # Replace with your file path
sf_fire_df = spark.read.csv(sf_fire_file, schema=fire_schema, header=True)
Make sure to replace "/FileStore/tables/sf_fire_calls.csv" with the actual path to your file in DBFS. The header=True option tells Spark that the first row of the CSV file contains column names; since we supply an explicit schema, Spark simply skips that row rather than inferring types from it. After running this code, you should have a Spark DataFrame named sf_fire_df containing the fire call data. You can then use the .show() method to display the first few rows and get a feel for the data. This step is essential to confirm that the data loads correctly and that the schema you defined is applied as expected.
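For example, a quick peek might look like this (the cache() call is optional and only worthwhile if you plan to run several queries against the same DataFrame):
sf_fire_df.show(5, truncate=False)  # first five rows, without truncating long column values
sf_fire_df.cache()                  # optional: keep the data in memory for repeated queries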
Exploratory Data Analysis (EDA) with Spark
Now that you have the data loaded into a Spark DataFrame, it's time to perform some exploratory data analysis (EDA). EDA involves exploring the data to understand its structure, identify patterns, and uncover insights. Let's start with some basic operations:
Displaying the Schema
Use the .printSchema() method to display the schema of the DataFrame. This will show you the column names and their data types. Understanding the schema is crucial for writing effective queries and transformations.
sf_fire_df.printSchema()
Basic Statistics
Use the .describe() method to compute summary statistics for numeric columns, such as count, mean, standard deviation, min, and max. This can give you a quick overview of the distribution of values in each column.
sf_fire_df.describe().show()
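By default .describe() runs over every column, including strings, so the output can get noisy. Restricting it to the numeric columns you care about keeps things readable; here is a small variation using the NumAlarms and Delay columns from the schema above:
sf_fire_df.describe("NumAlarms", "Delay").show()  # summary stats for two numeric columns only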
Counting Records
Use the .count() method to find out how many records are in the DataFrame. This gives you a sense of the size of your dataset.
sf_fire_df.count()
Selecting and Filtering Data
Use the .select() method to select specific columns from the DataFrame. For example, to select the CallType and Address columns, you can use:
sf_fire_df.select("CallType", "Address").show()
Use the .filter() method to filter the DataFrame based on a condition. For example, to filter the DataFrame to only include fire calls with a CallType of "Medical Incident", you can use:
sf_fire_df.filter(sf_fire_df["CallType"] == "Medical Incident").show()
Grouping and Aggregating Data
Use the .groupBy() method to group the data by one or more columns. For example, to group the data by CallType and count the number of calls for each type, you can use:
sf_fire_df.groupBy("CallType").count().show()
These are just a few examples of the types of EDA you can perform with Spark. By exploring the data in this way, you can gain a better understanding of the fire incidents in San Francisco and identify areas for further investigation. Remember, EDA is an iterative process, so don't be afraid to experiment with different operations and visualizations to uncover hidden insights.
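For instance, a couple of extra probes you might run at this stage (sketches only, using columns from the schema defined earlier):
sf_fire_df.select("CallType").distinct().count()          # how many distinct call types appear
sf_fire_df.filter(sf_fire_df["CallType"].isNull()).count()  # rows with a missing call type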
Answering Questions About SF Fire Incidents
Now that we've loaded the data and performed some EDA, let's use Spark to answer some interesting questions about fire incidents in San Francisco.
What are the most common types of fire calls?
To answer this question, we can group the data by CallType and count the number of calls for each type, as shown in the EDA section.
sf_fire_df.groupBy("CallType").count().orderBy(desc("count")).show(10)
This will show you the top 10 most common types of fire calls in San Francisco. Medical Incident typically tops the list by a wide margin, ahead of the other emergency categories.
Which zip codes have the most fire incidents?
To answer this question, we can group the data by Zipcode and count the number of calls for each zip code.
sf_fire_df.groupBy("Zipcode").count().orderBy(desc("count")).show(10)
This will show you the top 10 zip codes with the most fire incidents. Understanding which areas have the highest incident rates can help fire departments allocate resources more effectively.
What is the average response time for fire calls?
To answer this question, we first need to convert the AvailableDtTm and CallDate columns to timestamps; the schema above deliberately reads them in as StringType(), so they can't be used in time arithmetic as-is. Keep in mind that CallDate carries only a date (no time of day), so the difference between these two columns is a rough approximation of elapsed time rather than a precise response time; the dataset also ships a precomputed Delay column that captures response delay directly, which you may prefer for a cleaner measure. With that caveat, here is the conversion and calculation:
from pyspark.sql.functions import to_timestamp, unix_timestamp, abs, col
# CallDate holds only a date (MM/dd/yyyy); AvailableDtTm includes a time with an AM/PM marker ("a" in Spark's pattern syntax).
sf_fire_df = sf_fire_df.withColumn("CallTimestamp", to_timestamp(col("CallDate"), "MM/dd/yyyy"))
sf_fire_df = sf_fire_df.withColumn("AvailableTimestamp", to_timestamp(col("AvailableDtTm"), "MM/dd/yyyy hh:mm:ss a"))
sf_fire_df = sf_fire_df.withColumn("ResponseTimeSeconds", abs(unix_timestamp(col("AvailableTimestamp")) - unix_timestamp(col("CallTimestamp"))))
sf_fire_df.select("CallDate", "AvailableDtTm", "ResponseTimeSeconds").show(5)
sf_fire_df.agg({"ResponseTimeSeconds": "avg"}).show()
This will show you the average response time for fire calls in San Francisco. Analyzing response times can help identify areas where the fire department can improve its efficiency. Make sure to handle any potential data quality issues, such as missing or invalid timestamps, to ensure accurate results. This analysis is critical for optimizing emergency response strategies.
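One way to guard against rows where either timestamp failed to parse is to drop the nulls before aggregating; this is a sketch of that idea, not the only approach:
from pyspark.sql.functions import avg, col
clean_df = sf_fire_df.filter(col("CallTimestamp").isNotNull() & col("AvailableTimestamp").isNotNull())
clean_df.agg(avg("ResponseTimeSeconds").alias("AvgResponseTimeSeconds")).show()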
Analyzing Specific Call Types
Let's explore the characteristics of specific call types, such as "Structure Fire". We can filter the data to only include structure fires and then analyze various aspects of these incidents.
structure_fires_df = sf_fire_df.filter(sf_fire_df["CallType"] == "Structure Fire")
structure_fires_df.count()
structure_fires_df.groupBy("Zipcode").count().orderBy(desc("count")).show(10)
This will show you the number of structure fires in the dataset and the top 10 zip codes with the most structure fires. You can further analyze structure fires by looking at other factors, such as the time of day they occur, the units that respond, and the final disposition of the calls. This level of granularity is essential for developing targeted prevention and response strategies.
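As a sketch of that follow-up analysis, the snippet below breaks structure fires down by how the calls were closed out and by which unit types responded (both columns come from the schema defined earlier; a time-of-day breakdown would need the AvailableTimestamp column derived above, since CallDate carries no time component):
structure_fires_df.groupBy("CallFinalDisposition").count().orderBy(desc("count")).show(10)
structure_fires_df.groupBy("UnitType").count().orderBy(desc("count")).show(10)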
Conclusion
Analyzing the SF Fire Calls dataset with Databricks and Spark v2 provides valuable insights into fire incidents in San Francisco. By setting up your environment, loading the data, performing EDA, and answering specific questions, you can gain a better understanding of the patterns and trends in the data. This knowledge can be used to improve emergency response times, allocate resources more effectively, and ultimately make the city safer. Keep exploring, experimenting, and asking questions – the possibilities are endless! This exercise demonstrates the immense power of big data analytics in solving real-world problems.