Predicting Wildfires - Part 1

30 January, 2018

Wrapping up my time at Mapillary in mid January, I decided that instead of diving directly into a job hunt, I would take a bit of time to hone existing skills and develop some new ones. Ever since extracting and visualizing sentiment data from New York Times comments as an undergraduate in design school, I've been attracted to data analysis as a means to gain and share a greater understanding of the world. With some time on my hands and a passion for writing software in Python, I decided to dive in head first. This is the first in a series of blog posts where I'll catalog my self-education, share some of what I build, and hopefully provide an opportunity for others with an interest to learn alongside me.

Self-Education

Thus far, my education has been self-driven on Kaggle - there's a ton of knowledge there! While my experience diving into the domain has been pretty humbling (pretty much everyone around me knows so much more), it's also been enlightening and energizing - the possibilities are boundless.

Thinking about next steps, I plan to:

  • complete an end-to-end machine learning project and publish my progress on this blog and on Kaggle
  • complete fast.ai Deep Learning Part 1
  • build a deep learning computer
  • compete in at least one Kaggle competition, hopefully joining a team to boost my learning

E2E: Predicting the Cause of Wildfires in the United States

Surveying the datasets available to work with on Kaggle, I was immediately attracted to a dataset describing 1.88M US wildfires over 24 years, originally published by the US Forest Service. Between the geospatial component, high topical relevance, and a personal interest in the subject, I decided to focus my first end-to-end project on predicting the cause of a fire given the information available when the fire began. Since the dataset itself carries limited information, it's highly likely that integrating additional data, such as historical weather or land use, will be essential to building a strong model.

This first blog post covers the initial steps: preparing an environment and loading the data so that we can begin working with it.

[Image: wildfire on a mountain]

Changes

Since the project is still moving, I expect the content here to change - stay tuned.

2018/1/30 - initial post
2018/2/9 - added some new imports used later in the notebook; addressed a warning that I had incorrectly suppressed
2018/2/13 - updated the configuration cell to add some style to our plots

Some notes on Notebooks/Jupyter/Kaggle

This post, and the work behind it, was completed in a Jupyter notebook running in a Docker container. While my process has been roughly cataloged here, the container has been modified a bit throughout the project to update existing tools and provide new ones. The Dockerfile and associated notebooks can be found on GitHub, and I'll write a blog post about the setup sooner or later. I also suspect that the code, verbatim, won't run on Kaggle - there are some geo-related libraries that would likely need to be installed.

Prepare Environment

The first thing that we need to do is prepare our working environment by importing libraries and performing some configuration incantations.

Load Libraries

# Standard library
import functools
import itertools
import math
import os
import pprint
import sqlite3
import subprocess

# Third party
from geoalchemy2 import Geometry, WKTElement
import geopandas as gpd
import graphviz
from IPython.core.display import HTML
import matplotlib as mpl
from matplotlib import pyplot as plt
import numpy as np
import palettable
import pandas as pd
import pandas.plotting as pdplot  # pandas.tools.plotting is deprecated
import pyproj
import rasterio
import rasterio.mask
import rasterio.plot
from rasterstats import zonal_stats
import seaborn as sns
import shapely
from shapely.ops import transform
import sklearn
from sklearn import model_selection
from sqlalchemy import create_engine
from sqlalchemy.types import String, JSON
from tqdm import tqdm, tqdm_notebook

Configure

%matplotlib inline

qual_colormap = palettable.matplotlib.Inferno_20
quant_colormap = palettable.matplotlib.Inferno_20_r

plt.style.use('ggplot')

plt.rcParams['figure.figsize'] = (15, 15)
plt.rcParams['agg.path.chunksize'] = 100000
plt.rcParams['font.family'] = "Rubik"
plt.rcParams["font.weight"] = "300"
plt.rcParams["font.size"] = "14"
plt.rcParams["axes.labelsize"] = "14"
plt.rcParams["ytick.labelsize"] = "12"
plt.rcParams["xtick.labelsize"] = "12"

sns.set_palette(qual_colormap.mpl_colors)

Load Data

Now that we have our libraries in place, we'll load our data. Since it ships as a SQLite database, we can use pandas' read_sql_query to build a dataframe. We're not yet acquainted with the contents, so we'll select all of the provided columns and take a look.

input_filename = '/data/188-million-us-wildfires/src/FPA_FOD_20170508.sqlite'
conn = sqlite3.connect(input_filename)

query = '''
    SELECT
        NWCG_REPORTING_AGENCY,
        NWCG_REPORTING_UNIT_ID,
        NWCG_REPORTING_UNIT_NAME,
        FIRE_NAME,
        COMPLEX_NAME,
        FIRE_YEAR,
        DISCOVERY_DATE,
        DISCOVERY_DOY,
        DISCOVERY_TIME,
        STAT_CAUSE_CODE,
        STAT_CAUSE_DESCR,
        CONT_DATE,
        CONT_DOY,
        CONT_TIME,
        FIRE_SIZE,
        FIRE_SIZE_CLASS,
        LATITUDE,
        LONGITUDE,
        OWNER_CODE,
        OWNER_DESCR,
        STATE,
        COUNTY
    FROM
        Fires;
'''

raw_df = pd.read_sql_query(query, conn)
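
Since read_sql_query materializes the full result set into the dataframe, we can close the connection once the query returns - a small housekeeping step that frees the file handle:

conn.close()  # raw_df is fully loaded in memory; the SQLite handle can be released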

Review Raw Data

Now that our data is loaded, let's give it a very high level look and start to develop an understanding of what we're working with.

Info

Let's have a look at our column names and the type of data in each column.

raw_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1880465 entries, 0 to 1880464
Data columns (total 22 columns):
NWCG_REPORTING_AGENCY       object
NWCG_REPORTING_UNIT_ID      object
NWCG_REPORTING_UNIT_NAME    object
FIRE_NAME                   object
COMPLEX_NAME                object
FIRE_YEAR                   int64
DISCOVERY_DATE              float64
DISCOVERY_DOY               int64
DISCOVERY_TIME              object
STAT_CAUSE_CODE             float64
STAT_CAUSE_DESCR            object
CONT_DATE                   float64
CONT_DOY                    float64
CONT_TIME                   object
FIRE_SIZE                   float64
FIRE_SIZE_CLASS             object
LATITUDE                    float64
LONGITUDE                   float64
OWNER_CODE                  float64
OWNER_DESCR                 object
STATE                       object
COUNTY                      object
dtypes: float64(8), int64(2), object(12)
memory usage: 315.6+ MB

Missing Values

Let's see how many values are missing in each column.

raw_df.isna().sum()
NWCG_REPORTING_AGENCY             0
NWCG_REPORTING_UNIT_ID            0
NWCG_REPORTING_UNIT_NAME          0
FIRE_NAME                    957189
COMPLEX_NAME                1875282
FIRE_YEAR                         0
DISCOVERY_DATE                    0
DISCOVERY_DOY                     0
DISCOVERY_TIME               882638
STAT_CAUSE_CODE                   0
STAT_CAUSE_DESCR                  0
CONT_DATE                    891531
CONT_DOY                     891531
CONT_TIME                    972173
FIRE_SIZE                         0
FIRE_SIZE_CLASS                   0
LATITUDE                          0
LONGITUDE                         0
OWNER_CODE                        0
OWNER_DESCR                       0
STATE                             0
COUNTY                       678148
dtype: int64
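
Raw counts are one thing, but at 1.88M rows percentages are easier to reason about. A quick sketch using nothing beyond standard pandas:

# Share of rows missing a value, per column, shown only where gaps exist.
missing_pct = (raw_df.isna().mean() * 100).round(1)
missing_pct[missing_pct > 0].sort_values(ascending=False)

COUNTY comes out to roughly 36% missing, and DISCOVERY_TIME and the containment columns to roughly half, which will inform the column choices below.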

Sample

Let's look at some sample data.

raw_df.sample(10)
NWCG_REPORTING_AGENCY NWCG_REPORTING_UNIT_ID NWCG_REPORTING_UNIT_NAME FIRE_NAME COMPLEX_NAME FIRE_YEAR DISCOVERY_DATE DISCOVERY_DOY DISCOVERY_TIME STAT_CAUSE_CODE ... CONT_DOY CONT_TIME FIRE_SIZE FIRE_SIZE_CLASS LATITUDE LONGITUDE OWNER_CODE OWNER_DESCR STATE COUNTY
1036316 ST/C&L USNCNCS North Carolina Forest Service YOU GO TO THIS ONE None 2000 2451852.5 309 None 8.0 ... NaN None 0.20 A 34.816700 -79.383300 14.0 MISSING/NOT SPECIFIED NC None
1144181 ST/C&L USCASLU San Luis Obispo Unit RANGE None 1993 2449220.5 233 None 2.0 ... NaN None 4.00 B 35.331111 -120.710000 14.0 MISSING/NOT SPECIFIED CA None
704956 ST/C&L USMSMSS Mississippi Forestry Commission None None 1995 2449770.5 53 1349 5.0 ... 53.0 1521 6.00 B 34.911615 -88.970655 14.0 MISSING/NOT SPECIFIED MS Tippah
507237 ST/C&L USMNMNS Minnesota Department of Natural Resources None None 2001 2452028.5 119 None 2.0 ... NaN None 4.00 B 46.123636 -94.769860 14.0 MISSING/NOT SPECIFIED MN Todd
473711 ST/C&L USLALAS Louisiana Office of Forestry None None 2005 2453507.5 137 None 1.0 ... NaN None 4.00 B 31.504505 -92.512665 14.0 MISSING/NOT SPECIFIED LA Grant
1573233 NPS USCODSP Dinosaur National Monument WAPITI None 2012 2456150.5 224 1501 1.0 ... 227.0 1415 104.00 D 40.399060 -108.306250 8.0 PRIVATE CO None
1868546 ST/C&L USHICNTY Hawaii Counties None None 2012 2455927.5 1 None 13.0 ... NaN None 0.01 A 20.862314 -156.476700 14.0 MISSING/NOT SPECIFIED HI Mauii
1195322 ST/C&L USNYNYX Fire Department of New York None None 2010 2455356.5 160 None 9.0 ... NaN None 0.10 A 40.896927 -72.906787 14.0 MISSING/NOT SPECIFIED NY SUFFOLK
483475 ST/C&L USMEMES Maine Forest Service None None 2005 2453689.5 319 None 5.0 ... NaN None 0.01 A 43.407530 -70.764500 14.0 MISSING/NOT SPECIFIED ME York
166478 FS USAZTNF Tonto National Forest OX None 2004 2453161.5 157 1628 9.0 ... 157.0 1715 0.20 A 34.180833 -111.335556 5.0 USFS AZ 7

10 rows × 22 columns

Describe

Let's round out our first look with summary statistics for each column.

raw_df.describe(include='all')
NWCG_REPORTING_AGENCY NWCG_REPORTING_UNIT_ID NWCG_REPORTING_UNIT_NAME FIRE_NAME COMPLEX_NAME FIRE_YEAR DISCOVERY_DATE DISCOVERY_DOY DISCOVERY_TIME STAT_CAUSE_CODE ... CONT_DOY CONT_TIME FIRE_SIZE FIRE_SIZE_CLASS LATITUDE LONGITUDE OWNER_CODE OWNER_DESCR STATE COUNTY
count 1880465 1880465 1880465 923276 5183 1.880465e+06 1.880465e+06 1.880465e+06 997827 1.880465e+06 ... 988934.000000 908292 1.880465e+06 1880465 1.880465e+06 1.880465e+06 1.880465e+06 1880465 1880465 1202317
unique 11 1640 1635 493633 1416 NaN NaN NaN 1440 NaN ... NaN 1441 NaN 7 NaN NaN NaN 16 52 3455
top ST/C&L USGAGAS Georgia Forestry Commission GRASS FIRE OSAGE-MIAMI COMPLEX NaN NaN NaN 1400 NaN ... NaN 1800 NaN B NaN NaN NaN MISSING/NOT SPECIFIED CA 5
freq 1377090 167123 167123 3983 54 NaN NaN NaN 20981 NaN ... NaN 38078 NaN 939376 NaN NaN NaN 1050835 189550 7576
mean NaN NaN NaN NaN NaN 2.003710e+03 2.453064e+06 1.647191e+02 NaN 5.979037e+00 ... 172.656766 NaN 7.452016e+01 NaN 3.678121e+01 -9.570494e+01 1.059658e+01 NaN NaN NaN
std NaN NaN NaN NaN NaN 6.663099e+00 2.434573e+03 9.003891e+01 NaN 3.483860e+00 ... 84.320348 NaN 2.497598e+03 NaN 6.139031e+00 1.671694e+01 4.404662e+00 NaN NaN NaN
min NaN NaN NaN NaN NaN 1.992000e+03 2.448622e+06 1.000000e+00 NaN 1.000000e+00 ... 1.000000 NaN 1.000000e-05 NaN 1.793972e+01 -1.788026e+02 0.000000e+00 NaN NaN NaN
25% NaN NaN NaN NaN NaN 1.998000e+03 2.451084e+06 8.900000e+01 NaN 3.000000e+00 ... 102.000000 NaN 1.000000e-01 NaN 3.281860e+01 -1.103635e+02 8.000000e+00 NaN NaN NaN
50% NaN NaN NaN NaN NaN 2.004000e+03 2.453178e+06 1.640000e+02 NaN 5.000000e+00 ... 181.000000 NaN 1.000000e+00 NaN 3.545250e+01 -9.204304e+01 1.400000e+01 NaN NaN NaN
75% NaN NaN NaN NaN NaN 2.009000e+03 2.455036e+06 2.300000e+02 NaN 9.000000e+00 ... 232.000000 NaN 3.300000e+00 NaN 4.082720e+01 -8.229760e+01 1.400000e+01 NaN NaN NaN
max NaN NaN NaN NaN NaN 2.015000e+03 2.457388e+06 3.660000e+02 NaN 1.300000e+01 ... 366.000000 NaN 6.069450e+05 NaN 7.033060e+01 -6.525694e+01 1.500000e+01 NaN NaN NaN

11 rows × 22 columns

Observations

Considering that our investigation is around predicting the cause of a wildfire using only information available at the time it was discovered, I've made the following initial observations about the data:

  • STAT_CAUSE_CODE and STAT_CAUSE_DESCR are related and represent the value that we are trying to predict. Before training, we'll drop STAT_CAUSE_DESCR in favor of the numerical STAT_CAUSE_CODE.
  • OWNER_CODE and OWNER_DESCR are related and describe the owner of the property where the fire was discovered. This is an interesting pair because it represents the management and usage of a particular piece of land. Before training, we'll drop OWNER_DESCR in favor of the numerical OWNER_CODE.
  • DISCOVERY_DATE, DISCOVERY_DOY, and DISCOVERY_TIME describe the time that a fire was discovered. DISCOVERY_DOY is the most interesting to our investigation due to its relation to climate and the usage patterns of a particular piece of land. DISCOVERY_TIME may be interesting, but it might be too fine-grained and it's missing many values - let's drop it for now. DISCOVERY_DATE is too specific to be useful.
  • LATITUDE and LONGITUDE are both very interesting due to their strong relationship to land cover, land use, and climate - all three big factors in wildfire occurrence.
  • STATE and COUNTY both categorically describe the location of a fire. STATE might be interesting due to its relation to land use patterns, and might also prove to be a useful generalization of the more specific LATITUDE and LONGITUDE.
  • COUNTY, while potentially interesting, has too many missing values. If we want to explore categorical location data more closely, we can add it back via geocoding during the data engineering phase.
  • A number of columns contain information about how a fire was addressed, not about what caused it. Let's ignore the following columns for now: NWCG_REPORTING_AGENCY, NWCG_REPORTING_UNIT_ID, NWCG_REPORTING_UNIT_NAME, FIRE_NAME, COMPLEX_NAME, CONT_DATE, CONT_DOY, CONT_TIME, FIRE_SIZE, FIRE_SIZE_CLASS.

This leaves us with the following interesting fields:

  • STAT_CAUSE_CODE
  • STAT_CAUSE_DESCR [for EDA]
  • OWNER_CODE
  • OWNER_DESCR
  • DISCOVERY_DOY
  • LATITUDE
  • LONGITUDE
  • STATE

Some things we can keep in our back pocket for future exploration:

  • look harder at DISCOVERY_TIME
  • look harder at DISCOVERY_DATE (see the sketch below)
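
On that last point: DISCOVERY_DATE is stored as a Julian day number (e.g. 2451852.5 in the sample above), which pandas can decode directly. A minimal sketch of how we might recover calendar dates when we come back to this, assuming the raw values stay in this format:

# DISCOVERY_DATE is a Julian day number; pandas converts these natively
# when given unit='D' and origin='julian'.
discovery_dates = pd.to_datetime(
    raw_df['DISCOVERY_DATE'], unit='D', origin='julian')
discovery_dates.dt.year.value_counts().sort_index()  # sanity check against FIRE_YEAR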

Create Human Readable Mappings

Before we drop our human-readable columns, let's create a set of mappings that we can use to associate numerical categories back to human-readable labels. We'll use these later.

Map Cause Code to Cause Description:

stat_cause_mapping = (
    raw_df
    .groupby(['STAT_CAUSE_DESCR', 'STAT_CAUSE_CODE'])
    .size()
    .to_frame()
    .reset_index()
    .drop(0, axis=1)
    .set_index('STAT_CAUSE_CODE')
    .sort_index()['STAT_CAUSE_DESCR']
)
stat_cause_mapping
STAT_CAUSE_CODE
1.0             Lightning
2.0         Equipment Use
3.0               Smoking
4.0              Campfire
5.0        Debris Burning
6.0              Railroad
7.0                 Arson
8.0              Children
9.0         Miscellaneous
10.0            Fireworks
11.0            Powerline
12.0            Structure
13.0    Missing/Undefined
Name: STAT_CAUSE_DESCR, dtype: object

Map Owner Code to Owner Description:

owner_code_mapping = (
    raw_df
    .groupby(['OWNER_DESCR', 'OWNER_CODE'])
    .size()
    .to_frame()
    .reset_index()
    .drop(0, axis=1)
    .set_index('OWNER_CODE')
    .sort_index()['OWNER_DESCR']
)
owner_code_mapping
OWNER_CODE
0.0                   FOREIGN
1.0                       BLM
2.0                       BIA
3.0                       NPS
4.0                       FWS
5.0                      USFS
6.0             OTHER FEDERAL
7.0                     STATE
8.0                   PRIVATE
9.0                    TRIBAL
10.0                      BOR
11.0                   COUNTY
12.0          MUNICIPAL/LOCAL
13.0         STATE OR PRIVATE
14.0    MISSING/NOT SPECIFIED
15.0        UNDEFINED FEDERAL
Name: OWNER_DESCR, dtype: object
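
Since these mappings are just pandas Series indexed by code, translating codes back to labels later is a one-liner. A quick sketch of the intended usage, with values taken from the tables above:

# Look up individual codes...
stat_cause_mapping.loc[1.0]    # -> 'Lightning'
owner_code_mapping.loc[8.0]    # -> 'PRIVATE'

# ...or relabel an entire column of codes, e.g. for plot legends.
cause_labels = raw_df['STAT_CAUSE_CODE'].map(stat_cause_mapping)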

Strip Data

Let's create a new dataframe that contains only the fields that we're interested in. This will reduce memory usage and help keep things tidy.

df = raw_df[[
    'STAT_CAUSE_CODE',
    'STAT_CAUSE_DESCR',
    'OWNER_CODE',
    'OWNER_DESCR',
    'DISCOVERY_DOY',
    'LATITUDE',
    'LONGITUDE',
    'STATE'
]].copy()
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1880465 entries, 0 to 1880464
Data columns (total 8 columns):
STAT_CAUSE_CODE     float64
STAT_CAUSE_DESCR    object
OWNER_CODE          float64
OWNER_DESCR         object
DISCOVERY_DOY       int64
LATITUDE            float64
LONGITUDE           float64
STATE               object
dtypes: float64(4), int64(1), object(3)
memory usage: 114.8+ MB
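
If memory becomes a concern later, one option to keep in our back pocket: the remaining object columns are low-cardinality, so converting them to pandas' category dtype should shrink the frame considerably. A sketch, not applied here:

# Optional: store the low-cardinality string columns as categoricals.
for col in ['STAT_CAUSE_DESCR', 'OWNER_DESCR', 'STATE']:
    df[col] = df[col].astype('category')
df.info()  # memory usage should drop noticeably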

Split Data

We need to split the data into a train set and a test set. The train set will be used to build our model, and the test set will be used to evaluate the model.

We will use sklearn's model_selection.train_test_split to split our dataframe into two.

Last, we will create a convenience list, _df, that lets us apply operations to both the train and test sets.

train_df, test_df = model_selection.train_test_split(df)

display(HTML('''
<p>
    Number of Training Rows: {}<br />
    Number of Test Rows: {}
</p>
'''.format(train_df.shape[0], test_df.shape[0])))

_df = [train_df, test_df]
Number of Training Rows: 1410348
Number of Test Rows: 470117
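
One design note: train_test_split defaults to a 75/25 split with a fresh random shuffle on every run. If we later want reproducible splits (and, given how imbalanced some causes are, matching class proportions in both sets), a variant we might switch to would look like this:

# Reproducible, stratified alternative (not used above): random_state pins
# the shuffle, and stratify keeps cause proportions consistent across sets.
train_df, test_df = model_selection.train_test_split(
    df,
    test_size=0.25,
    random_state=42,
    stratify=df['STAT_CAUSE_CODE'],
)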

Conclusion (for now)

Now that we've loaded our data, given it a high level look, cleaned it up a bit, and split it into a train and test set, we're in a good position to begin some Exploratory Data Analysis. Keep an eye here for subsequent posts that will get into:

  • Exploratory Data Analysis
  • Data Engineering
  • More EDA! (Including geo viz)
  • Model Training and Evaluation
  • Ensembling Models

I hope this was helpful and/or entertaining so far. Since I'm pretty new to this, I'm sure there are, or will be, errors - if you see any, please drop me a line -- I'd love to fix them right away!

Until next time,

Andrew