Predicting Wildfires - Part 3 - Data Engineering

14 February, 2018

Now that we've prepared our environment, loaded our base data, and started to understand it, let's turn our focus to engineering some new features.

Engineer Data

We'll start by engineering new features derived from our base data, and then fold in a couple of external data sources. In particular, let's build some features that describe geographic properties of the areas where each fire began. Two factors integral to the occurrence of wildfires, but only loosely captured in our base data, are land cover and human behavior.

Land cover plays a role in wildfire creation for a number of reasons. Vegetation can be more or less flammable depending on the time of year - coniferous forests can become especially combustible during a hot, dry summer. Certain land cover classes also lend themselves to different activities - forestry won't be prevalent in grasslands. To capture this in our model, we'll engineer a new feature that describes the land cover in the area surrounding each fire. We'll base the feature on the Multi-Resolution Land Characteristics Consortium's National Land Cover Database, and use a binning strategy to aggregate our base points geographically.

Land cover classes from the National Land Cover Database.

Human behavior also plays a large role in the occurrence of wildfires. As we saw during EDA, over 40% of wildfires occur as a result of Debris Burning or Arson -- both human behaviors. While a rough measure at best, the presence of humans and dwellings may correlate with the cause of wildfires - particularly those of human origin. We can use the US Census Grid, employing a binning strategy similar to the one above, to engineer a couple of features describing the human presence in the areas where wildfires occur.

Population density from the US Census Grid.

Both of the above engineering tasks require binning. Originally, I had intended to use a geohash binning technique. At that point in the project, we were still looking at data from Alaska, and unfortunately, at those latitudes geohash bins become quite distorted. To mitigate that distortion, I switched to Uber's H3 binning system.


A compact representation of California using Uber's H3 binning tool.

H3 provides a tool to generate an H3 index from a pair of coordinates (geoToH3), and another to produce the boundary shape for an H3 index (h3ToGeoBoundary). In conjunction, these tools allow us to assign a bin to each wildfire, and then use that bin's geography to extract geodata from the raster datasets.
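A minimal round trip (a sketch, assuming the H3 binaries live at /tools/h3/bin, as they do in the code further down):

import subprocess

# geoToH3 takes the resolution as an argument and reads "lat lon" lines from
# stdin; h3ToGeoBoundary reads H3 indexes from stdin and emits the boundary
# vertices for each.
index = subprocess.run(
    ['/tools/h3/bin/geoToH3', '5'],
    input=b'39.1217 -120.4367',
    stdout=subprocess.PIPE).stdout.decode('utf-8').strip()
boundary = subprocess.run(
    ['/tools/h3/bin/h3ToGeoBoundary'],
    input=str.encode(index),
    stdout=subprocess.PIPE).stdout.decode('utf-8')
print(index)
print(boundary)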

In any case, let's dive in.

Quantify Data

A couple of fields in our reduced set are still non-numeric. In particular, the STATE field is categorical text - let's convert it to a numeric label using sklearn.preprocessing.LabelEncoder.

label_encoder = sklearn.preprocessing.LabelEncoder()
for key, dataframe in fires_df.items():
    # Note: fitting per dataframe keeps codes consistent across train and test
    # only if both contain the same set of states.
    fires_df[key].loc[:,'STATE_CODE'] = label_encoder.fit_transform(dataframe['STATE'])

STAT_CAUSE_CODE is not zero-indexed - let's use our label encoder to add a new, zero-based set of labels.

label_encoder = sklearn.preprocessing.LabelEncoder()
for key, dataframe in fires_df.items():
    fires_df[key].loc[:,'STAT_CAUSE_CODE_ZERO'] = label_encoder.fit_transform(dataframe['STAT_CAUSE_CODE'])

Missing Data

None of the fields that we're concerned with have missing data. We'll skip this for now.
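To double-check that claim, a quick count of nulls works (a sketch - the column list here is illustrative, so swap in the fields you're actually using):

cols = ['STATE', 'DISCOVERY_DOY', 'LATITUDE', 'LONGITUDE', 'STAT_CAUSE_CODE']
for key, dataframe in fires_df.items():
    # Should print a dict of zeros for each dataframe.
    print(key, dataframe[cols].isnull().sum().to_dict())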

Discovery Week of Year

Let's create a feature that reduces the resolution of DISCOVERY_DOY from day of year to week of year.

for key, dataframe in fires_df.items():
    # Days 1-6 fall in week 1, days 7-13 in week 2, and so on up to week 53.
    dataframe['DISCOVERY_WEEK'] = dataframe['DISCOVERY_DOY']\
        .apply(lambda x: math.floor(x/7) + 1)
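As a quick sanity check (sketch), the new column should run from week 1 (day 1) through week 53 (days 364-366):

for key, dataframe in fires_df.items():
    assert dataframe['DISCOVERY_WEEK'].between(1, 53).all()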

Climate Regions

Let's create a feature that abstracts the STATE column already in our data. We'll use NOAA's U.S. Climate Regions to group states into larger units that share certain environmental characteristics.

climate_regions = {
    "northwest": ["WA","OR","ID"],
    "west": ["CA","NV"],
    "southwest": ["UT","CO","AZ","NM"],
    "northern_rockies": ["MT","ND","SD","WY","NE"],
    "upper_midwest": ["KS","OK","TX","AR","LA","MS"],
    "south": ["MN","WI","MI","IA"],
    "ohio_valley": ["MO","IL","IN","OH","WV","KY","TN"],
    "southeast": ["VA","NC","SC","GA","AL","FL"],
    "northeast": ["ME","NH","VT","NY","PA","MA","RI","CT","NJ","DE","MD","DC"],
    "alaska": ["AK",],
    "hawaii": ["HI"],
    "puerto_rico": ["PR"]
}

state_region_mapping = {}
for region, region_states in climate_regions.items():
    for state in region_states:
        state_region_mapping[state] = region

for key, dataframe in fires_df.items():
    dataframe['CLIMATE_REGION'] = dataframe['STATE']\
        .apply(lambda x: state_region_mapping[x])

label_encoder = sklearn.preprocessing.LabelEncoder()
for key, dataframe in fires_df.items():
    dataframe['CLIMATE_REGION_CODE'] = label_encoder.fit_transform(dataframe['CLIMATE_REGION'])

H3 Bins

In preparation for extracting data from geo rasters and performing other geoanalysis, let's bin our wildfires using uber/h3. Binning helps mitigate the variable resolution of our source data, reduces the computational complexity of feature engineering, and will hopefully help us avoid overfitting.

Since there are no Python bindings for H3 - although they are forthcoming - I took the approach of calling the H3 binaries using Python's subprocess module. This works fairly well, especially when calls are batched.

Calculate Bins for fires_df

Start by calculating a bin for each wildfire using geoToH3. We do this in batches to improve performance. We're using resolution 5; the table below shows how the resolutions compare.

H3 Resolution    Average Hexagon Area (km2)    Average Hexagon Edge Length (km)    Number of unique indexes
 0               4,250,546.8477000             1,107.712591000                                          122
 1                 607,220.9782429               418.676005500                                          842
 2                  86,745.8540347               158.244655800                                        5,882
 3                  12,392.2648621                59.810857940                                       41,162
 4                   1,770.3235517                22.606379400                                      288,122
 5                     252.9033645                 8.544408276                                    2,016,842
 6                      36.1290521                 3.229482772                                   14,117,882
 7                       5.1612932                 1.220629759                                   98,825,162
 8                       0.7373276                 0.461354684                                  691,776,122
 9                       0.1053325                 0.174375668                                4,842,432,842
10                       0.0150475                 0.065907807                               33,897,029,882
11                       0.0021496                 0.024910561                              237,279,209,162
12                       0.0003071                 0.009415526                            1,660,954,464,122
13                       0.0000439                 0.003559893                           11,626,681,248,842
14                       0.0000063                 0.001348575                           81,386,768,741,882
15                       0.0000009                 0.000509713                          569,707,381,193,162

def h3_for_chunk(chunk, precision):
    # Build one "lat lon" line per fire and pipe the whole batch through geoToH3.
    lat_lon_lines = "\n".join([
        "{} {}".format(row['LATITUDE'], row['LONGITUDE']) for index,row in chunk.iterrows()
    ])
    h3 = subprocess.run(
        ['/tools/h3/bin/geoToH3', str(precision)],
        input=str.encode(lat_lon_lines),
        stdout=subprocess.PIPE).stdout.decode('utf-8').splitlines()
    # geoToH3 emits one index per input line, so the output order matches chunk.index.
    return pd.DataFrame({
        "H3": h3
    }, index=chunk.index)

def chunker(seq, size):
    return (seq.iloc[pos:pos + size] for pos in range(0, len(seq), size))

for key in fires_df:
    print('Calculating bins for fires_df["{}"]'.format(key))
    h3_df = pd.DataFrame()
    with tqdm_notebook(total=fires_df[key].shape[0]) as pbar:
        for chunk in chunker(fires_df[key], 10000):
            h3_df = h3_df.append(h3_for_chunk(chunk, 5))
            pbar.update(chunk.shape[0])
    fires_df[key].drop('H3', 1, inplace=True, errors='ignore')
    fires_df[key] = fires_df[key].merge(
        h3_df,
        how='left',
        left_index=True,
        right_index=True,
        copy=False
    )
    display(fires_df[key].sample(5))
Calculating bins for fires_df["train"]
STAT_CAUSE_CODE STAT_CAUSE_DESCR OWNER_CODE OWNER_DESCR DISCOVERY_DATE DISCOVERY_DOY DISCOVERY_TIME LATITUDE LONGITUDE STATE STATE_CODE STAT_CAUSE_CODE_ZERO DISCOVERY_WEEK CLIMATE_REGION CLIMATE_REGION_CODE H3
1254933 5.0 Debris Burning 8.0 PRIVATE 2452145.5 236 1405 33.229500 -82.787100 GA 9 4 34 southeast 5 8544c06bfffffff
1511283 7.0 Arson 14.0 MISSING/NOT SPECIFIED 2455642.5 81 1608 31.047471 -85.837722 AL 0 6 12 southeast 5 8544e18bfffffff
601688 5.0 Debris Burning 14.0 MISSING/NOT SPECIFIED 2452544.5 270 None 31.677220 -94.518610 TX 41 4 39 upper_midwest 7 85446bc7fffffff
1302225 7.0 Arson 14.0 MISSING/NOT SPECIFIED 2454212.5 112 1116 42.788903 -75.964293 NY 32 6 17 northeast 0 852b8b43fffffff
615075 5.0 Debris Burning 14.0 MISSING/NOT SPECIFIED 2453447.5 77 None 33.083330 -94.300550 TX 41 4 12 upper_midwest 7 85444d0bfffffff
Calculating bins for fires_df["test"]
STAT_CAUSE_CODE STAT_CAUSE_DESCR OWNER_CODE OWNER_DESCR DISCOVERY_DATE DISCOVERY_DOY DISCOVERY_TIME LATITUDE LONGITUDE STATE STATE_CODE STAT_CAUSE_CODE_ZERO DISCOVERY_WEEK CLIMATE_REGION CLIMATE_REGION_CODE H3
1017024 5.0 Debris Burning 14.0 MISSING/NOT SPECIFIED 2449820.5 103 None 35.708300 -81.561700 NC 25 4 15 southeast 5 8544db97fffffff
1145718 5.0 Debris Burning 8.0 PRIVATE 2449504.5 152 None 42.532810 -123.148780 OR 35 4 22 northwest 2 852818dbfffffff
85133 4.0 Campfire 5.0 USFS 2450280.5 198 1300 39.121667 -120.436667 CA 3 3 29 west 8 8528324ffffffff
1617233 7.0 Arson 8.0 PRIVATE 2455983.5 57 1315 35.597400 -82.625850 NC 25 6 9 southeast 5 8544d983fffffff
945685 8.0 Children 14.0 MISSING/NOT SPECIFIED 2450529.5 81 None 37.483300 -77.616700 VA 43 7 12 southeast 5 852a8dd7fffffff

Isolate Bins

Identify the unique bins in the wildfire dataframes and create a new GeoDataFrame, bins, containing only the bins. GeoDataFrame comes from the GeoPandas project, which adds a layer of geographic processing on top of Pandas. For our use, we're most concerned with the ability to store and plot geographic shapes. Under the covers, GeoPandas uses the Shapely library for geometric analysis.

bins = gpd.GeoDataFrame({
    'H3': np.unique(np.concatenate((fires_df["train"].H3.unique(), fires_df["test"].H3.unique())))
})

print(bins.info())
bins.sample(5)
<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 25999 entries, 0 to 25998
Data columns (total 1 columns):
H3    25999 non-null object
dtypes: object(1)
memory usage: 203.2+ KB
None
H3
7418 8526eabbfffffff
20399 85445253fffffff
19060 852b1e3bfffffff
9713 85279033fffffff
3543 85266c37fffffff

Calculate Bounds

Calculate bounds for each bin in bins. This should be batched as well - it takes quite a while to run.

def h3_bounds_for_bin(h3_bin):
    h3_bounds = subprocess.run(
        ['/tools/h3/bin/h3ToGeoBoundary'],
        input=str.encode(h3_bin),
        stdout=subprocess.PIPE).stdout.decode('utf-8').splitlines()
    coords = []
    # Skip the header line and the braces; each remaining line is "lat lng".
    for coord in h3_bounds[2:-1]:
        lon_lat = coord.lstrip().split(" ")
        # The binary reports longitudes in the 0-360 range, so subtract 360 to
        # recover western-hemisphere longitudes; Polygon wants (lon, lat) order.
        coords.append(((float(lon_lat[1]) - 360.), float(lon_lat[0])))
    return shapely.geometry.Polygon(coords)

bins['GEOMETRY'] = bins['H3'].apply(h3_bounds_for_bin)

print(bins.info())
bins.sample(10)
<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 27099 entries, 0 to 27098
Data columns (total 2 columns):
H3          27099 non-null object
GEOMETRY    27099 non-null object
dtypes: object(2)
memory usage: 423.5+ KB
None
H3 GEOMETRY
11178 85280243fffffff POLYGON ((-123.003254979 40.831762, -122.94977...
18725 852aa0affffffff POLYGON ((-77.10954865399998 41.466234529, -77...
10585 852793c3fffffff POLYGON ((-110.871721099 46.713100841, -110.80...
1670 8526242bfffffff POLYGON ((-95.19630366400003 46.585965214, -95...
19727 852af2affffffff POLYGON ((-76.58726582000003 36.178472701, -76...
19912 852b1857fffffff POLYGON ((-67.882502542 44.775920586, -68.0000...
24820 85488b83fffffff POLYGON ((-101.943425426 31.173151717, -101.87...
4800 85268eb3fffffff POLYGON ((-105.272718304 38.958513793, -105.20...
3552 85265ed3fffffff POLYGON ((-92.729148119 37.080579662, -92.6569...
26521 8548da13fffffff POLYGON ((-104.892143771 35.715818432, -104.82...

Now we have a geodataframe, bins, that contains all of the bins whose geography contains at least one wildfire.
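As an aside, the batching mentioned above might look something like the sketch below. It assumes h3ToGeoBoundary emits one block per input index - a header line, an opening brace, the coordinate lines, and a closing brace - which is the format the unbatched parser above relies on:

def h3_bounds_for_bins(h3_bins):
    # One subprocess call for the whole batch.
    out = subprocess.run(
        ['/tools/h3/bin/h3ToGeoBoundary'],
        input=str.encode("\n".join(h3_bins)),
        stdout=subprocess.PIPE).stdout.decode('utf-8')
    polygons = []
    # Split the stream into per-bin blocks on the closing brace.
    for block in out.split("}")[:-1]:
        lines = [line for line in block.splitlines() if line.strip()]
        coords = []
        for coord in lines[2:]:  # skip the header line and the opening brace
            lat, lon = coord.split()
            coords.append((float(lon) - 360., float(lat)))
        polygons.append(shapely.geometry.Polygon(coords))
    return polygons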

Land Cover Classification

As discussed above, land cover has a direct effect on the occurrence of wildfires. In order to incorporate land cover into our model, we'll engineer a feature based on the USGS's National Land Cover Database 2011, a product of the Multi-Resolution Land Characteristics (MRLC) Consortium. The data comes in a raster format with a 30 meter resolution, and each pixel is assigned one of 20 land cover classes. A Land Use And Land Cover Classification System For Use With Remote Sensor Data provides some interesting context for land cover classification.

We'll use Mapbox's rasterio along with the rasterstats package to extract land cover data for each of our bins. The data will come back as a dict containing a pixel count for each land cover class; we'll then transform it into a format more useful for our model.

The fact that we're using land cover data from 2011 alone threatens the value of this feature. While annual land cover data isn't available, there are editions from 2001 and 2006. An improvement would be to select whichever of the three editions - 2001, 2006, or 2011 - lies closest to each fire, as sketched below.
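For instance (a sketch - the 2001 and 2006 raster paths here are hypothetical), each fire's discovery year can be recovered from the Julian DISCOVERY_DATE and matched to the nearest edition:

# Hypothetical paths - point these at wherever the three editions actually live.
nlcd_paths = {
    2001: "/data/188-million-us-wildfires/src/nlcd_2001_landcover.img",
    2006: "/data/188-million-us-wildfires/src/nlcd_2006_landcover.img",
    2011: "/data/188-million-us-wildfires/src/nlcd_2011_landcover.img",
}

def nearest_nlcd_edition(julian_date):
    # DISCOVERY_DATE is stored as a Julian date, e.g. 2452145.5.
    year = pd.to_datetime(julian_date, unit='D', origin='julian').year
    return nlcd_paths[min(nlcd_paths, key=lambda edition: abs(edition - year))]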

We'll start by making a copy of our bins dataframe so that we can experiment without trashing upstream work.

bins_land_cover = bins.copy()

Let's define the path to our data file.

land_cover_path = "/data/188-million-us-wildfires/src/nlcd_2011_landcover_2011_edition_2014_10_10/nlcd_2011_landcover_2011_edition_2014_10_10.img"

Define Coordinate Transformation

Before we use rasterstats to count the values in each bin, we use pyproj to reproject our bins' geometry into the CRS of our data source.

# Partial application: transform from EPSG:4326 (lat/lon) into the NLCD's
# Albers Equal Area CRS.
project = functools.partial(
    pyproj.transform,
    pyproj.Proj(init='EPSG:4326'),
    pyproj.Proj('+proj=aea +lat_1=29.5 +lat_2=45.5 +lat_0=23 +lon_0=-96 +x_0=0 +y_0=0 +ellps=GRS80 +datum=NAD83 +units=m +no_defs'))
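A quick way to sanity check the projection (sketch): reproject a single bin and eyeball the resulting coordinates, which should be in meters rather than degrees.

import shapely.ops

example = shapely.ops.transform(project, bins['GEOMETRY'].iloc[0])
print(example.bounds)  # Albers Equal Area coordinates, in meters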

Raster Samples

Before we extract information from our source raster, let's preview the contents to sanity check.

land_cover_data = rasterio.open(land_cover_path)

def render_masked_bin(row):
    return rasterio.mask.mask(
        land_cover_data,
        [shapely.geometry.mapping(transform(project, row['GEOMETRY']))],
        crop=True
    )

masked_bin_images = bins_land_cover.sample(6).apply(render_masked_bin, 1).tolist()

fig, plots = plt.subplots(3, 2)
for index, val in enumerate(masked_bin_images):
    my_plot = plots[math.floor(index/2)][index % 2]
    rasterio.plot.show(val[0], ax=my_plot, transform=val[1], cmap='gist_earth')
    my_plot.axis('on')

Masked land cover rasters for a sample of six bins.

Calculate Land Cover

Now that we've confirmed that the data we're masking from our source raster looks reasonable, let's get to extracting it. We'll employ zonal_stats with categorical=True - this returns a dict of counts for each value found in the masked region.

land_cover_path = "/data/188-million-us-wildfires/src/nlcd_2011_landcover_2011_edition_2014_10_10/nlcd_2011_landcover_2011_edition_2014_10_10.img"


def land_cover_for_geometry(geometry):
    stats = zonal_stats(
        [shapely.geometry.mapping(transform(project, geometry))],
        land_cover_path,
        nodata=-999,
        categorical=True
    )
    return stats[0]

bins_land_cover = bins.copy()
bins_land_cover["LAND_COVER"] = bins_land_cover["GEOMETRY"].apply(land_cover_for_geometry)

print(bins_land_cover.info())
bins_land_cover.sample(5)
<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 27099 entries, 0 to 27098
Data columns (total 3 columns):
H3            27099 non-null object
LAND_COVER    27099 non-null object
GEOMETRY      27099 non-null object
dtypes: object(3)
memory usage: 635.2+ KB
None
H3 LAND_COVER GEOMETRY
16644 8529ac17fffffff {'21': 8414, '22': 188, '23': 7, '31': 3385, '... POLYGON ((-119.761306269 34.848452511, -119.70...
6266 8526b46bfffffff {'11': 2814, '12': 448, '31': 1957, '41': 6709... POLYGON ((-109.733245976 43.359435343, -109.66...
25823 8548c8a3fffffff {'11': 62, '21': 946, '22': 1538, '23': 134, '... POLYGON ((-110.416144017 34.938354721, -110.35...
24302 8544f17bfffffff {'11': 977, '21': 24201, '22': 18629, '23': 42... POLYGON ((-82.27732916000002 31.188086238, -82...
8556 85270b1bfffffff {'11': 30430, '21': 6245, '22': 3832, '23': 11... POLYGON ((-93.92304508000001 47.431408525, -94...

Calculate Primary Class

Let's create a feature describing the primary land cover Level II class for each bin.

def primary_land_cover(land_cover):
    # The class with the largest pixel count wins; ties resolve arbitrarily.
    return max(land_cover, key=land_cover.get)

bins_land_cover["PRIMARY_LAND_COVER"] = bins_land_cover["LAND_COVER"].apply(primary_land_cover)

print(bins_land_cover.info())
bins_land_cover.sample(5)
<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 27099 entries, 0 to 27098
Data columns (total 4 columns):
H3                    27099 non-null object
LAND_COVER            27099 non-null object
GEOMETRY              27099 non-null object
PRIMARY_LAND_COVER    27099 non-null object
dtypes: object(4)
memory usage: 846.9+ KB
None
H3 LAND_COVER GEOMETRY PRIMARY_LAND_COVER
8427 85270837fffffff {'11': 21425, '21': 4510, '22': 857, '23': 235... POLYGON ((-92.87331802199998 48.073667863, -92... 90
18892 852aa3c3fffffff {'11': 3339, '21': 21052, '22': 7207, '23': 22... POLYGON ((-78.347508672 41.131021516, -78.4469... 41
24658 854883d7fffffff {'11': 177, '21': 8740, '22': 1837, '23': 61, ... POLYGON ((-99.61025040200002 28.581772583, -99... 52
26656 8548dcb7fffffff {'21': 1208, '22': 612, '23': 106, '31': 677, ... POLYGON ((-106.971115621 34.134266323, -106.90... 52
8010 8526e9c3fffffff {'11': 392, '21': 10943, '22': 2740, '23': 645... POLYGON ((-94.26269021799999 35.502182409, -94... 41

Categorize Primary Class

Let's group each bin's primary land cover class into its Level I class.

land_cover_categories = {
    "water": [11, 12],
    "developed": [21, 22, 23, 24],
    "barren": [31],
    "forest": [41, 42, 43],
    "shrubland": [51, 52],
    "herbaceous": [71, 72, 73, 74],
    "planted_cultivated": [81, 82],
    "wetlands": [90, 95]
}

land_cover_categories_map = {}
for category, values in land_cover_categories.items():
    for value in values:
        land_cover_categories_map[value] = category

def land_cover_category_for_class(land_cover_class):
    return land_cover_categories_map.get(int(land_cover_class), 'unknown')

bins_land_cover["PRIMARY_LAND_COVER_CATEGORY"] = bins_land_cover["PRIMARY_LAND_COVER"].apply(land_cover_category_for_class)

print(bins_land_cover.info())
bins_land_cover.sample(5)
<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 27099 entries, 0 to 27098
Data columns (total 5 columns):
H3                             27099 non-null object
LAND_COVER                     27099 non-null object
GEOMETRY                       27099 non-null object
PRIMARY_LAND_COVER             27099 non-null object
PRIMARY_LAND_COVER_CATEGORY    27099 non-null object
dtypes: object(5)
memory usage: 1.0+ MB
None
H3 LAND_COVER GEOMETRY PRIMARY_LAND_COVER PRIMARY_LAND_COVER_CATEGORY
9264 85275093fffffff {'11': 14071, '21': 10506, '22': 254, '23': 19... POLYGON ((-90.94032905199998 46.288750521, -91... 90 wetlands
21200 85444d57fffffff {'11': 1358, '21': 11153, '22': 4413, '23': 17... POLYGON ((-94.09936341299999 32.872628429, -94... 42 forest
2062 85262e4bfffffff {'11': 4610, '21': 13418, '22': 1594, '23': 44... POLYGON ((-95.42043053700002 44.125116576, -95... 82 planted_cultivated
1727 8526250ffffffff {'11': 693, '21': 9651, '22': 2978, '23': 942,... POLYGON ((-94.09905895100002 46.105765621, -94... 81 planted_cultivated
12075 85283457fffffff {'11': 2115, '21': 14075, '22': 2299, '23': 57... POLYGON ((-121.731573885 37.047433286, -121.67... 43 forest

Encode

Employ LabelEncoder to assign a numeric value to each category.

# These encodings live on the bins dataframe, so there's no need to loop over
# fires_df here.
label_encoder = sklearn.preprocessing.LabelEncoder()
bins_land_cover.loc[:,'PRIMARY_LAND_COVER_CODE'] = label_encoder.fit_transform(bins_land_cover['PRIMARY_LAND_COVER'])

label_encoder = sklearn.preprocessing.LabelEncoder()
bins_land_cover.loc[:,'PRIMARY_LAND_COVER_CATEGORY_CODE'] = label_encoder.fit_transform(bins_land_cover['PRIMARY_LAND_COVER_CATEGORY'])

print(bins_land_cover.info())
bins_land_cover.sample(5)
<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 27099 entries, 0 to 27098
Data columns (total 7 columns):
H3                                  27099 non-null object
LAND_COVER                          27099 non-null object
GEOMETRY                            27099 non-null object
PRIMARY_LAND_COVER                  27099 non-null object
PRIMARY_LAND_COVER_CATEGORY         27099 non-null object
PRIMARY_LAND_COVER_CODE             27099 non-null int64
PRIMARY_LAND_COVER_CATEGORY_CODE    27099 non-null int64
dtypes: int64(2), object(5)
memory usage: 1.4+ MB
None
H3 LAND_COVER GEOMETRY PRIMARY_LAND_COVER PRIMARY_LAND_COVER_CATEGORY PRIMARY_LAND_COVER_CODE PRIMARY_LAND_COVER_CATEGORY_CODE
12157 8528806bfffffff {'11': 250, '31': 5, '41': 32, '42': 178500, '... POLYGON ((-115.591008981 44.7829381, -115.5268... 42 forest 8 2
22925 8544cb47fffffff {'11': 448, '21': 13178, '22': 5027, '23': 156... POLYGON ((-84.00949905599998 36.478531766, -83... 41 forest 7 2
13345 85289dc3fffffff {'11': 1694, '21': 3039, '22': 95, '23': 13, '... POLYGON ((-115.582331357 46.102164083, -115.51... 42 forest 8 2
12794 85289163fffffff {'11': 981, '21': 3125, '22': 225, '23': 21, '... POLYGON ((-113.068429643 45.777292588, -113.00... 42 forest 8 2
13156 85289a03fffffff {'11': 296, '31': 10575, '42': 187046, '43': 4... POLYGON ((-113.206564175 47.439544808, -113.13... 42 forest 8 2

Land Cover Percentage

After all of the above categorizing, let's make some continuous values. We'll create a new column for each land cover class, describing what fraction of each bin is covered by that class. Since not every class appears in every bin, the resulting columns will contain some nulls - let's make sure to set those to 0.

# Drop any percentage columns left over from a previous run.
drop_cols = [c for c in bins_land_cover.columns if c[:15] == "LAND_COVER_PCT_"]
bins_land_cover = bins_land_cover.drop(drop_cols, 1)

def land_cover_percentages(row):
    # Total pixel count for the bin, then one fractional column per class.
    total = sum(int(val) for val in row['LAND_COVER'].values())
    for lc_class, val in row['LAND_COVER'].items():
        row['LAND_COVER_PCT_{}'.format(lc_class)] = val / total
    return row

bins_land_cover = bins_land_cover.apply(land_cover_percentages, 1)

bins_land_cover.loc[
    :,[c for c in bins_land_cover.columns if c[:15] == "LAND_COVER_PCT_"]
] = bins_land_cover[[c for c in bins_land_cover.columns if c[:15] == "LAND_COVER_PCT_"]].fillna(0)

print(bins_land_cover.info())
bins_land_cover.sample(5)
<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 27099 entries, 0 to 27098
Data columns (total 7 columns):
H3                                  27099 non-null object
LAND_COVER                          27099 non-null object
PRIMARY_LAND_COVER                  27099 non-null object
PRIMARY_LAND_COVER_CATEGORY         27099 non-null object
PRIMARY_LAND_COVER_CATEGORY_CODE    27099 non-null int64
PRIMARY_LAND_COVER_CODE             27099 non-null int64
GEOMETRY                            27099 non-null object
dtypes: int64(2), object(5)
memory usage: 1.4+ MB
None
H3 LAND_COVER PRIMARY_LAND_COVER PRIMARY_LAND_COVER_CATEGORY PRIMARY_LAND_COVER_CATEGORY_CODE PRIMARY_LAND_COVER_CODE GEOMETRY
18093 852a8b83fffffff {'11': 1779, '21': 8729, '22': 2660, '23': 769... 41 forest 2 7 POLYGON ((-78.600277807 37.661639755, -78.6924...
10184 852783b7fffffff {'11': 47, '21': 800, '22': 52, '31': 1537, '4... 71 herbaceous 3 11 POLYGON ((-103.574894942 45.204828189, -103.50...
26470 8548d84ffffffff {'11': 10, '21': 1263, '22': 75, '31': 23, '41... 42 forest 2 8 POLYGON ((-105.458776659 35.348940866, -105.39...
14779 8528f03bfffffff {'11': 20645, '21': 14749, '22': 7655, '23': 2... 43 forest 2 9 POLYGON ((-123.601736884 46.320872208, -123.54...
7857 8526e46ffffffff {'11': 1041, '21': 10898, '22': 1571, '23': 32... 82 planted_cultivated 4 13 POLYGON ((-94.99411935400002 38.81449329, -94....

Merge onto fires_df

Merge bins_land_cover onto our wildfire data, keying on the H3 bin column.

for key, dataframe in fires_df.items():
    fires_df[key] = dataframe.merge(bins_land_cover, on="H3", how="left")
    display(fires_df[key].sample(5))
STAT_CAUSE_CODE STAT_CAUSE_DESCR OWNER_CODE OWNER_DESCR DISCOVERY_DATE DISCOVERY_DOY DISCOVERY_TIME LATITUDE LONGITUDE STATE ... DISCOVERY_WEEK CLIMATE_REGION CLIMATE_REGION_CODE H3 LAND_COVER PRIMARY_LAND_COVER PRIMARY_LAND_COVER_CATEGORY PRIMARY_LAND_COVER_CATEGORY_CODE PRIMARY_LAND_COVER_CODE GEOMETRY
324193 7.0 Arson 14.0 MISSING/NOT SPECIFIED 2451883.5 340 None 35.077500 -81.768100 SC ... 49 southeast 5 8544d8bbfffffff {'11': 3373, '21': 25890, '22': 7596, '23': 22... 41 forest 2 7 POLYGON ((-80.84926178299997 35.39708965, -80....
385381 1.0 Lightning 1.0 BLM 2451821.5 278 1250 36.167500 -115.538900 NV ... 40 west 8 852986b7fffffff {'11': 8, '21': 870, '22': 239, '23': 4, '31':... 52 shrubland 5 10 POLYGON ((-118.837357878 38.868465355, -118.78...
1010417 4.0 Campfire 5.0 USFS 2451812.5 269 1200 35.103056 -111.540833 AZ ... 39 southwest 6 8548cd2ffffffff {'11': 3208, '21': 1196, '22': 237, '23': 17, ... 42 forest 2 8 POLYGON ((-110.797864332 34.294343747, -110.73...
953964 2.0 Equipment Use 14.0 MISSING/NOT SPECIFIED 2453227.5 223 None 38.720000 -120.731944 CA ... 32 west 8 852832cffffffff {'11': 1645, '21': 14866, '22': 3334, '23': 56... 42 forest 2 8 POLYGON ((-122.09915726 37.494361459, -122.046...
827733 2.0 Equipment Use 8.0 PRIVATE 2452828.5 189 1204 42.854400 -112.013600 ID ... 28 northwest 2 8528b203fffffff {'21': 1209, '22': 1, '31': 6, '41': 34553, '4... 52 shrubland 5 10 POLYGON ((-111.859646108 42.561970602, -111.79...

5 rows × 22 columns

STAT_CAUSE_CODE STAT_CAUSE_DESCR OWNER_CODE OWNER_DESCR DISCOVERY_DATE DISCOVERY_DOY DISCOVERY_TIME LATITUDE LONGITUDE STATE ... DISCOVERY_WEEK CLIMATE_REGION CLIMATE_REGION_CODE H3 LAND_COVER PRIMARY_LAND_COVER PRIMARY_LAND_COVER_CATEGORY PRIMARY_LAND_COVER_CATEGORY_CODE PRIMARY_LAND_COVER_CODE GEOMETRY
269557 7.0 Arson 14.0 MISSING/NOT SPECIFIED 2451393.5 215 1532 34.683344 -89.124754 MS ... 31 upper_midwest 7 85445b7bfffffff {'11': 3514, '21': 9513, '22': 3877, '23': 124... 41 forest 2 7 POLYGON ((-90.98733658999998 32.724780019, -90...
689 2.0 Equipment Use 1.0 BLM 2448868.5 247 1047 34.700000 -117.167500 CA ... 36 west 8 8529a3d7fffffff {'21': 913, '22': 172, '23': 35, '31': 1615, '... 52 shrubland 5 10 POLYGON ((-116.50303393 32.759248771, -116.446...
141424 3.0 Smoking 14.0 MISSING/NOT SPECIFIED 2452647.5 8 None 29.166670 -82.200000 FL ... 2 southeast 5 8544f413fffffff {'11': 152, '21': 46492, '22': 25796, '23': 92... 42 forest 2 8 POLYGON ((-81.44679156199999 29.448254503, -81...
83447 1.0 Lightning 1.0 BLM 2453964.5 229 1412 42.964600 -118.326900 OR ... 33 northwest 2 8528aec3fffffff {'11': 2205, '31': 22, '42': 2988, '52': 27022... 52 shrubland 5 10 POLYGON ((-113.932881084 42.180656872, -113.86...
61411 7.0 Arson 12.0 MUNICIPAL/LOCAL 2452674.5 35 None 34.200500 -95.299833 OK ... 6 upper_midwest 7 8526cd83fffffff {'11': 396, '21': 8681, '22': 449, '23': 19, '... 41 forest 2 7 POLYGON ((-97.09674050699999 34.066314039, -97...

5 rows × 22 columns

At last, we have a handful of features related to land cover on our main dataframe. Let's move on to some other features!

Census Data

In addition to land cover, human behavior plays a large role in the creation of wildfires. Similarly to how we engineered features around land cover, we can bring in an external dataset to engineer some features around population and human behavior. Fortunately, it's possible to obtain 2010 US Census data in a raster format, allowing us to employ the same techniques that we did for land cover.


The US Census publishes a number of interesting metrics; for our purposes, we will look at Population Density, Seasonal Housing, Housing Units, and Occupied Housing Units, taking the mean of each metric over each bin.

Similarly to the Land Cover data, this method is flawed in its use of data from 2010 alone. Since the wildfires we're looking at occurred in the years 1995-2015, 2010 census data may not be highly relevant to some of the earlier or later datapoints.

As with Land Cover, start by creating a new dataframe to house our work with the census data.

census_data = pd.DataFrame(
    {"H3": bins['H3'].copy()},
    index=bins.index
)

Raster Samples

Before we extract information from our source raster, let's preview the contents to sanity check.

census_files = {
    "Population Density": "/data/188-million-us-wildfires/src/usgrid_data_2010/uspop10.tif",
    "Seasonal Housing": "/data/188-million-us-wildfires/src/usgrid_data_2010/ussea10.tif",
    "Housing Units": "/data/188-million-us-wildfires/src/usgrid_data_2010/ushu10.tif",
    "Occupied Housing Units": "/data/188-million-us-wildfires/src/usgrid_data_2010/usocc10.tif"
}

def render_bin(row):
    output = []
    for index, value in census_files.items():
        # Use a local name so we don't shadow the census_data dataframe above.
        raster = rasterio.open(value)
        output.append(rasterio.mask.mask(
            raster,
            [shapely.geometry.mapping(row['GEOMETRY'])],
            crop=True,
            nodata=0
        ))
    return output

masked_bin_images = bins_land_cover.sample(2).apply(render_bin, 1).tolist()

fig, plots = plt.subplots(4, 2)
for index, val in enumerate(masked_bin_images):
    for subindex, image in enumerate(val):
        my_plot = plots[subindex][index]
        rasterio.plot.show(image[0], ax=my_plot, transform=image[1], cmap='inferno')

Masked census rasters for a sample of two bins - one column per bin, one row per census metric.

Looks reasonable - let's create our features!

Population Density

Starting with Population Density, use zonal_stats to extract the mean value for each bin, and add the values as a new column to our dataframe.

census_population_path = "/data/188-million-us-wildfires/src/usgrid_data_2010/uspop10.tif"

def mean_population_density_for_geometry(geometry):
    # No reprojection here - unlike the NLCD raster, the census rasters are
    # already in lat/lon.
    stats = zonal_stats(
        [shapely.geometry.mapping(geometry)],
        census_population_path,
        nodata=0,
        stats=["mean"]
    )
    return stats[0]["mean"]

census_data = census_data.drop("MEAN_POPULATION_DENSITY", 1, errors='ignore')
# Note: nodata=0 masks zeros, but the raster's own large negative nodata
# sentinel can still leak through (visible in the sample below); the fillna and
# -inf replacement don't catch it.
census_data["MEAN_POPULATION_DENSITY"] = \
    bins["GEOMETRY"].apply(mean_population_density_for_geometry).fillna(value=0).replace([-np.inf], 0)

print(census_data.info())
census_data.sample(5)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27099 entries, 0 to 27098
Data columns (total 2 columns):
H3                         27099 non-null object
MEAN_POPULATION_DENSITY    27099 non-null float64
dtypes: float64(1), object(1)
memory usage: 423.5+ KB
None
H3 MEAN_POPULATION_DENSITY
14506 8528d243fffffff -9.426104e+35
16105 85299d47fffffff 0.000000e+00
11505 85281523fffffff 1.346886e+00
11892 8528311bfffffff 0.000000e+00
182 8512da2bfffffff 0.000000e+00

Seasonal Housing

census_seasonal_path = "/data/188-million-us-wildfires/src/usgrid_data_2010/ussea10.tif"

def mean_seasonal_housing_for_geometry(geometry):
    stats = zonal_stats(
        [shapely.geometry.mapping(geometry)],
        census_seasonal_path,
        nodata=0,
        stats=["mean"]
    )
    return stats[0]["mean"]

census_data = census_data.drop("MEAN_SEASONAL_HOUSING", 1, errors='ignore')
census_data["MEAN_SEASONAL_HOUSING"] = \
    bins["GEOMETRY"].apply(mean_seasonal_housing_for_geometry).fillna(value=0).replace([-np.inf], 0)

print(census_data.info())
census_data.sample(5)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27099 entries, 0 to 27098
Data columns (total 3 columns):
H3                         27099 non-null object
MEAN_POPULATION_DENSITY    27099 non-null float64
MEAN_SEASONAL_HOUSING      27099 non-null float64
dtypes: float64(2), object(1)
memory usage: 635.2+ KB
None
H3 MEAN_POPULATION_DENSITY MEAN_SEASONAL_HOUSING
23525 8544dd5bfffffff 37.240550 0.248122
19559 852ad463fffffff 27.189120 0.147108
9871 8527704ffffffff 0.000000 0.000000
13150 85289a8ffffffff 1.460237 0.939296
11226 8528063bfffffff 1.233903 0.260069

Housing Units

census_housing_units = "/data/188-million-us-wildfires/src/usgrid_data_2010/ushu10.tif"

def mean_housing_units_for_geometry(geometry):
    stats = zonal_stats(
        [shapely.geometry.mapping(geometry)],
        census_housing_units,
        nodata=0,
        stats=["mean"]
    )
    return stats[0]["mean"]

census_data = census_data.drop("MEAN_HOUSING_UNITS", 1, errors='ignore')
census_data["MEAN_HOUSING_UNITS"] = \
    bins["GEOMETRY"].apply(mean_housing_units_for_geometry).fillna(value=0).replace([-np.inf], 0)

print(census_data.info())
census_data.sample(5)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27099 entries, 0 to 27098
Data columns (total 4 columns):
H3                         27099 non-null object
MEAN_POPULATION_DENSITY    27099 non-null float64
MEAN_SEASONAL_HOUSING      27099 non-null float64
MEAN_HOUSING_UNITS         27099 non-null float64
dtypes: float64(3), object(1)
memory usage: 846.9+ KB
None
H3 MEAN_POPULATION_DENSITY MEAN_SEASONAL_HOUSING MEAN_HOUSING_UNITS
17269 852a1613fffffff 4.882109 1.034020 3.289251
3010 8526518bfffffff 7.562320 0.087591 3.296344
13867 8528ab9bfffffff 0.030536 0.058152 0.032559
17224 852a1483fffffff 441.285687 1.192067 183.367477
8780 8527402bfffffff 0.000000 0.000000 0.000000

Occupied Housing Units

census_occupied_housing_units = "/data/188-million-us-wildfires/src/usgrid_data_2010/usocc10.tif"

def mean_occupied_housing_units_for_geometry(geometry):
    stats = zonal_stats(
        [shapely.geometry.mapping(geometry)],
        census_occupied_housing_units,
        nodata=0,
        stats=["mean"]
    )
    return stats[0]["mean"]

census_data = census_data.drop("MEAN_OCCUPIED_HOUSING_UNITS", 1, errors='ignore')
census_data["MEAN_OCCUPIED_HOUSING_UNITS"] = \
    bins["GEOMETRY"]\
        .apply(mean_occupied_housing_units_for_geometry)\
        .fillna(value=0)\
        .replace([-np.inf], 0)

print(census_data.info())
census_data.sample(5)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27099 entries, 0 to 27098
Data columns (total 5 columns):
H3                             27099 non-null object
MEAN_POPULATION_DENSITY        27099 non-null float64
MEAN_SEASONAL_HOUSING          27099 non-null float64
MEAN_HOUSING_UNITS             27099 non-null float64
MEAN_OCCUPIED_HOUSING_UNITS    27099 non-null float64
dtypes: float64(4), object(1)
memory usage: 1.0+ MB
None
H3 MEAN_POPULATION_DENSITY MEAN_SEASONAL_HOUSING MEAN_HOUSING_UNITS MEAN_OCCUPIED_HOUSING_UNITS
23738 8544e2abfffffff 25.395138 0.242860 10.882212 10.882212
21173 85444d37fffffff 0.000000 0.000000 0.000000 0.000000
4213 85268093fffffff 0.060832 0.141386 0.169451 0.169451
8087 8526ec57fffffff 1.851572 0.138690 0.763857 0.763857
5865 8526aa47fffffff 29.909779 0.240654 10.667183 10.667183
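As an aside, the four extractions above differ only in the raster they read; a generic helper (sketch) could collapse them into one loop:

def mean_raster_value_for_geometry(geometry, raster_path):
    # Same shape as the per-metric functions above, parameterized on the raster.
    stats = zonal_stats(
        [shapely.geometry.mapping(geometry)],
        raster_path,
        nodata=0,
        stats=["mean"]
    )
    return stats[0]["mean"]

for column, path in [
    ("MEAN_POPULATION_DENSITY", census_files["Population Density"]),
    ("MEAN_SEASONAL_HOUSING", census_files["Seasonal Housing"]),
    ("MEAN_HOUSING_UNITS", census_files["Housing Units"]),
    ("MEAN_OCCUPIED_HOUSING_UNITS", census_files["Occupied Housing Units"]),
]:
    census_data[column] = bins["GEOMETRY"]\
        .apply(mean_raster_value_for_geometry, raster_path=path)\
        .fillna(value=0)\
        .replace([-np.inf], 0)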

Vacant Housing Units

Let's compare overall Housing Units and Occupied Housing Units to create a feature around vacant housing units.

census_data["MEAN_VACANT_HOUSING_UNITS"] = census_data["MEAN_HOUSING_UNITS"] - census_data["MEAN_OCCUPIED_HOUSING_UNITS"]

Merge onto fires_df

Let's merge our Census features onto our main data frame.

for key, dataframe in fires_df.items():
    fires_df[key] = fires_df[key]\
        .drop([
            "MEAN_POPULATION_DENSITY",
            "MEAN_SEASONAL_HOUSING",
            "MEAN_HOUSING_UNITS",
            "MEAN_OCCUPIED_HOUSING_UNITS",
            "MEAN_VACANT_HOUSING_UNITS"
        ], axis=1, errors='ignore')
    # Merge from fires_df[key] rather than `dataframe` so that the drop above
    # isn't silently discarded on a re-run.
    fires_df[key] = fires_df[key].merge(census_data, on="H3", how="left")
    display(fires_df[key].sample(5))
STAT_CAUSE_CODE STAT_CAUSE_DESCR OWNER_CODE OWNER_DESCR DISCOVERY_DATE DISCOVERY_DOY DISCOVERY_TIME LATITUDE LONGITUDE STATE ... PRIMARY_LAND_COVER PRIMARY_LAND_COVER_CATEGORY PRIMARY_LAND_COVER_CATEGORY_CODE PRIMARY_LAND_COVER_CODE GEOMETRY MEAN_POPULATION_DENSITY MEAN_SEASONAL_HOUSING MEAN_HOUSING_UNITS MEAN_OCCUPIED_HOUSING_UNITS MEAN_VACANT_HOUSING_UNITS
190846 5.0 Debris Burning 14.0 MISSING/NOT SPECIFIED 2456724.5 67 None 31.565195 -94.676084 TX ... 42 forest 2 8 POLYGON ((-95.53469668000002 30.461285287, -95... -8.861519e+35 -1.189798e+36 -8.861519e+35 -8.861519e+35 0.0
392853 2.0 Equipment Use 8.0 PRIVATE 2450414.5 332 1510 32.732400 -82.682700 GA ... 42 forest 2 8 POLYGON ((-82.16565888299999 33.140867951, -82... 1.526929e+01 2.266859e-01 4.978497e+00 4.978497e+00 0.0
236781 7.0 Arson 8.0 PRIVATE 2452523.5 249 2022 34.023425 -85.147918 GA ... 41 forest 2 7 POLYGON ((-83.54737807800001 34.783248074, -83... 4.844742e+01 2.591060e-01 2.042438e+01 2.042438e+01 0.0
953893 6.0 Railroad 8.0 PRIVATE 2449374.5 22 1540 31.515200 -81.976300 GA ... 90 wetlands 8 14 POLYGON ((-81.13579223099998 29.521477031, -81... 1.595580e+01 3.028798e-01 4.464921e+00 4.464921e+00 0.0
762546 5.0 Debris Burning 8.0 PRIVATE 2449294.5 307 1200 31.806500 -81.653100 GA ... 90 wetlands 8 14 POLYGON ((-82.56952801199998 30.150289017, -82... 1.031592e+02 5.569562e-01 4.348338e+01 4.348338e+01 0.0

5 rows × 27 columns

STAT_CAUSE_CODE STAT_CAUSE_DESCR OWNER_CODE OWNER_DESCR DISCOVERY_DATE DISCOVERY_DOY DISCOVERY_TIME LATITUDE LONGITUDE STATE ... PRIMARY_LAND_COVER PRIMARY_LAND_COVER_CATEGORY PRIMARY_LAND_COVER_CATEGORY_CODE PRIMARY_LAND_COVER_CODE GEOMETRY MEAN_POPULATION_DENSITY MEAN_SEASONAL_HOUSING MEAN_HOUSING_UNITS MEAN_OCCUPIED_HOUSING_UNITS MEAN_VACANT_HOUSING_UNITS
316901 5.0 Debris Burning 14.0 MISSING/NOT SPECIFIED 2455874.5 313 1320 43.125472 -75.218345 NY ... 41 forest 2 7 POLYGON ((-74.05886090500002 42.510741585, -74... 54.448256 0.584297 21.798168 21.798168 0.0
173974 1.0 Lightning 5.0 USFS 2449234.5 247 2030 38.996667 -112.106667 UT ... 42 forest 2 8 POLYGON ((-112.634109074 36.682798716, -112.57... 0.000000 0.048080 0.048080 0.048080 0.0
234596 5.0 Debris Burning 14.0 MISSING/NOT SPECIFIED 2453825.5 90 None 34.279960 -78.734420 NC ... 82 planted_cultivated 4 13 POLYGON ((-77.93933100999999 35.494919175, -78... 34.872965 0.330970 15.151135 15.151135 0.0
180091 5.0 Debris Burning 8.0 PRIVATE 2456756.5 99 1635 32.981111 -81.753889 GA ... 82 planted_cultivated 4 13 POLYGON ((-83.64038666099998 31.710417745, -83... 8.222949 0.186019 3.652465 3.652465 0.0
266232 4.0 Campfire 5.0 USFS 2452048.5 139 1300 38.885000 -120.033333 CA ... 42 forest 2 8 POLYGON ((-116.705052106 39.732564363, -116.64... 0.000000 0.000000 0.000000 0.000000 0.0

5 rows × 27 columns

Conclusion

Well, that's all for now. In this blog post we engineered a handful of features, both using data already present in our base dataset and by incorporating new datasets. We geographically binned our datapoints, and used the geometry of each bin to extract statistics from land cover and census raster files.

There is a lot that can be done to improve this - if you think there's something that could be approached differently, I want to hear from you! Please reach out: andrewmahon@fastmail.com.

Next post we'll revisit our Exploratory Data Analysis with a focus on our new features.

Updates

2018/2/14 - initial post
2018/2/22 - fixed a couple of typos