Predicting Wildfires - Part 3 - Data Engineering

14 February, 2018

Now that we've prepared our environment, loaded our base data, and begun to understand it, let's turn our focus to engineering some new features.

Engineer Data

We'll start by engineering new features derived from our base data, and then fold in a couple of external data sources. In particular, let's build features that describe geographic properties of the areas where each fire began. Two factors integral to the occurrence of a wildfire, but only loosely captured in our base data, are land cover and human behavior.

Land cover plays a role in wildfire creation for a number of reasons. Vegetation can be more or less flammable depending on the time of year - coniferous forests become especially combustible during a hot, dry summer. Certain land cover classes also lend themselves to different activities - forestry won't be prevalent in grasslands. To capture this in our model, we'll engineer a new feature that describes the land cover in the area surrounding each fire. We'll base the feature on the Multi-Resolution Land Characteristics consortium's National Land Cover Database, and use a binning strategy to aggregate our base points geographically.

Land cover classes from the National Land Cover Database.

Human behavior also plays a large role in the occurrence of wildfires. As we saw during EDA, over 40% of wildfires occur as a result of Debris Burning or Arson -- both human behaviors. While a rough measure at best, the presence of humans and dwellings may correlate with the cause of wildfires - particularly those of human origin. We can use the US Census Grid, employing a similar binning strategy to the above, to engineer a couple of features around human presence in the areas around wildfires.

Population density from the US Census Grid.

Both of the above engineering tasks require binning. Originally, I had intended to use a geohash binning technique. At that point in the project, we were still looking at data from Alaska, and unfortunately at those latitudes geohash cells become quite distorted. To mitigate distortion in our bins, I switched to Uber's H3 binning system.
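To get a feel for that distortion, here's a back-of-the-envelope sketch (assuming a spherical earth; the numbers are illustrative only). A geohash cell spans a fixed slice of latitude and longitude, so its east-west ground width shrinks with the cosine of latitude while its north-south extent stays constant:

```python
import math

EARTH_RADIUS_KM = 6371.0

def lon_span_width_km(lon_span_deg, latitude_deg):
    # Ground width of a fixed longitude span at a given latitude.
    return math.radians(lon_span_deg) * EARTH_RADIUS_KM * math.cos(math.radians(latitude_deg))

# A 5-character geohash spans roughly 0.0439 degrees of longitude.
span = 0.0439
equator = lon_span_width_km(span, 0.0)
fairbanks = lon_span_width_km(span, 64.8)  # interior Alaska

print(round(equator, 2), round(fairbanks, 2))  # → 4.88 2.08
```

At Alaskan latitudes the same cell is less than half as wide as at the equator, so geohash bins there are tall, skinny slivers rather than roughly square tiles.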


A compact representation of California using Uber's H3 binning tool.

H3 provides tools to generate an H3 bin from a set of coordinates, and a boundary shape for an H3 bin. In conjunction, these functions allow us to assign a bin to each wildfire, and then use that bin's geography to extract geodata from the raster datasets.

In any case, let's dive in.

Quantify Data

A couple of fields in our reduced set are still non-numeric - in particular, the STATE field is categorical text. Let's convert it to a numeric label using sklearn.preprocessing.LabelEncoder.

label_encoder = sklearn.preprocessing.LabelEncoder()
for key, dataframe in fires_df.items():
    fires_df[key].loc[:,'STATE_CODE'] = label_encoder.fit_transform(dataframe['STATE'])

STAT_CAUSE_CODE is not zero-indexed - let's use our label encoder to add a new set of zero-based labels.

label_encoder = sklearn.preprocessing.LabelEncoder()
for key, dataframe in fires_df.items():
    fires_df[key].loc[:,'STAT_CAUSE_CODE_ZERO'] = label_encoder.fit_transform(dataframe['STAT_CAUSE_CODE'])
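Under the hood, LabelEncoder's fit_transform simply maps the sorted unique values to 0..n-1. A stdlib sketch of the same transform (illustrative only, not sklearn's actual implementation):

```python
# What LabelEncoder.fit_transform does, conceptually:
# sort the unique values, then replace each value with its index.
def fit_transform(values):
    classes = sorted(set(values))                  # LabelEncoder sorts its classes_
    index = {c: i for i, c in enumerate(classes)}
    return [index[v] for v in values]

states = ["GA", "AL", "TX", "GA", "NY"]
print(fit_transform(states))  # → [1, 0, 3, 1, 2]
```

This is why STAT_CAUSE_CODE_ZERO comes out zero-based: the sorted cause codes map onto consecutive integers starting at 0.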

Missing Data

None of the fields that we're concerned with have missing data. We'll skip this for now.

Discovery Week of Year

Let's create a feature that reduces the resolution of DISCOVERY_DOY to weeks.

for key, dataframe in fires_df.items():
    dataframe['DISCOVERY_WEEK'] = dataframe['DISCOVERY_DOY']\
        .apply(lambda x: math.floor(x/7) + 1)
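The mapping can be spot-checked in isolation. Note one quirk of this arithmetic: only days 1-6 land in week 1 (day 7 starts week 2), and day 365 lands in week 53.

```python
import math

def discovery_week(doy):
    # Same arithmetic as above: days 1-6 -> week 1, days 7-13 -> week 2, etc.
    return math.floor(doy / 7) + 1

print(discovery_week(1), discovery_week(236), discovery_week(365))  # → 1 34 53
```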

Climate Regions

Let's create a feature that abstracts the STATE column already in our data. We'll use NOAA's U.S. Climate Regions to group states into larger units that share certain environmental characteristics.

climate_regions = {
    "northwest": ["WA","OR","ID"],
    "west": ["CA","NV"],
    "southwest": ["UT","CO","AZ","NM"],
    "northern_rockies": ["MT","ND","SD","WY","NE"],
    "upper_midwest": ["KS","OK","TX","AR","LA","MS"],
    "south": ["MN","WI","MI","IA"],
    "ohio_valley": ["MO","IL","IN","OH","WV","KY","TN"],
    "southeast": ["VA","NC","SC","GA","AL","FL"],
    "northeast": ["ME","NH","VT","NY","PA","MA","RI","CT","NJ","DE","MD","DC"],
    "alaska": ["AK",],
    "hawaii": ["HI"],
    "puerto_rico": ["PR"]
}

state_region_mapping = {}
for region, region_states in climate_regions.items():
    for state in region_states:
        state_region_mapping[state] = region
        
for key, dataframe in fires_df.items():
    dataframe['CLIMATE_REGION'] = dataframe['STATE']\
        .apply(lambda x: state_region_mapping[x])
        
label_encoder = sklearn.preprocessing.LabelEncoder()
for key, dataframe in fires_df.items():
    dataframe['CLIMATE_REGION_CODE'] = label_encoder.fit_transform(dataframe['CLIMATE_REGION'])

H3 Bins

In preparation for extracting data from geo rasters and performing other geoanalysis, let's bin our wildfires using uber/h3. Binning helps mitigate the variable resolution of source data, reduces the computational complexity of feature engineering, and will hopefully help avoid overfitting.

Since there are no Python bindings for H3 - although they are forthcoming - I took the approach of calling the H3 binaries using Python's subprocess module. This works fairly well, especially when calls are batched.
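The batching pattern is the important part: write many input lines to the child process's stdin in a single call, and read one output line back per input line. A self-contained sketch of the pattern, using a tiny Python one-liner as a stand-in command (since the H3 binaries aren't assumed to be installed here):

```python
import subprocess
import sys

# Batched subprocess pattern: one process invocation handles many inputs.
# The stand-in command prefixes each stdin line, in place of geoToH3
# turning each "lat lon" line into an H3 index.
lines = ["33.2295 -82.7871", "31.0474 -85.8377"]
result = subprocess.run(
    [sys.executable, "-c",
     "import sys\nfor line in sys.stdin: print('bin:' + line.strip())"],
    input="\n".join(lines).encode(),
    stdout=subprocess.PIPE,
)
outputs = result.stdout.decode("utf-8").splitlines()
print(outputs)  # one output line per input line
```

Spawning one process per coordinate pair would dominate the runtime; amortizing the process startup over thousands of lines is what makes this approach viable.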

Calculate Bins for fires_df

Start by calculating a bin for each wildfire using geoToH3. We do this in batches to improve performance. We're using resolution 5, which yields hexagons with an average area of roughly 253 km².

| H3 Resolution | Average Hexagon Area (km²) | Average Hexagon Edge Length (km) | Number of Unique Indexes |
|---:|---:|---:|---:|
| 0 | 4,250,546.8477000 | 1,107.712591000 | 122 |
| 1 | 607,220.9782429 | 418.676005500 | 842 |
| 2 | 86,745.8540347 | 158.244655800 | 5,882 |
| 3 | 12,392.2648621 | 59.810857940 | 41,162 |
| 4 | 1,770.3235517 | 22.606379400 | 288,122 |
| 5 | 252.9033645 | 8.544408276 | 2,016,842 |
| 6 | 36.1290521 | 3.229482772 | 14,117,882 |
| 7 | 5.1612932 | 1.220629759 | 98,825,162 |
| 8 | 0.7373276 | 0.461354684 | 691,776,122 |
| 9 | 0.1053325 | 0.174375668 | 4,842,432,842 |
| 10 | 0.0150475 | 0.065907807 | 33,897,029,882 |
| 11 | 0.0021496 | 0.024910561 | 237,279,209,162 |
| 12 | 0.0003071 | 0.009415526 | 1,660,954,464,122 |
| 13 | 0.0000439 | 0.003559893 | 11,626,681,248,842 |
| 14 | 0.0000063 | 0.001348575 | 81,386,768,741,882 |
| 15 | 0.0000009 | 0.000509713 | 569,707,381,193,162 |

def h3_for_chunk(chunk, precision):
    lat_lon_lines = "\n".join([
        "{} {}".format(row['LATITUDE'], row['LONGITUDE']) for index,row in chunk.iterrows()
    ])
    h3 = subprocess.run(
        ['/tools/h3/bin/geoToH3', str(precision)],
        input=str.encode(lat_lon_lines),
        stdout=subprocess.PIPE).stdout.decode('utf-8').splitlines()
    return pd.DataFrame({
        "H3": h3
    }, index=chunk.index)

def chunker(seq, size):
    return (seq.iloc[pos:pos + size] for pos in range(0, len(seq), size))

for key in fires_df:
    print('Calculating bins for fires_df["{}"]'.format(key))
    h3_df = pd.DataFrame()
    with tqdm_notebook(total=fires_df[key].shape[0]) as pbar:
        for chunk in chunker(fires_df[key], 10000):
            h3_df = h3_df.append(h3_for_chunk(chunk, 5))
            pbar.update(chunk.shape[0])    
    fires_df[key].drop('H3', 1, inplace=True, errors='ignore')
    fires_df[key] = fires_df[key].merge(
        h3_df,
        how='left',
        left_index=True,
        right_index=True,
        copy=False
    )
    display(fires_df[key].sample(5))
Calculating bins for fires_df["train"]
STAT_CAUSE_CODE STAT_CAUSE_DESCR OWNER_CODE OWNER_DESCR DISCOVERY_DATE DISCOVERY_DOY DISCOVERY_TIME LATITUDE LONGITUDE STATE STATE_CODE STAT_CAUSE_CODE_ZERO DISCOVERY_WEEK CLIMATE_REGION CLIMATE_REGION_CODE H3
1254933 5.0 Debris Burning 8.0 PRIVATE 2452145.5 236 1405 33.229500 -82.787100 GA 9 4 34 southeast 5 8544c06bfffffff
1511283 7.0 Arson 14.0 MISSING/NOT SPECIFIED 2455642.5 81 1608 31.047471 -85.837722 AL 0 6 12 southeast 5 8544e18bfffffff
601688 5.0 Debris Burning 14.0 MISSING/NOT SPECIFIED 2452544.5 270 None 31.677220 -94.518610 TX 41 4 39 upper_midwest 7 85446bc7fffffff
1302225 7.0 Arson 14.0 MISSING/NOT SPECIFIED 2454212.5 112 1116 42.788903 -75.964293 NY 32 6 17 northeast 0 852b8b43fffffff
615075 5.0 Debris Burning 14.0 MISSING/NOT SPECIFIED 2453447.5 77 None 33.083330 -94.300550 TX 41 4 12 upper_midwest 7 85444d0bfffffff
Calculating bins for fires_df["test"]
STAT_CAUSE_CODE STAT_CAUSE_DESCR OWNER_CODE OWNER_DESCR DISCOVERY_DATE DISCOVERY_DOY DISCOVERY_TIME LATITUDE LONGITUDE STATE STATE_CODE STAT_CAUSE_CODE_ZERO DISCOVERY_WEEK CLIMATE_REGION CLIMATE_REGION_CODE H3
1017024 5.0 Debris Burning 14.0 MISSING/NOT SPECIFIED 2449820.5 103 None 35.708300 -81.561700 NC 25 4 15 southeast 5 8544db97fffffff
1145718 5.0 Debris Burning 8.0 PRIVATE 2449504.5 152 None 42.532810 -123.148780 OR 35 4 22 northwest 2 852818dbfffffff
85133 4.0 Campfire 5.0 USFS 2450280.5 198 1300 39.121667 -120.436667 CA 3 3 29 west 8 8528324ffffffff
1617233 7.0 Arson 8.0 PRIVATE 2455983.5 57 1315 35.597400 -82.625850 NC 25 6 9 southeast 5 8544d983fffffff
945685 8.0 Children 14.0 MISSING/NOT SPECIFIED 2450529.5 81 None 37.483300 -77.616700 VA 43 7 12 southeast 5 852a8dd7fffffff

Isolate Bins

Identify the unique bins in the wildfires dataframe and create a new GeoDataFrame, bins, containing only the bins. GeoDataFrame comes from the GeoPandas project, which adds a layer of geographic processing on top of Pandas. For our use, we're most concerned with the ability to store and plot geographic shapes. Under the covers, GeoPandas uses the Shapely library for geometric analysis.

bins = gpd.GeoDataFrame({
    'H3': np.unique(np.concatenate((fires_df["train"].H3.unique(), fires_df["test"].H3.unique())))
})

print(bins.info())
bins.sample(5)

RangeIndex: 25999 entries, 0 to 25998
Data columns (total 1 columns):
H3    25999 non-null object
dtypes: object(1)
memory usage: 203.2+ KB
None
H3
7418 8526eabbfffffff
20399 85445253fffffff
19060 852b1e3bfffffff
9713 85279033fffffff
3543 85266c37fffffff

Calculate Bounds

Calculate bounds for each bin in bins. This should be batched as well - it takes quite a while to run.

def h3_bounds_for_bin(h3_bin):
    h3_bounds = subprocess.run(
        ['/tools/h3/bin/h3ToGeoBoundary'],
        input=str.encode(h3_bin),
        stdout=subprocess.PIPE).stdout.decode('utf-8').splitlines()
    coords = []
    # Skip the CLI's header and footer lines; each remaining line is "lat lon".
    for coord in h3_bounds[2:-1]:
        lon_lat = coord.lstrip().split(" ")
        # Shapely wants (x, y) == (lon, lat); shift the CLI's longitudes into
        # negative (western hemisphere) degrees, since all our bins are in the US.
        coords.append(((float(lon_lat[1]) - 360.), float(lon_lat[0])))
    return shapely.geometry.Polygon(coords)

bins['GEOMETRY'] = bins['H3'].apply(h3_bounds_for_bin)

print(bins.info())
bins.sample(10)

RangeIndex: 27099 entries, 0 to 27098
Data columns (total 2 columns):
H3          27099 non-null object
GEOMETRY    27099 non-null object
dtypes: object(2)
memory usage: 423.5+ KB
None
H3 GEOMETRY
11178 85280243fffffff POLYGON ((-123.003254979 40.831762, -122.94977...
18725 852aa0affffffff POLYGON ((-77.10954865399998 41.466234529, -77...
10585 852793c3fffffff POLYGON ((-110.871721099 46.713100841, -110.80...
1670 8526242bfffffff POLYGON ((-95.19630366400003 46.585965214, -95...
19727 852af2affffffff POLYGON ((-76.58726582000003 36.178472701, -76...
19912 852b1857fffffff POLYGON ((-67.882502542 44.775920586, -68.0000...
24820 85488b83fffffff POLYGON ((-101.943425426 31.173151717, -101.87...
4800 85268eb3fffffff POLYGON ((-105.272718304 38.958513793, -105.20...
3552 85265ed3fffffff POLYGON ((-92.729148119 37.080579662, -92.6569...
26521 8548da13fffffff POLYGON ((-104.892143771 35.715818432, -104.82...

Now we have a geodataframe, bins, that contains all of the bins whose geography contains at least one wildfire.

Land Cover Classification

As discussed above, land cover has a direct effect on the occurrence of wildfires. To incorporate land cover into our model, we'll engineer a feature based on the DOI & USGS's Multi-Resolution Land Characteristics consortium's National Land Cover Database 2011. The data comes in a raster format with a 30 meter resolution; each pixel takes one of the 20 classes described in the table to the right. A Land Use And Land Cover Classification System For Use With Remote Sensor Data provides some interesting context for land cover classification.

We'll use Mapbox's rasterio, along with rasterstats, to extract land cover data for each of our bins. That data will come back as a dict containing a count of pixels for each land cover class. We'll transform that data into a format more useful for our model.

Using land cover data from 2011 alone threatens the value of this feature. While annual land cover data is not available, there is data from 2001 and 2006. An improvement would be to select from the three sets - 2001, 2006, or 2011 - as appropriate for each fire's discovery date.

We'll start by making a copy of our bins dataframe so that we can experiment without trashing upstream work.

bins_land_cover = bins.copy()

Let's define the path to our data file.

land_cover_path = "/data/188-million-us-wildfires/src/nlcd_2011_landcover_2011_edition_2014_10_10/nlcd_2011_landcover_2011_edition_2014_10_10.img"

Define Coordinate Transformation

Before we use rasterstats to count the values in each bin, we use pyproj to reproject each bin's geometry into the CRS of our data source.

project = functools.partial(
    pyproj.transform,
    pyproj.Proj(init='EPSG:4326'),
    pyproj.Proj('+proj=aea +lat_1=29.5 +lat_2=45.5 +lat_0=23 +lon_0=-96 +x_0=0 +y_0=0 +ellps=GRS80 +datum=NAD83 +units=m +no_defs'))

Raster Samples

Before we extract information from our source raster, let's preview the contents to sanity check.

land_cover_data = rasterio.open(land_cover_path)

def render_masked_bin(row):
    return rasterio.mask.mask(
        land_cover_data,
        [shapely.geometry.mapping(transform(project, row['GEOMETRY']))],
        crop=True
    )
    
masked_bin_images = bins_land_cover.sample(6).apply(render_masked_bin, 1).tolist()

fig, plots = plt.subplots(3, 2)
for index, val in enumerate(masked_bin_images):
    my_plot = plots[math.floor(index/2)][index % 2]
    rasterio.plot.show(val[0], ax=my_plot, transform=val[1], cmap='gist_earth')
    my_plot.axis('on')


Calculate Land Cover

Now that we've confirmed that the data we're masking from our source raster looks reasonable, let's get to extracting it. We'll employ zonal_stats with categorical=True - this returns a dict of pixel counts for each value found in the masked region.

land_cover_path = "/data/188-million-us-wildfires/src/nlcd_2011_landcover_2011_edition_2014_10_10/nlcd_2011_landcover_2011_edition_2014_10_10.img"


def land_cover_for_geometry(geometry):
    stats = zonal_stats(
        [shapely.geometry.mapping(transform(project, geometry))],
        land_cover_path,
        nodata=-999,
        categorical=True
    )
    return stats[0]

bins_land_cover = bins.copy()
bins_land_cover["LAND_COVER"] = bins_land_cover["GEOMETRY"].apply(land_cover_for_geometry)

print(bins_land_cover.info())
bins_land_cover.sample(5)

RangeIndex: 27099 entries, 0 to 27098
Data columns (total 3 columns):
H3            27099 non-null object
LAND_COVER    27099 non-null object
GEOMETRY      27099 non-null object
dtypes: object(3)
memory usage: 635.2+ KB
None
H3 LAND_COVER GEOMETRY
16644 8529ac17fffffff {'21': 8414, '22': 188, '23': 7, '31': 3385, '... POLYGON ((-119.761306269 34.848452511, -119.70...
6266 8526b46bfffffff {'11': 2814, '12': 448, '31': 1957, '41': 6709... POLYGON ((-109.733245976 43.359435343, -109.66...
25823 8548c8a3fffffff {'11': 62, '21': 946, '22': 1538, '23': 134, '... POLYGON ((-110.416144017 34.938354721, -110.35...
24302 8544f17bfffffff {'11': 977, '21': 24201, '22': 18629, '23': 42... POLYGON ((-82.27732916000002 31.188086238, -82...
8556 85270b1bfffffff {'11': 30430, '21': 6245, '22': 3832, '23': 11... POLYGON ((-93.92304508000001 47.431408525, -94...

Calculate Primary Class

Let's create a feature describing the primary land cover level II class for each bin.

def primary_land_cover(land_cover):
    return max(land_cover, key=land_cover.get)

bins_land_cover["PRIMARY_LAND_COVER"] = bins_land_cover["LAND_COVER"].apply(primary_land_cover)

print(bins_land_cover.info())
bins_land_cover.sample(5)

RangeIndex: 27099 entries, 0 to 27098
Data columns (total 4 columns):
H3                    27099 non-null object
LAND_COVER            27099 non-null object
GEOMETRY              27099 non-null object
PRIMARY_LAND_COVER    27099 non-null object
dtypes: object(4)
memory usage: 846.9+ KB
None
H3 LAND_COVER GEOMETRY PRIMARY_LAND_COVER
8427 85270837fffffff {'11': 21425, '21': 4510, '22': 857, '23': 235... POLYGON ((-92.87331802199998 48.073667863, -92... 90
18892 852aa3c3fffffff {'11': 3339, '21': 21052, '22': 7207, '23': 22... POLYGON ((-78.347508672 41.131021516, -78.4469... 41
24658 854883d7fffffff {'11': 177, '21': 8740, '22': 1837, '23': 61, ... POLYGON ((-99.61025040200002 28.581772583, -99... 52
26656 8548dcb7fffffff {'21': 1208, '22': 612, '23': 106, '31': 677, ... POLYGON ((-106.971115621 34.134266323, -106.90... 52
8010 8526e9c3fffffff {'11': 392, '21': 10943, '22': 2740, '23': 645... POLYGON ((-94.26269021799999 35.502182409, -94... 41

Categorize Primary Class

Let's group each bin's primary land cover class into its Level I class.

land_cover_categories = {
    "water": [11, 12],
    "developed": [21, 22, 23, 24],
    "barren": [31],
    "forest": [41, 42, 43],
    "shrubland": [51, 52],
    "herbaceous": [71, 72, 73, 74],
    "planted_cultivated": [81, 82],
    "wetlands": [90, 95]
}

land_cover_categories_map = {}
for category, values in land_cover_categories.items():
    for value in values:
        land_cover_categories_map[value] = category

def land_cover_category_for_class(land_cover_class):
    return land_cover_categories_map.get(int(land_cover_class), 'unknown')

bins_land_cover["PRIMARY_LAND_COVER_CATEGORY"] = bins_land_cover["PRIMARY_LAND_COVER"].apply(land_cover_category_for_class)

print(bins_land_cover.info())
bins_land_cover.sample(5)

RangeIndex: 27099 entries, 0 to 27098
Data columns (total 5 columns):
H3                             27099 non-null object
LAND_COVER                     27099 non-null object
GEOMETRY                       27099 non-null object
PRIMARY_LAND_COVER             27099 non-null object
PRIMARY_LAND_COVER_CATEGORY    27099 non-null object
dtypes: object(5)
memory usage: 1.0+ MB
None
H3 LAND_COVER GEOMETRY PRIMARY_LAND_COVER PRIMARY_LAND_COVER_CATEGORY
9264 85275093fffffff {'11': 14071, '21': 10506, '22': 254, '23': 19... POLYGON ((-90.94032905199998 46.288750521, -91... 90 wetlands
21200 85444d57fffffff {'11': 1358, '21': 11153, '22': 4413, '23': 17... POLYGON ((-94.09936341299999 32.872628429, -94... 42 forest
2062 85262e4bfffffff {'11': 4610, '21': 13418, '22': 1594, '23': 44... POLYGON ((-95.42043053700002 44.125116576, -95... 82 planted_cultivated
1727 8526250ffffffff {'11': 693, '21': 9651, '22': 2978, '23': 942,... POLYGON ((-94.09905895100002 46.105765621, -94... 81 planted_cultivated
12075 85283457fffffff {'11': 2115, '21': 14075, '22': 2299, '23': 57... POLYGON ((-121.731573885 37.047433286, -121.67... 43 forest

Encode

Employ LabelEncoder to assign a numeric value to each land cover class and category.

label_encoder = sklearn.preprocessing.LabelEncoder()
bins_land_cover.loc[:,'PRIMARY_LAND_COVER_CODE'] = label_encoder.fit_transform(bins_land_cover['PRIMARY_LAND_COVER'])

label_encoder = sklearn.preprocessing.LabelEncoder()
bins_land_cover.loc[:,'PRIMARY_LAND_COVER_CATEGORY_CODE'] = label_encoder.fit_transform(bins_land_cover['PRIMARY_LAND_COVER_CATEGORY'])
    
print(bins_land_cover.info())
bins_land_cover.sample(5)    

RangeIndex: 27099 entries, 0 to 27098
Data columns (total 7 columns):
H3                                  27099 non-null object
LAND_COVER                          27099 non-null object
GEOMETRY                            27099 non-null object
PRIMARY_LAND_COVER                  27099 non-null object
PRIMARY_LAND_COVER_CATEGORY         27099 non-null object
PRIMARY_LAND_COVER_CODE             27099 non-null int64
PRIMARY_LAND_COVER_CATEGORY_CODE    27099 non-null int64
dtypes: int64(2), object(5)
memory usage: 1.4+ MB
None
H3 LAND_COVER GEOMETRY PRIMARY_LAND_COVER PRIMARY_LAND_COVER_CATEGORY PRIMARY_LAND_COVER_CODE PRIMARY_LAND_COVER_CATEGORY_CODE
12157 8528806bfffffff {'11': 250, '31': 5, '41': 32, '42': 178500, '... POLYGON ((-115.591008981 44.7829381, -115.5268... 42 forest 8 2
22925 8544cb47fffffff {'11': 448, '21': 13178, '22': 5027, '23': 156... POLYGON ((-84.00949905599998 36.478531766, -83... 41 forest 7 2
13345 85289dc3fffffff {'11': 1694, '21': 3039, '22': 95, '23': 13, '... POLYGON ((-115.582331357 46.102164083, -115.51... 42 forest 8 2
12794 85289163fffffff {'11': 981, '21': 3125, '22': 225, '23': 21, '... POLYGON ((-113.068429643 45.777292588, -113.00... 42 forest 8 2
13156 85289a03fffffff {'11': 296, '31': 10575, '42': 187046, '43': 4... POLYGON ((-113.206564175 47.439544808, -113.13... 42 forest 8 2

Land Cover Percentage

After all of the above categorizing, let's create some continuous features. We'll add a new column for each land cover class describing what fraction of each bin that class covers. Since not every class appears in every bin, the result will have some null values - let's make sure to set those to 0.

drop_cols = [c for c in bins_land_cover.columns if c[:15] == "LAND_COVER_PCT_"]
bins_land_cover = bins_land_cover.drop(drop_cols, 1)

def land_cover_percentages(row):
    # Total pixel count across all land cover classes in this bin.
    total = sum(int(val) for val in row['LAND_COVER'].values())
    for lc_class, val in row['LAND_COVER'].items():
        row['LAND_COVER_PCT_{}'.format(lc_class)] = val / total
    return row

bins_land_cover = bins_land_cover.apply(land_cover_percentages, 1)

bins_land_cover.loc[
    :,[c for c in bins_land_cover.columns if c[:15] == "LAND_COVER_PCT_"]
] = bins_land_cover[[c for c in bins_land_cover.columns if c[:15] == "LAND_COVER_PCT_"]].fillna(0)

print(bins_land_cover.info())
bins_land_cover.sample(5)

RangeIndex: 27099 entries, 0 to 27098
Data columns (total 7 columns):
H3                                  27099 non-null object
LAND_COVER                          27099 non-null object
PRIMARY_LAND_COVER                  27099 non-null object
PRIMARY_LAND_COVER_CATEGORY         27099 non-null object
PRIMARY_LAND_COVER_CATEGORY_CODE    27099 non-null int64
PRIMARY_LAND_COVER_CODE             27099 non-null int64
GEOMETRY                            27099 non-null object
dtypes: int64(2), object(5)
memory usage: 1.4+ MB
None
H3 LAND_COVER PRIMARY_LAND_COVER PRIMARY_LAND_COVER_CATEGORY PRIMARY_LAND_COVER_CATEGORY_CODE PRIMARY_LAND_COVER_CODE GEOMETRY
18093 852a8b83fffffff {'11': 1779, '21': 8729, '22': 2660, '23': 769... 41 forest 2 7 POLYGON ((-78.600277807 37.661639755, -78.6924...
10184 852783b7fffffff {'11': 47, '21': 800, '22': 52, '31': 1537, '4... 71 herbaceous 3 11 POLYGON ((-103.574894942 45.204828189, -103.50...
26470 8548d84ffffffff {'11': 10, '21': 1263, '22': 75, '31': 23, '41... 42 forest 2 8 POLYGON ((-105.458776659 35.348940866, -105.39...
14779 8528f03bfffffff {'11': 20645, '21': 14749, '22': 7655, '23': 2... 43 forest 2 9 POLYGON ((-123.601736884 46.320872208, -123.54...
7857 8526e46ffffffff {'11': 1041, '21': 10898, '22': 1571, '23': 32... 82 planted_cultivated 4 13 POLYGON ((-94.99411935400002 38.81449329, -94....
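The per-bin transform can be sanity-checked in isolation with a stdlib-only version: the resulting fractions should always sum to 1. The counts dict below is made up, shaped like the dicts zonal_stats returns.

```python
# Stand-alone version of the per-bin transform: class pixel counts -> the
# fraction of the bin covered by each class.
def land_cover_pcts(counts):
    total = sum(int(v) for v in counts.values())
    return {'LAND_COVER_PCT_{}'.format(cls): v / total for cls, v in counts.items()}

# Hypothetical counts for one bin: mostly deciduous forest (41).
counts = {'41': 6000, '42': 3000, '90': 1000}
pcts = land_cover_pcts(counts)
print(pcts['LAND_COVER_PCT_41'])       # → 0.6
print(round(sum(pcts.values()), 10))   # → 1.0
```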

Merge onto fires_df

Merge our bins_land_cover onto our wildfire data keying on the H3 bin column.

for key, dataframe in fires_df.items():
    fires_df[key] = dataframe.merge(bins_land_cover, on="H3", how="left")
    display(fires_df[key].sample(5))
STAT_CAUSE_CODE STAT_CAUSE_DESCR OWNER_CODE OWNER_DESCR DISCOVERY_DATE DISCOVERY_DOY DISCOVERY_TIME LATITUDE LONGITUDE STATE ... DISCOVERY_WEEK CLIMATE_REGION CLIMATE_REGION_CODE H3 LAND_COVER PRIMARY_LAND_COVER PRIMARY_LAND_COVER_CATEGORY PRIMARY_LAND_COVER_CATEGORY_CODE PRIMARY_LAND_COVER_CODE GEOMETRY
324193 7.0 Arson 14.0 MISSING/NOT SPECIFIED 2451883.5 340 None 35.077500 -81.768100 SC ... 49 southeast 5 8544d8bbfffffff {'11': 3373, '21': 25890, '22': 7596, '23': 22... 41 forest 2 7 POLYGON ((-80.84926178299997 35.39708965, -80....
385381 1.0 Lightning 1.0 BLM 2451821.5 278 1250 36.167500 -115.538900 NV ... 40 west 8 852986b7fffffff {'11': 8, '21': 870, '22': 239, '23': 4, '31':... 52 shrubland 5 10 POLYGON ((-118.837357878 38.868465355, -118.78...
1010417 4.0 Campfire 5.0 USFS 2451812.5 269 1200 35.103056 -111.540833 AZ ... 39 southwest 6 8548cd2ffffffff {'11': 3208, '21': 1196, '22': 237, '23': 17, ... 42 forest 2 8 POLYGON ((-110.797864332 34.294343747, -110.73...
953964 2.0 Equipment Use 14.0 MISSING/NOT SPECIFIED 2453227.5 223 None 38.720000 -120.731944 CA ... 32 west 8 852832cffffffff {'11': 1645, '21': 14866, '22': 3334, '23': 56... 42 forest 2 8 POLYGON ((-122.09915726 37.494361459, -122.046...
827733 2.0 Equipment Use 8.0 PRIVATE 2452828.5 189 1204 42.854400 -112.013600 ID ... 28 northwest 2 8528b203fffffff {'21': 1209, '22': 1, '31': 6, '41': 34553, '4... 52 shrubland 5 10 POLYGON ((-111.859646108 42.561970602, -111.79...

5 rows × 22 columns

STAT_CAUSE_CODE STAT_CAUSE_DESCR OWNER_CODE OWNER_DESCR DISCOVERY_DATE DISCOVERY_DOY DISCOVERY_TIME LATITUDE LONGITUDE STATE ... DISCOVERY_WEEK CLIMATE_REGION CLIMATE_REGION_CODE H3 LAND_COVER PRIMARY_LAND_COVER PRIMARY_LAND_COVER_CATEGORY PRIMARY_LAND_COVER_CATEGORY_CODE PRIMARY_LAND_COVER_CODE GEOMETRY
269557 7.0 Arson 14.0 MISSING/NOT SPECIFIED 2451393.5 215 1532 34.683344 -89.124754 MS ... 31 upper_midwest 7 85445b7bfffffff {'11': 3514, '21': 9513, '22': 3877, '23': 124... 41 forest 2 7 POLYGON ((-90.98733658999998 32.724780019, -90...
689 2.0 Equipment Use 1.0 BLM 2448868.5 247 1047 34.700000 -117.167500 CA ... 36 west 8 8529a3d7fffffff {'21': 913, '22': 172, '23': 35, '31': 1615, '... 52 shrubland 5 10 POLYGON ((-116.50303393 32.759248771, -116.446...
141424 3.0 Smoking 14.0 MISSING/NOT SPECIFIED 2452647.5 8 None 29.166670 -82.200000 FL ... 2 southeast 5 8544f413fffffff {'11': 152, '21': 46492, '22': 25796, '23': 92... 42 forest 2 8 POLYGON ((-81.44679156199999 29.448254503, -81...
83447 1.0 Lightning 1.0 BLM 2453964.5 229 1412 42.964600 -118.326900 OR ... 33 northwest 2 8528aec3fffffff {'11': 2205, '31': 22, '42': 2988, '52': 27022... 52 shrubland 5 10 POLYGON ((-113.932881084 42.180656872, -113.86...
61411 7.0 Arson 12.0 MUNICIPAL/LOCAL 2452674.5 35 None 34.200500 -95.299833 OK ... 6 upper_midwest 7 8526cd83fffffff {'11': 396, '21': 8681, '22': 449, '23': 19, '... 41 forest 2 7 POLYGON ((-97.09674050699999 34.066314039, -97...

5 rows × 22 columns

At last, we have a handful of land cover features on our main dataframe. Let's move on to some other features!

Census Data

In addition to land cover, human behavior plays a large role in the creation of wildfires. Similarly to how we engineered features around land cover, we can bring in an external dataset to engineer some features around population and housing. Fortunately, it's possible to obtain 2010 US Census data in a raster format, allowing us to employ the same techniques that we did for land cover.


The US Census publishes a number of interesting metrics; for our purposes, we'll look at Population Density, Seasonal Housing, Housing Units, and Occupied Housing Units. For each of these, we'll take the mean value over each bin.

As with the land cover data, this method is flawed in its use of data from 2010 alone. Since the wildfires we're looking at occurred between 1995 and 2015, 2010 census data may not be highly relevant to some of the earlier or later datapoints.

As with land cover, start by creating a new dataframe to house our work with the census data.

census_data = pd.DataFrame(
    {"H3": bins['H3'].copy()},
    index=bins.index
)

Raster Samples

Before we extract information from our source raster, let's preview the contents to sanity check.

census_files = {
    "Population Density": "/data/188-million-us-wildfires/src/usgrid_data_2010/uspop10.tif",
    "Seasonal Housing": "/data/188-million-us-wildfires/src/usgrid_data_2010/ussea10.tif",
    "Housing Units": "/data/188-million-us-wildfires/src/usgrid_data_2010/ushu10.tif",
    "Occupied Housing Units": "/data/188-million-us-wildfires/src/usgrid_data_2010/usocc10.tif"
}

def render_bin(row):
    output = []
    for index, value in census_files.items():
        census_data = rasterio.open(value)
        output.append(rasterio.mask.mask(
            census_data,
            [shapely.geometry.mapping(row['GEOMETRY'])],
            crop=True,
            nodata=0
        ))
    return output

masked_bin_images = bins_land_cover.sample(2).apply(render_bin, 1).tolist()

fig, plots = plt.subplots(4, 2)
for index, val in enumerate(masked_bin_images):
    for subindex, image in enumerate(val):
        my_plot = plots[subindex][index]
        rasterio.plot.show(image[0], ax=my_plot, transform=image[1], cmap='inferno')


Looks reasonable - let's create our features!

Population Density

Starting with Population Density, use zonal_stats to extract the mean value for each bin, and add the values as a new column to our dataframe.

census_population_path = "/data/188-million-us-wildfires/src/usgrid_data_2010/uspop10.tif"

def mean_population_density_for_geometry(geometry):
    stats = zonal_stats(
        [shapely.geometry.mapping(geometry)],
        census_population_path,
        nodata=0,
        stats=["mean"]
    )
    return stats[0]["mean"]

census_data = census_data.drop("MEAN_POPULATION_DENSITY", 1, errors='ignore')
census_data["MEAN_POPULATION_DENSITY"] = \
    bins["GEOMETRY"].apply(mean_population_density_for_geometry).fillna(value=0).replace([-np.inf], 0)
    
print(census_data.info())
census_data.sample(5)

RangeIndex: 27099 entries, 0 to 27098
Data columns (total 2 columns):
H3                         27099 non-null object
MEAN_POPULATION_DENSITY    27099 non-null float64
dtypes: float64(1), object(1)
memory usage: 423.5+ KB
None
H3 MEAN_POPULATION_DENSITY
14506 8528d243fffffff -9.426104e+35
16105 85299d47fffffff 0.000000e+00
11505 85281523fffffff 1.346886e+00
11892 8528311bfffffff 0.000000e+00
182 8512da2bfffffff 0.000000e+00

Seasonal Housing

census_seasonal_path = "/data/188-million-us-wildfires/src/usgrid_data_2010/ussea10.tif"

def mean_seasonal_housing_for_geometry(geometry):
    stats = zonal_stats(
        [shapely.geometry.mapping(geometry)],
        census_seasonal_path,
        nodata=0,
        stats=["mean"]
    )
    return stats[0]["mean"]

census_data = census_data.drop("MEAN_SEASONAL_HOUSING", 1, errors='ignore')
census_data["MEAN_SEASONAL_HOUSING"] = \
    bins["GEOMETRY"].apply(mean_seasonal_housing_for_geometry).fillna(value=0).replace([-np.inf], 0)
    
print(census_data.info())
census_data.sample(5)

RangeIndex: 27099 entries, 0 to 27098
Data columns (total 3 columns):
H3                         27099 non-null object
MEAN_POPULATION_DENSITY    27099 non-null float64
MEAN_SEASONAL_HOUSING      27099 non-null float64
dtypes: float64(2), object(1)
memory usage: 635.2+ KB
None
H3 MEAN_POPULATION_DENSITY MEAN_SEASONAL_HOUSING
23525 8544dd5bfffffff 37.240550 0.248122
19559 852ad463fffffff 27.189120 0.147108
9871 8527704ffffffff 0.000000 0.000000
13150 85289a8ffffffff 1.460237 0.939296
11226 8528063bfffffff 1.233903 0.260069

Housing Units

census_housing_units = "/data/188-million-us-wildfires/src/usgrid_data_2010/ushu10.tif"

def mean_housing_units_for_geometry(geometry):
    stats = zonal_stats(
        [shapely.geometry.mapping(geometry)],
        census_housing_units,
        nodata=0,
        stats=["mean"]
    )
    return stats[0]["mean"]

census_data = census_data.drop("MEAN_HOUSING_UNITS", 1, errors='ignore')
census_data["MEAN_HOUSING_UNITS"] = \
    bins["GEOMETRY"].apply(mean_housing_units_for_geometry).fillna(value=0).replace([-np.inf], 0)
    
print(census_data.info())
census_data.sample(5)

RangeIndex: 27099 entries, 0 to 27098
Data columns (total 4 columns):
H3                         27099 non-null object
MEAN_POPULATION_DENSITY    27099 non-null float64
MEAN_SEASONAL_HOUSING      27099 non-null float64
MEAN_HOUSING_UNITS         27099 non-null float64
dtypes: float64(3), object(1)
memory usage: 846.9+ KB
None
H3 MEAN_POPULATION_DENSITY MEAN_SEASONAL_HOUSING MEAN_HOUSING_UNITS
17269 852a1613fffffff 4.882109 1.034020 3.289251
3010 8526518bfffffff 7.562320 0.087591 3.296344
13867 8528ab9bfffffff 0.030536 0.058152 0.032559
17224 852a1483fffffff 441.285687 1.192067 183.367477
8780 8527402bfffffff 0.000000 0.000000 0.000000

Occupied Housing Units

census_occupied_housing_units = "/data/188-million-us-wildfires/src/usgrid_data_2010/usocc10.tif"

def mean_occupied_housing_units_for_geometry(geometry):
    stats = zonal_stats(
        [shapely.geometry.mapping(geometry)],
        census_occupied_housing_units,
        nodata=0,
        stats=["mean"]
    )
    return stats[0]["mean"]

census_data = census_data.drop("MEAN_OCCUPIED_HOUSING_UNITS", axis=1, errors='ignore')
census_data["MEAN_OCCUPIED_HOUSING_UNITS"] = \
    bins["GEOMETRY"]\
        .apply(mean_occupied_housing_units_for_geometry)\
        .fillna(value=0)\
        .replace([-np.inf], 0)
        
print(census_data.info())
census_data.sample(5)

RangeIndex: 27099 entries, 0 to 27098
Data columns (total 5 columns):
H3                             27099 non-null object
MEAN_POPULATION_DENSITY        27099 non-null float64
MEAN_SEASONAL_HOUSING          27099 non-null float64
MEAN_HOUSING_UNITS             27099 non-null float64
MEAN_OCCUPIED_HOUSING_UNITS    27099 non-null float64
dtypes: float64(4), object(1)
memory usage: 1.0+ MB
None
H3 MEAN_POPULATION_DENSITY MEAN_SEASONAL_HOUSING MEAN_HOUSING_UNITS MEAN_OCCUPIED_HOUSING_UNITS
23738 8544e2abfffffff 25.395138 0.242860 10.882212 10.882212
21173 85444d37fffffff 0.000000 0.000000 0.000000 0.000000
4213 85268093fffffff 0.060832 0.141386 0.169451 0.169451
8087 8526ec57fffffff 1.851572 0.138690 0.763857 0.763857
5865 8526aa47fffffff 29.909779 0.240654 10.667183 10.667183
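The three helpers above differ only in the raster they read. A hedged refactor could collapse them into one factory. This is a sketch, not the post's code: `zonal_stats_fn` is injected so the example stays self-contained; in the post it would be `rasterstats.zonal_stats` applied to `shapely.geometry.mapping(geometry)`.

```python
def make_mean_for_geometry(raster_path, zonal_stats_fn):
    """Build a mean-extractor bound to one raster.

    zonal_stats_fn is any callable with a rasterstats.zonal_stats-style
    signature: (geometries, raster_path, nodata=..., stats=[...]).
    """
    def mean_for_geometry(geometry):
        stats = zonal_stats_fn([geometry], raster_path, nodata=0, stats=["mean"])
        return stats[0]["mean"]
    return mean_for_geometry
```

With this in place, each census feature is one line, e.g. `make_mean_for_geometry(census_seasonal_path, zonal_stats)`, rather than a copy-pasted function.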

Vacant Housing Units

Let's subtract Occupied Housing Units from overall Housing Units to derive a feature for vacant housing units.

census_data["MEAN_VACANT_HOUSING_UNITS"] = census_data["MEAN_HOUSING_UNITS"] - census_data["MEAN_OCCUPIED_HOUSING_UNITS"]
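As a sanity check on that subtraction, here's a small sketch with toy values (hypothetical, not from the dataset); the `.clip(lower=0)` guard is an added assumption to absorb any floating-point or nodata artifacts that would push the difference below zero.

```python
import pandas as pd

# Toy frame standing in for census_data (hypothetical values).
demo = pd.DataFrame({
    "MEAN_HOUSING_UNITS": [10.88, 0.00, 3.29],
    "MEAN_OCCUPIED_HOUSING_UNITS": [10.88, 0.00, 3.28],
})

# The difference of two estimated means can dip slightly below zero,
# so clamp at zero (an added guard, not in the post's code).
demo["MEAN_VACANT_HOUSING_UNITS"] = (
    demo["MEAN_HOUSING_UNITS"] - demo["MEAN_OCCUPIED_HOUSING_UNITS"]
).clip(lower=0)
```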

Merge onto fires_df

Let's merge our Census features onto our main data frame.

for key, dataframe in fires_df.items():
    fires_df[key] = dataframe\
        .drop([
            "MEAN_POPULATION_DENSITY",
            "MEAN_SEASONAL_HOUSING",
            "MEAN_HOUSING_UNITS",
            "MEAN_OCCUPIED_HOUSING_UNITS",
            "MEAN_VACANT_HOUSING_UNITS"
        ], axis=1, errors='ignore')\
        .merge(census_data, on="H3", how="left")
    display(fires_df[key].sample(5))
STAT_CAUSE_CODE STAT_CAUSE_DESCR OWNER_CODE OWNER_DESCR DISCOVERY_DATE DISCOVERY_DOY DISCOVERY_TIME LATITUDE LONGITUDE STATE ... PRIMARY_LAND_COVER PRIMARY_LAND_COVER_CATEGORY PRIMARY_LAND_COVER_CATEGORY_CODE PRIMARY_LAND_COVER_CODE GEOMETRY MEAN_POPULATION_DENSITY MEAN_SEASONAL_HOUSING MEAN_HOUSING_UNITS MEAN_OCCUPIED_HOUSING_UNITS MEAN_VACANT_HOUSING_UNITS
190846 5.0 Debris Burning 14.0 MISSING/NOT SPECIFIED 2456724.5 67 None 31.565195 -94.676084 TX ... 42 forest 2 8 POLYGON ((-95.53469668000002 30.461285287, -95... -8.861519e+35 -1.189798e+36 -8.861519e+35 -8.861519e+35 0.0
392853 2.0 Equipment Use 8.0 PRIVATE 2450414.5 332 1510 32.732400 -82.682700 GA ... 42 forest 2 8 POLYGON ((-82.16565888299999 33.140867951, -82... 1.526929e+01 2.266859e-01 4.978497e+00 4.978497e+00 0.0
236781 7.0 Arson 8.0 PRIVATE 2452523.5 249 2022 34.023425 -85.147918 GA ... 41 forest 2 7 POLYGON ((-83.54737807800001 34.783248074, -83... 4.844742e+01 2.591060e-01 2.042438e+01 2.042438e+01 0.0
953893 6.0 Railroad 8.0 PRIVATE 2449374.5 22 1540 31.515200 -81.976300 GA ... 90 wetlands 8 14 POLYGON ((-81.13579223099998 29.521477031, -81... 1.595580e+01 3.028798e-01 4.464921e+00 4.464921e+00 0.0
762546 5.0 Debris Burning 8.0 PRIVATE 2449294.5 307 1200 31.806500 -81.653100 GA ... 90 wetlands 8 14 POLYGON ((-82.56952801199998 30.150289017, -82... 1.031592e+02 5.569562e-01 4.348338e+01 4.348338e+01 0.0

5 rows × 27 columns
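Note the enormous negative values (on the order of -1e36) in the first sample's census columns -- they look like a raster nodata sentinel slipping past the `-np.inf` replacement. A hedged cleanup sketch, under the assumption (mine, not the post's) that these census rasters never yield a legitimately negative mean:

```python
import pandas as pd

# Toy column containing a nodata sentinel (hypothetical values).
density = pd.Series([-8.861519e+35, 15.27, 48.45, 0.0])

# Zero out anything negative, assuming negative means can only come
# from nodata sentinels in these rasters.
density = density.where(density >= 0, 0.0)
```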

STAT_CAUSE_CODE STAT_CAUSE_DESCR OWNER_CODE OWNER_DESCR DISCOVERY_DATE DISCOVERY_DOY DISCOVERY_TIME LATITUDE LONGITUDE STATE ... PRIMARY_LAND_COVER PRIMARY_LAND_COVER_CATEGORY PRIMARY_LAND_COVER_CATEGORY_CODE PRIMARY_LAND_COVER_CODE GEOMETRY MEAN_POPULATION_DENSITY MEAN_SEASONAL_HOUSING MEAN_HOUSING_UNITS MEAN_OCCUPIED_HOUSING_UNITS MEAN_VACANT_HOUSING_UNITS
316901 5.0 Debris Burning 14.0 MISSING/NOT SPECIFIED 2455874.5 313 1320 43.125472 -75.218345 NY ... 41 forest 2 7 POLYGON ((-74.05886090500002 42.510741585, -74... 54.448256 0.584297 21.798168 21.798168 0.0
173974 1.0 Lightning 5.0 USFS 2449234.5 247 2030 38.996667 -112.106667 UT ... 42 forest 2 8 POLYGON ((-112.634109074 36.682798716, -112.57... 0.000000 0.048080 0.048080 0.048080 0.0
234596 5.0 Debris Burning 14.0 MISSING/NOT SPECIFIED 2453825.5 90 None 34.279960 -78.734420 NC ... 82 planted_cultivated 4 13 POLYGON ((-77.93933100999999 35.494919175, -78... 34.872965 0.330970 15.151135 15.151135 0.0
180091 5.0 Debris Burning 8.0 PRIVATE 2456756.5 99 1635 32.981111 -81.753889 GA ... 82 planted_cultivated 4 13 POLYGON ((-83.64038666099998 31.710417745, -83... 8.222949 0.186019 3.652465 3.652465 0.0
266232 4.0 Campfire 5.0 USFS 2452048.5 139 1300 38.885000 -120.033333 CA ... 42 forest 2 8 POLYGON ((-116.705052106 39.732564363, -116.64... 0.000000 0.000000 0.000000 0.000000 0.0

5 rows × 27 columns
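Because this is a left join, any fire whose H3 bin is missing from `census_data` ends up with NaN in every census column. A quick way to surface those rows is pandas' merge `indicator` flag; this sketch uses toy frames with hypothetical H3 ids.

```python
import pandas as pd

# Toy stand-ins for one fires_df entry and census_data (hypothetical ids).
fires = pd.DataFrame({"H3": ["85a", "85b", "85c"], "STATE": ["TX", "GA", "CA"]})
census = pd.DataFrame({"H3": ["85a", "85c"], "MEAN_POPULATION_DENSITY": [15.3, 0.0]})

# indicator=True adds a _merge column flagging rows with no census match.
merged = fires.merge(census, on="H3", how="left", indicator=True)
unmatched = merged[merged["_merge"] == "left_only"]
```

Rows flagged `left_only` are bins that need either a fill strategy or a second look at the binning.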

Conclusion

Well, that's all for now. In this blog post we engineered a handful of features, both by deriving them from data already present in our base dataset and by incorporating new external datasets. We geographically binned our datapoints, then used the geometry of each bin to extract statistics from land cover and census raster files.

There is a lot that can be done to improve this - if you think there's something that could be approached differently, I want to hear from you! Please reach out: andrewmahon@fastmail.com.

Next post we'll revisit our Exploratory Data Analysis with a focus on our new features.

Updates

2018/2/14 - initial post

2018/2/22 - fixed a couple of typos