Session 1: Importing various spatio-temporal data formats

H. Sherry Zhang and Prof Di Cook

2022-12-06

Roadmap

  • Spatial/ spatio-temporal data formats
    • shape files
    • tabular data
    • NetCDF
  • 3 lessons I learnt from working on spatio-temporal data
    • rendering geom_sf takes too long
    • Error ! st_crs(x) == st_crs(y)
    • duplicates, duplicates, duplicates

Spatio-temporal data

People can mean a whole range of different things when they simply refer to their data as spatio-temporal!

The focus of today will be mostly on vector data, but we will also briefly touch on raster data.

Data formats

  • Shape files
  • Tabular data:
    • table with longitude, latitude columns
    • long/ wide table of time series
  • NetCDF data

Spatial vector data: shape files (.shp)

The function sf::st_read() will read a shape file into a simple feature object.
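As a quick sketch, sf ships with a small demo shapefile (North Carolina counties) that can be read the same way; substitute the path to your own .shp file in practice:

```r
library(sf)

# Read the demo shapefile bundled with sf into a simple feature object
nc <- st_read(system.file("shape/nc.shp", package = "sf"))
class(nc)  # "sf" "data.frame"
```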

This is what a simple feature object looks like for the Australia map data:

ozmaps::abs_ste
Simple feature collection with 9 features and 1 field
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: 96.81703 ymin: -43.65856 xmax: 167.9969 ymax: -9.219937
Geodetic CRS:  GDA94
                          NAME                       geometry
1              New South Wales MULTIPOLYGON (((151.1741 -3...
2                     Victoria MULTIPOLYGON (((146.7456 -3...
3                   Queensland MULTIPOLYGON (((148.6266 -2...
4              South Australia MULTIPOLYGON (((133.1319 -3...
5            Western Australia MULTIPOLYGON (((114.4178 -2...
6                     Tasmania MULTIPOLYGON (((148.3115 -4...
7           Northern Territory MULTIPOLYGON (((136.5921 -1...
8 Australian Capital Territory MULTIPOLYGON (((149.2317 -3...
9            Other Territories MULTIPOLYGON (((96.87821 -1...

Plot it with geom_sf()

ozmaps::abs_ste %>% 
  ggplot() + 
  geom_sf(color = "white", fill = "grey90") + 
  theme_void()

Tabular data

Spatial table with long, lat columns:

stations
# A tibble: 30 × 6
  id            lat  long  elev name                       wmo_id
  <chr>       <dbl> <dbl> <dbl> <chr>                       <dbl>
1 ASN00060139 -31.4  153.   4.2 port macquarie airport aws  94786
2 ASN00068228 -34.4  151.  10   bellambi aws                94749
3 ASN00017123 -28.1  140.  37.8 moomba airport              95481
4 ASN00081049 -36.4  145. 114   tatura inst sustainable ag  95836
5 ASN00018201 -32.5  138.  14   port augusta aero           95666
6 ASN00012314 -27.8  121. 497   leinster aero               95448
7 ASN00036026 -24.3  144. 203   isisford post office        94345
# … with 23 more rows

long table of time series:

ts
# A tibble: 10,632 × 5
  id          date        prcp  tmax  tmin
  <chr>       <date>     <dbl> <dbl> <dbl>
1 ASN00003057 2020-01-01     0  36.7  26.9
2 ASN00003057 2020-01-02    41  34.2  24  
3 ASN00003057 2020-01-03     0  35    25.4
4 ASN00003057 2020-01-04    40  29.1  25.4
5 ASN00003057 2020-01-05  1640  27.3  24.3
6 ASN00003057 2020-01-06  1832  29.2  23.2
7 ASN00003057 2020-01-07    42  36    21  
# … with 10,625 more rows

Add stations on the base map:

ozmaps::abs_ste %>% 
  ggplot() + 
  geom_sf(color = "white", fill = "grey90") + 
  geom_point(data = stations, aes(x = long, y = lat)) + 
  theme_void()
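A plain table with longitude/latitude columns can also be promoted to an sf object with st_as_sf(). A minimal sketch, using a two-row stand-in for the stations tibble above:

```r
library(sf)
library(tibble)

# A two-row stand-in for the stations tibble shown above
stations <- tibble(
  id   = c("ASN00060139", "ASN00068228"),
  lat  = c(-31.4, -34.4),
  long = c(152.9, 150.9)
)

# Promote the long/lat columns to a POINT geometry column;
# remove = FALSE keeps the original columns alongside the geometry
stations_sf <- st_as_sf(stations, coords = c("long", "lat"), remove = FALSE)
```

Note that no crs argument is supplied here, so the resulting object carries no CRS, a detail that comes back to bite us in Moment 2 below.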

Spatial raster data: NetCDF (.nc)

The Climate and Forecast (CF) Metadata Conventions prescribe how metadata should be documented for NetCDF objects:

path <- system.file("ncdf/era5-pressure.nc", package = "cubble")
(era5_ncdf <- ncdf4::nc_open(path))
File /Library/Frameworks/R.framework/Versions/4.1/Resources/library/cubble/ncdf/era5-pressure.nc (NC_FORMAT_64BIT):

     2 variables (excluding dimension variables):
        short q[longitude,latitude,time]   
            scale_factor: 2.09848696659051e-11
            add_offset: 3.16766314740189e-06
            _FillValue: -32767
            missing_value: -32767
            units: kg kg**-1
            long_name: Specific humidity
            standard_name: specific_humidity
        short z[longitude,latitude,time]   
            scale_factor: 0.154814177589917
            add_offset: 306067.078842911
            _FillValue: -32767
            missing_value: -32767
            units: m**2 s**-2
            long_name: Geopotential
            standard_name: geopotential

     3 dimensions:
        longitude  Size:161 
            units: degrees_east
            long_name: longitude
        latitude  Size:165 
            units: degrees_north
            long_name: latitude
        time  Size:6 
            units: hours since 1900-01-01 00:00:00.0
            long_name: time
            calendar: gregorian

    2 global attributes:
        Conventions: CF-1.6
        history: 2022-04-17 01:14:41 GMT by grib_to_netcdf-2.24.3: /opt/ecmwf/mars-client/bin/grib_to_netcdf -S param -o /cache/data8/adaptor.mars.internal-1650158081.486927-5264-6-90c05c05-ca42-4b7a-b6f8-d9617d0f0500.nc /cache/tmp/90c05c05-ca42-4b7a-b6f8-d9617d0f0500-adaptor.mars.internal-1650158080.1120768-5264-12-tmp.grib

Extract variables & dimensions

q <- ncdf4::ncvar_get(era5_ncdf, "q") 
class(q)
[1] "array"
dim(q)
[1] 161 165   6
long <- ncdf4::ncvar_get(era5_ncdf, "longitude")
length(long)
[1] 161
head(long)
[1] 113.00 113.25 113.50 113.75 114.00 114.25
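The extracted array can be flattened into a long table by crossing the dimension values. A sketch, reusing the file from above (as.vector() unrolls an R array with its first index varying fastest, which matches the order expand.grid() produces):

```r
library(ncdf4)

path <- system.file("ncdf/era5-pressure.nc", package = "cubble")
era5_ncdf <- nc_open(path)

q    <- ncvar_get(era5_ncdf, "q")
long <- ncvar_get(era5_ncdf, "longitude")
lat  <- ncvar_get(era5_ncdf, "latitude")

# time is stored as "hours since 1900-01-01 00:00:00.0"
time_raw <- ncvar_get(era5_ncdf, "time")
time <- as.POSIXct(time_raw * 3600, origin = "1900-01-01", tz = "UTC")

# expand.grid() varies its first argument fastest, matching how
# as.vector() unrolls the [longitude, latitude, time] array
q_long <- expand.grid(long = long, lat = lat, time = time)
q_long$q <- as.vector(q)
```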

Three moments when I scratched my head while wrangling and visualising spatio-temporal data

Moment 1: when waiting for a simple geom_sf to render …

Moment 1

oz <- ozmaps::abs_ste
ggplot() +
  geom_sf(data = oz)

       step user.self sys.self elapsed
1 construct     0.002    0.000   0.002
2     build     0.093    0.005   0.098
3    render     0.063    0.005   0.070
4      draw     4.921    0.144   5.289
5     TOTAL     5.079    0.154   5.459
oz_simp <- oz %>% 
  rmapshaper::ms_simplify(keep = 0.05) # or 
# sf::st_simplify(dTolerance = 4000)
ggplot() + geom_sf(data = oz_simp)

       step user.self sys.self elapsed
1 construct     0.002    0.000   0.002
2     build     0.074    0.003   0.076
3    render     0.034    0.001   0.034
4      draw     1.496    0.575   3.089
5     TOTAL     1.606    0.579   3.201

Reflection

  • This is a simplified example; when working with real data, rendering can take hours!

  • Making sure your polygons are of a reasonable size also saves you time on geocomputation:

    • e.g. st_union() combines feature geometries and resolves shared boundaries (if two polygons share a boundary, they will be merged into one)
    • imagine throwing thousands of unsimplified polygons at it and wondering why it takes so long…
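A sketch of the union step on the simplified map, assuming oz_simp from the earlier slide:

```r
library(sf)

# Simplify first (as on the previous slide), then dissolve the
# state polygons into a single national outline; shared state
# boundaries are merged away in the result
oz_simp <- rmapshaper::ms_simplify(ozmaps::abs_ste, keep = 0.05)
oz_outline <- st_union(oz_simp)
```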

Moment 2: when filtering an sf object based on another …

Moment 2

stations_sf
Simple feature collection with 30 features and 6 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 114.0967 ymin: -41.8739 xmax: 152.8655 ymax: -11.6502
CRS:           NA
# A tibble: 30 × 7
  id            lat  long  elev name            wmo_id            geometry
  <chr>       <dbl> <dbl> <dbl> <chr>            <dbl>             <POINT>
1 ASN00060139 -31.4  153.   4.2 port macquarie…  94786 (152.8655 -31.4336)
2 ASN00068228 -34.4  151.  10   bellambi aws     94749 (150.9291 -34.3691)
3 ASN00017123 -28.1  140.  37.8 moomba airport   95481 (140.1956 -28.0997)
4 ASN00081049 -36.4  145. 114   tatura inst su…  95836 (145.2672 -36.4378)
5 ASN00018201 -32.5  138.  14   port augusta a…  95666 (137.7144 -32.5091)
6 ASN00012314 -27.8  121. 497   leinster aero    95448 (120.7031 -27.8386)
7 ASN00036026 -24.3  144. 203   isisford post …  94345 (144.4406 -24.2591)
# … with 23 more rows

Moment 2

victoria <- ozmaps::abs_ste %>% 
  filter(NAME == "Victoria")
stations_sf %>% 
  st_filter(victoria) %>% 
  nrow()
Error in `stopifnot()`:
! Can't compute `..1 = lengths(.predicate(x, y, ...)) > 0`.
Caused by error in `st_geos_binop()`:
! st_crs(x) == st_crs(y) is not TRUE
  • ozmaps::abs_ste
    • Geodetic CRS: GDA94
  • stations_sf
    • CRS: NA
# option 1
st_crs(stations_sf) <- 
  st_crs(victoria)

# option 2
stations_sf <- RAW_DATA %>% 
  st_as_sf(
    crs = st_crs(victoria), ...
    )

# option 3
stations_sf <- stations_sf %>% 
  st_set_crs(st_crs(victoria))
stations_sf %>% 
  st_filter(victoria) %>% 
  nrow()
[1] 4

Reflection

  • The problem itself is not difficult to solve since the error message is informative, but behind it lies the vast world of map projections:

    • unprojected maps (long/lat) vs. projected maps
    • we tend to get spatial data in long/lat form, but need to remember to switch to a projected map when operations involve computing distances and areas

    try st_crs(ozmaps::abs_ste) if you’re brave
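A sketch of switching to a projected CRS before an area computation; EPSG:3577 (GDA94 / Australian Albers) is one reasonable equal-area choice for Australia:

```r
library(sf)
library(dplyr)

victoria <- ozmaps::abs_ste %>% filter(NAME == "Victoria")

# Transform from geodetic long/lat (GDA94) to a projected,
# equal-area CRS before measuring
vic_proj <- st_transform(victoria, crs = 3577)
st_area(vic_proj)  # area in m^2
```

(Recent sf versions can also compute areas on geodetic coordinates via the s2 library, but projecting first makes the units and behaviour explicit.)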

Reflection Cont.

Moment 3: when creating a tsibble out of my own data …

Moment 3

set.seed(123)
(harvest <- tibble(
   year = c(2010, 2011, 2013, 
            2011, 2012, 2014, 2014),
   fruit = c(rep(c("kiwi", "cherry"), 
                 each = 3),
             "cherry"),
   kilo = sample(1:10, size = 7)
 ))
# A tibble: 7 × 3
   year fruit   kilo
  <dbl> <chr>  <int>
1  2010 kiwi       3
2  2011 kiwi      10
3  2013 kiwi       2
4  2011 cherry     8
5  2012 cherry     6
6  2014 cherry     9
7  2014 cherry     1
harvest %>% 
  as_tsibble(
    key = fruit, 
    index = year)
Error in `validate_tsibble()`:
! A valid tsibble must have distinct rows identified by key and index.
ℹ Please use `duplicates()` to check the duplicated rows.

Data cleaning first :)

harvest %>% 
  duplicates(key = fruit, 
             index = year)
# A tibble: 2 × 3
   year fruit   kilo
  <dbl> <chr>  <int>
1  2014 cherry     9
2  2014 cherry     1
harvest %>% 
  dplyr::filter(!(year == 2014 & 
             fruit == "cherry" & 
             kilo == 9)) %>% 
  as_tsibble(
    key = fruit, 
    index = year)
# A tsibble: 6 x 3 [1Y]
# Key:       fruit [2]
   year fruit   kilo
  <dbl> <chr>  <int>
1  2011 cherry     8
2  2012 cherry     6
3  2014 cherry     1
4  2010 kiwi       3
5  2011 kiwi      10
6  2013 kiwi       2
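Filtering out one row is one option; depending on the data, it may make more sense to aggregate the duplicated key/index pairs before casting. A sketch, rebuilding the harvest tibble from above:

```r
library(dplyr)
library(tsibble)

# The harvest tibble from above (seed kept for reproducibility)
set.seed(123)
harvest <- tibble(
  year  = c(2010, 2011, 2013, 2011, 2012, 2014, 2014),
  fruit = c(rep(c("kiwi", "cherry"), each = 3), "cherry"),
  kilo  = sample(1:10, size = 7)
)

# Sum kilos within each (fruit, year) pair so the key + index
# combination uniquely identifies each row, then cast
harvest_ts <- harvest %>%
  group_by(fruit, year) %>%
  summarise(kilo = sum(kilo), .groups = "drop") %>%
  as_tsibble(key = fruit, index = year)
```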

Take-aways

  1. Make sure you understand what kind of spatio-temporal data you’re working on/ talking about. What are the extent (range) and resolution (frequency) of your spatial (temporal) data?

  2. For data analysis, you don’t need detailed polygons. You don’t necessarily want to be a cartographer; simple polygons are good enough! Simplify them with ms_simplify() or st_simplify().

  3. Take some care with the coordinate reference system (CRS) when working with multiple spatial data objects. st_crs() is your friend.

  4. Checking for duplicates may not seem important at this particular moment, but correcting them now saves you from unexpected trouble down the road.

Your time 🔧

  • Plot the Victoria map from ozmaps::abs_ste with geom_sf(). Try different simplification parameters (keep/ dTolerance) to find the one that works best for the map.

  • Plot the world map using data from the rnaturalearth package in the Mollweide projection.

  • Resolve the duplication issue in the raw data (created in the exercise file) and cast it into a tsibble.

Further reading