hsclient HydroShare Python Client Resource Aggregation Data Object Operation Examples


The following code snippets show examples of how to use the hsclient HydroShare Python Client to load certain aggregation data types into relevant data processing objects, both to view data properties and to modify the data. The aggregation data object feature is available for the following HydroShare content type aggregations:

  • Time series
  • Geographic feature
  • Geographic raster
  • Multidimensional NetCDF
  • CSV

Install the hsclient Python Client

The hsclient Python Client for HydroShare may not be installed by default in your Python environment, so it has to be installed before you can work with it. Use the following command to install hsclient via the Python Package Index (PyPI). This installs hsclient as well as all the Python packages needed to work with aggregation data as data processing objects. The following packages will be installed in addition to hsclient:

  • pandas
  • fiona
  • rasterio
  • xarray
!pip install hsclient[all]
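
To confirm that the installation succeeded, you can print the installed version of hsclient. This uses the standard importlib.metadata module (available in Python 3.8+), so it makes no assumptions about hsclient itself:

# optional: verify the installation by printing the installed hsclient version
from importlib.metadata import version
print(version("hsclient"))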

Authenticating with HydroShare

Before you start interacting with resources in HydroShare, you will need to authenticate.

import os
from hsclient import HydroShare

hs = HydroShare()
hs.sign_in()
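
If you need to authenticate non-interactively (e.g., in a script or a scheduled job), hsclient also accepts credentials passed directly to the HydroShare constructor. A minimal sketch, assuming your credentials are stored in environment variables (HS_USERNAME and HS_PASSWORD are example names, not an hsclient convention):

# non-interactive sign-in - credentials are read from environment variables
# rather than hard-coded (HS_USERNAME/HS_PASSWORD are example variable names)
hs = HydroShare(username=os.environ["HS_USERNAME"], password=os.environ["HS_PASSWORD"])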

Loading Resource Aggregation Data to Relevant Python Data Analysis Modules

The Python data analysis module used for each of the supported aggregation types is shown below:

  • Time series : pandas.DataFrame
  • Geographic feature : fiona.Collection
  • Geographic raster : rasterio.DatasetReader
  • Multidimensional NetCDF : xarray.Dataset
  • CSV : pandas.DataFrame

In the following code examples, we assume a resource in HydroShare that contains the above five aggregation types, all stored at the root of the resource. The resource id used in the examples is "a0e0c2e2e5e84e1e9b6b2b2b2b2b2b2b". You will need to change this resource id to the id of your own resource in HydroShare.

# first we need to get the resource object from HydroShare using the id of the resource
resource_id = 'a0e0c2e2e5e84e1e9b6b2b2b2b2b2b2b'
resource = hs.resource(resource_id)
# show resource identifier
print(f"Resource ID:{resource.resource_id}")

Loading Time Series Data to pandas.DataFrame

Here we are assuming the time series aggregation contains an SQLite file named "sample.sqlite"

# retrieve the time series aggregation
file_path = "sample.sqlite"
ts_aggr = resource.aggregation(file__path=file_path)
# show the aggregation type
print(f"Aggregation Type:{ts_aggr.metadata.type}")
# display the time series results metadata to see all the available series
# later we will use one of the series ids to retrieve the time series data
print(ts_aggr.metadata.time_series_results)
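# optionally, list just the series ids so one can be picked for the next steps
# (assuming each result exposes a series_id attribute, per the hsclient metadata model)
for ts_result in ts_aggr.metadata.time_series_results:
    print(ts_result.series_id)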
# download the time series aggregation - these directory paths must exist for hsclient to download and unzip the aggregation zip file
# Note: These directory paths need to be changed based on where you want to download the aggregation

base_working_dir = "aggr_objects"
download_to = os.path.join(base_working_dir, "timeseries_testing")
unzip_to = os.path.join(download_to, "aggr_unzipped")
aggr_path = resource.aggregation_download(aggregation=ts_aggr, save_path=download_to, unzip_to=unzip_to)
print(f"Downloaded aggregation to:{aggr_path}")
# load a given time series of the aggregation as pandas.DataFrame from the downloaded location (aggr_path)

# Note: Here we are assuming the series id used below is one of the ids we found when we
# printed the time series results in an earlier step

series_id = '51e31687-1ebc-11e6-aa6c-f45c8999816f'
pd_dataframe = ts_aggr.as_data_object(series_id=series_id, agg_path=aggr_path)
print(f"Type of data processing object:{type(pd_dataframe)}")
# now we can use the pandas.DataFrame to do some data analysis

# show time series column headings
print(pd_dataframe.columns)
# show a concise summary of the time series data (DataFrame.info() prints directly)
pd_dataframe.info()
# show number of data points in time series
print(pd_dataframe.size)
# show first 5 records in time series
print(pd_dataframe.head(5))
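# show summary statistics (count, mean, std, min, max, quartiles) for the numeric
# columns - plain pandas, with no hsclient-specific assumptions
print(pd_dataframe.describe())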
# editing time series aggregation data using the pandas.DataFrame
print(f"Data frame size before edit:{pd_dataframe.size}")
rows, columns = pd_dataframe.shape
print(f"Number of rows:{rows}")
print(f"Number of columns:{columns}")
# delete 10 rows from the dataframe. This will result in deleting 10 records from the 'TimeSeriesResultValues' table when we save the dataframe.
pd_dataframe.drop(pd_dataframe.index[0:10], axis=0, inplace=True)
rows, columns = pd_dataframe.shape
print(f"Number of rows in dataframe after delete:{rows}")
print(f"Number of columns in dataframe after delete:{columns}")
print(f"Data frame size after delete:{pd_dataframe.size}")
expected_row_count = rows
# save the updated dataframe object to the time series aggregation in HydroShare
# Note this will update the data for the existing time series aggregation in HydroShare - this operation may take a while 
ts_aggr = ts_aggr.save_data_object(resource=resource, agg_path=aggr_path, as_new_aggr=False)
print(f"Updated time series aggregation ...")
# we can also create a new time series aggregation in HydroShare using the updated dataframe object
# we will first create a new folder in which the new aggregation will be created
aggr_folder = "ts_folder"
resource.folder_create(folder=aggr_folder)
# this operation may take a while  
ts_aggr = ts_aggr.save_data_object(resource=resource, agg_path=aggr_path, as_new_aggr=True,
                                   destination_path=aggr_folder)
print(f"Created a new time series aggregation ...")
# retrieve the updated time series aggregation to verify the data was updated
# reload the new timeseries as pandas.DataFrame
# need to first download this new aggregation

aggr_path = resource.aggregation_download(aggregation=ts_aggr, save_path=download_to, unzip_to=unzip_to)
print(f"Downloaded aggregation to:{aggr_path}")
pd_dataframe = ts_aggr.as_data_object(series_id=series_id, agg_path=aggr_path)
rows, columns = pd_dataframe.shape
print(f"Number of rows in the updated timeseries:{rows}")
print(f"Number of columns in the updated timeseries:{columns}")
assert rows == expected_row_count

Loading Geographic Feature Data to fiona.Collection

Here we are assuming the geographic feature aggregation contains a shapefile named "sample.shp"

# retrieve the geographic feature aggregation
file_path = "sample.shp"
gf_aggr = resource.aggregation(file__path=file_path)
# show the aggregation type
print(f"Aggregation Type:{gf_aggr.metadata.type}")
# download the geographic feature aggregation - these directory paths must exist for hsclient to download and unzip the aggregation zip file
# Note: These directory paths need to be changed based on where you want to download the aggregation
download_to = os.path.join(base_working_dir, "geofeature_testing")
unzip_to = os.path.join(download_to, "aggr_unzipped")
aggr_path = resource.aggregation_download(aggregation=gf_aggr, save_path=download_to, unzip_to=unzip_to)
print(f"Downloaded aggregation to:{aggr_path}")
# load the downloaded geo-feature aggregation as a fiona Collection object
fiona_coll = gf_aggr.as_data_object(agg_path=aggr_path)
print(f"Type of data processing object:{type(fiona_coll)}")
# now we can use the fiona.Collection object to do some data analysis

# show driver used to open the vector file
print(fiona_coll.driver)
# show feature collection coordinate reference system
print(fiona_coll.crs)
# show feature collection spatial coverage
print(fiona_coll.bounds)
# show number of features
print(len(list(fiona_coll)))
# show feature field information
print(fiona_coll.schema)
# show data for a single feature in feature collection
from fiona.model import to_dict

feature = fiona_coll[1]
to_dict(feature)
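# scan a single attribute across all features in the collection
# Note: the STATE_NAME property is specific to the sample dataset used in this notebook
for ft in fiona_coll:
    print(to_dict(ft)['properties']['STATE_NAME'])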
# editing geographic feature aggregation data using the fiona.Collection object
import fiona

# location of the new output shp file
# Note: The output shapefile directory path must exist.
output_shp_file_dir_path = os.path.join(download_to, "updated_aggr")

# name the output shp file the same as the original shp file
orig_shp_file_name = os.path.basename(gf_aggr.main_file_path)
output_shp_file_path = os.path.join(output_shp_file_dir_path, orig_shp_file_name)

# here we will remove one of the features (where the state name is Alaska) and then write the updated data to a new shp file
# Note: You will have to use different selection criteria depending on your feature dataset
with fiona.open(output_shp_file_path, 'w', schema=fiona_coll.schema, driver=fiona_coll.driver,
                crs=fiona_coll.crs) as out_shp_file:
    for feature in fiona_coll:
        ft_dict = to_dict(feature)
        if ft_dict['properties']['STATE_NAME'] != "Alaska":
            out_shp_file.write(feature)
        else:
            print(">> Skipping feature for Alaska")

print("Done updating the shp file ...")
# we can now update the geographic feature aggregation in HydroShare using the updated shp file - this operation may take a while 
gf_aggr = gf_aggr.save_data_object(resource=resource, agg_path=output_shp_file_dir_path, as_new_aggr=False)
print("Aggregation updated ...")
# we can also create a new geographic feature aggregation in HydroShare using the updated shp file

# we will first create a new folder in which the new aggregation will be created in HydroShare
aggr_folder = "gf_folder"
resource.folder_create(folder=aggr_folder)
# first retrieve the data object from the updated shp file - this step is not needed if you have not saved the object previously
fiona_coll = gf_aggr.as_data_object(agg_path=output_shp_file_dir_path)
# this operation may take a while
gf_aggr = gf_aggr.save_data_object(resource=resource, agg_path=output_shp_file_dir_path, as_new_aggr=True,
                                 destination_path=aggr_folder)
print("New aggregation created ...")
# retrieve the updated geographic feature aggregation to verify the data was updated
# need to first download this updated/new aggregation
aggr_path = resource.aggregation_download(aggregation=gf_aggr, save_path=download_to, unzip_to=unzip_to)
fiona_coll = gf_aggr.as_data_object(agg_path=aggr_path)
# check the number of features in the updated aggregation
print(len(list(fiona_coll)))

Loading Multidimensional Data to xarray.Dataset

Here we are assuming the multidimensional aggregation contains a NetCDF file named "sample.nc"

# retrieve the multidimensional aggregation
file_path = "sample.nc"
md_aggr = resource.aggregation(file__path=file_path)
print(f"Aggregation Type:{md_aggr.metadata.type}")
# download the multidimensional aggregation - these directory paths must exist for hsclient to download and unzip the aggregation zip file
# Note: These directory paths need to be changed based on where you want to download the aggregation
download_to = os.path.join(base_working_dir, "netcdf_testing")
unzip_to = os.path.join(download_to, "aggr_unzipped")
aggr_path = resource.aggregation_download(aggregation=md_aggr, save_path=download_to, unzip_to=unzip_to)
print(f"Downloaded aggregation to:{aggr_path}")
# load the downloaded multidimensional aggregation as a xarray.Dataset object
xarray_ds = md_aggr.as_data_object(agg_path=aggr_path)
print(f"Type of data processing object:{type(xarray_ds)}")
# now we can use the xarray.Dataset object to do some data analysis

# show netcdf global attributes
print(xarray_ds.attrs)
# show netcdf dimensions
print(xarray_ds.dims)
# show coordinate variables of the netcdf dataset
print(xarray_ds.coords)
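# show the data variables (names, dimensions, dtypes) of the netcdf dataset - generic
# xarray, independent of the contents of the sample file
print(xarray_ds.data_vars)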
# editing multidimensional aggregation data using the xarray.Dataset object

# here we will only change the title attribute of the dataset
aggr_title = "This is a modified title for this aggregation modified using hsclient"
xarray_ds.attrs["title"] = aggr_title
# we can update the multidimensional aggregation in HydroShare using the updated xarray.Dataset object - this operation may take a while
md_aggr = md_aggr.save_data_object(resource=resource, agg_path=aggr_path, as_new_aggr=False)
print("Aggregation updated ...")
# we can also create a new multidimensional aggregation in HydroShare using the updated xarray.Dataset object

# we will first create a new folder in which the new aggregation will be created
aggr_folder = "md_folder"
resource.folder_create(folder=aggr_folder)
# first retrieve the data object from the updated netcdf file - this step is not needed if you have not saved the object previously
xarray_ds = md_aggr.as_data_object(agg_path=aggr_path)
# this operation may take a while
md_aggr = md_aggr.save_data_object(resource=resource, agg_path=aggr_path, as_new_aggr=True,
                                 destination_path=aggr_folder)
print("New aggregation created ...")
# retrieve the updated multidimensional aggregation to verify the data was updated

# need to first download this updated/new aggregation
aggr_path = resource.aggregation_download(aggregation=md_aggr, save_path=download_to, unzip_to=unzip_to)
xarray_ds = md_aggr.as_data_object(agg_path=aggr_path)
# check the title attribute of the updated aggregation
assert xarray_ds.attrs["title"] == aggr_title

Loading Geo Raster Data to rasterio.DatasetReader

Here we are assuming the georaster aggregation contains a GeoTIFF file named "sample.tif"

# retrieve the georaster aggregation
file_path = "sample.tif"
gr_aggr = resource.aggregation(file__path=file_path)
print(f"Aggregation Type:{gr_aggr.metadata.type}")
# download the georaster aggregation - these directory paths must exist for hsclient to download and unzip the aggregation zip file
# Note: These directory paths need to be changed based on where you want to download the aggregation
download_to = os.path.join(base_working_dir, "georaster_testing")
unzip_to = os.path.join(download_to, "aggr_unzipped")
aggr_path = resource.aggregation_download(aggregation=gr_aggr, save_path=download_to, unzip_to=unzip_to)
print(f"Downloaded aggregation to:{aggr_path}")
# load the downloaded georaster aggregation as a rasterio.DatasetReader object
rasterio_ds = gr_aggr.as_data_object(agg_path=aggr_path)
print(f"Type of data processing object:{type(rasterio_ds)}")
# now we can use the rasterio.DatasetReader object to do some data analysis

# show raster band count
print(rasterio_ds.count)
# show raster band dimensions
print(rasterio_ds.width, rasterio_ds.height)
# show raster coordinate reference system
print(rasterio_ds.crs)
# show raster bounds
print(rasterio_ds.bounds)
# show raster data
data = rasterio_ds.read()
print(data)
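# the read() result is a numpy array of shape (bands, rows, cols), so standard numpy
# operations apply (note: nodata values, if present, are included in these statistics)
print(f"shape:{data.shape}")
print(f"min:{data.min()}, max:{data.max()}, mean:{data.mean()}")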
# editing georaster aggregation data using the rasterio.DatasetReader object
from rasterio.windows import Window
import rasterio

# here we will subset the raster data to a smaller extent
print("raster dimensions before editing:")
print(f"raster width :{rasterio_ds.width}")
print(f"raster height:{rasterio_ds.height}")
new_width = rasterio_ds.width - 9
new_height = rasterio_ds.height - 10
subset_window = Window(0, 0, new_width, new_height)
subset_band = rasterio_ds.read(1, window=subset_window)
print(subset_band)
# write the subset data to a new tif file - note the target directory must exist and be empty
# Note: The original raster aggregation may have more than one tif file. The following update will always result in an updated or new aggregation
# with a single tif file.

output_raster_dir_path = os.path.join(base_working_dir, "georaster_testing", "updated_aggr")
output_raster_filename = "out_sample.tif"
output_raster_file_path = os.path.join(output_raster_dir_path, output_raster_filename)
profile = rasterio_ds.profile
rasterio_ds.close()
profile['driver'] = "GTiff"
profile['width'] = new_width
profile['height'] = new_height

with rasterio.open(output_raster_file_path, "w", **profile) as dst:
    dst.write(subset_band, 1)

print(f"Saved subset raster to:{output_raster_file_path}")
# we can update the georaster aggregation in HydroShare using the updated rasterio.DatasetReader object - this operation may take a while
gr_aggr = gr_aggr.save_data_object(resource=resource, agg_path=output_raster_dir_path, as_new_aggr=False)
print("Aggregation updated ...")
# we can also create a new georaster aggregation in HydroShare using the updated rasterio.DatasetReader object

# If you have already updated the aggregation as described in the previous cell, you first have to download the updated aggregation and load the
# rasterio.DatasetReader object from the downloaded location before you can save the raster to a new aggregation in HydroShare, as shown below. Otherwise, you can skip ahead to the next cell.

download_to = os.path.join(base_working_dir, "georaster_testing")
# note the unzip_to directory must exist and be empty
unzip_to = os.path.join(download_to, "updated_aggr_unzipped")
aggr_path = resource.aggregation_download(aggregation=gr_aggr, save_path=download_to, unzip_to=unzip_to)
print(f"Downloaded aggregation to:{aggr_path}")

# reload the updated raster as rasterio.DatasetReader
rasterio_ds = gr_aggr.as_data_object(agg_path=aggr_path)

# we will first create a new folder in which the new aggregation will be created
aggr_folder = "gr_folder"
resource.folder_create(folder=aggr_folder)
# this operation may take a while
gr_aggr = gr_aggr.save_data_object(resource=resource, agg_path=aggr_path, as_new_aggr=True,
                                   destination_path=aggr_folder)
print("New aggregation created ...")
# retrieve the updated georaster aggregation to verify the data was updated

# need to first download this updated/new aggregation
download_to = os.path.join(base_working_dir, "georaster_testing")
# note the unzip_to directory must exist and be empty
unzip_to = os.path.join(download_to, "aggr_unzipped")
aggr_path = resource.aggregation_download(aggregation=gr_aggr, save_path=download_to, unzip_to=unzip_to)
rasterio_ds = gr_aggr.as_data_object(agg_path=aggr_path)
# check the raster dimensions of the updated aggregation
print("raster dimensions after editing:")
print(f"raster width :{rasterio_ds.width}")
print(f"raster height:{rasterio_ds.height}")

Loading CSV Data to pandas.DataFrame

Here we are assuming the CSV aggregation contains a CSV file named "sample.csv"

# retrieve the CSV aggregation
file_path = "sample.csv"
csv_aggr = resource.aggregation(file__path=file_path)
# show the aggregation type
print(f"Aggregation Type:{csv_aggr.metadata.type}")
# download the CSV aggregation - these directory paths must exist for hsclient to download and unzip the aggregation zip file
# Note: These directory paths need to be changed based on where you want to download the aggregation
download_to = os.path.join(base_working_dir, "csv_testing")
unzip_to = os.path.join(download_to, "aggr_unzipped")
aggr_path = resource.aggregation_download(aggregation=csv_aggr, save_path=download_to, unzip_to=unzip_to)
print(f"Downloaded aggregation to:{aggr_path}")
# load the CSV aggregation as pandas.DataFrame
csv_df = csv_aggr.as_data_object(agg_path=aggr_path)
# show number of rows and columns
print(f"Number of data rows:{len(csv_df)}")
print(f"Number of data columns:{len(csv_df.columns)}")
# show the first 5 data rows
print(csv_df.head(5))
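# show the column data types pandas inferred when loading the CSV - these can be
# compared against the data types recorded in the extracted table schema shown next
print(csv_df.dtypes)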
# show the extracted CSV aggregation metadata (table schema)
table_schema = csv_aggr.metadata.tableSchema
table = table_schema.table
print(f"Number of data rows:{table_schema.rows}")
print(f"Number of data columns:{len(table.columns)}")
print(f"Delimiter:{table_schema.delimiter}")

# show data column properties
for col in table.columns:
    print(f"Column number:{col.column_number}")
    print(f"Column title:{col.title}")
    print(f"Column description:{col.description}")
    print(f"Column data type:{col.datatype}")
    print("-"*50) 

Editing CSV Aggregation Data Using pandas.DataFrame

# drop the last data column - note all editing needs to be in 'inplace' mode
csv_df.drop(csv_df.columns[-1], axis=1, inplace=True)
# show the number of data columns after the edit
print(f"Number of data columns after edit:{len(csv_df.columns)}")
# save the updated CSV aggregation in HydroShare
# Note this will overwrite the original aggregation - this operation may take a while
csv_aggr = csv_aggr.save_data_object(resource=resource, agg_path=aggr_path, as_new_aggr=False)
print("Aggregation updated ...")
# we can also create a new CSV aggregation in HydroShare using the updated pandas.DataFrame object
# we first create a new folder in which the new aggregation will be created
aggr_folder = "csv_folder"
resource.folder_create(folder=aggr_folder)

# this operation may take a while
csv_aggr = csv_aggr.save_data_object(resource=resource, agg_path=aggr_path, as_new_aggr=True, destination_path=aggr_folder)
print("New CSV aggregation was created ...")
# retrieve the updated CSV aggregation to verify the data was updated
download_to = os.path.join(base_working_dir, "csv_testing")

# note the unzip_to directory must exist and be empty
unzip_to = os.path.join(download_to, "aggr_unzipped")
aggr_path = resource.aggregation_download(aggregation=csv_aggr, save_path=download_to, unzip_to=unzip_to)
csv_df = csv_aggr.as_data_object(agg_path=aggr_path)

# show the number of data rows and columns
print(f"Number of data rows:{len(csv_df)}")
print(f"Number of data columns:{len(csv_df.columns)}")
# show the first 5 data rows
print(csv_df.head(5))