Tutorial: how to use pangeo-fish#
Overview.
This Jupyter notebook demonstrates how to use pangeo-fish.
Specifically, we will fit the geolocation on the data from the study conducted by M. Gonze et al. titled “Combining acoustic telemetry with archival tagging to investigate the spatial dynamics of the understudied pollack Pollachius pollachius”, accepted for publication in the Journal of Fish Biology.
We will use the biologging tag “A19124”, which was attached to pollack fish.
As for the reference Earth Observation (EO) data, we consider the European Union Copernicus Marine Service Information (CMEMS) product “NORTHWESTSHELF_ANALYSIS_FORECAST_PHY_004_013”.
NB: In addition to the Data Storage Tag (DST), the biologging data includes teledetection by acoustic signals, as well as the release and recapture/death information of the fish.
Both the reference EO and the biologging data are publicly available, and the computations should be tractable for most standard laptops.
Workflow.
Let’s first summarize the key steps for running the geolocation:
Define the configuration: define the required parameters for the analysis.
Compare the reference data with the DST information: compare the data from the reference model with the biologging data.
Regrid the comparison to HEALPix: translate the comparison into a HEALPix grid to avoid spatial distortion.
Construct the temporal emission matrix: create a temporal emission probability distribution (pdf) from the transformed grid.
Construct another emission matrix with the acoustic detections: calculate a similar model to the previous one, using this time the acoustic teledetections.
Combine and normalize the matrices: merge and normalize the two pdfs.
Estimate (or fit) the geolocation model: determine the parameters of the model based on the normalized emission matrix.
Compute the state probabilities and generate trajectories: compute the fish’s location probability distribution and generate subsequent trajectories.
Visualization: visualize the evolution of the spatial probabilities over time and export the video.
Throughout this tutorial, you will gain practical experience in setting up and executing a typical workflow using pangeo-fish such that you can then apply the tool with your use-case study.
1. Initialization and configuration definition#
In this step, we prepare the execution of the analysis. It includes:
Installing the necessary packages.
Importing the required libraries.
Defining the parameters for the next stages of the workflow.
Configuring the cluster for distributed computing.
[ ]:
!pip install rich zstandard
!pip install "xarray-healpy @ git+https://github.com/iaocea/xarray-healpy.git@0ffca6058f4008f4f22f076e2d60787fcf32ac82"
# !pip install -e ../.
!pip install movingpandas more_itertools
!pip install xarray --upgrade
!pip install xdggs
!pip install git+https://github.com/IAOCEA/healpix-convolution
[ ]:
from pint_xarray import unit_registry as ureg
import hvplot.xarray
import xarray as xr
import sys
sys.path.append("../")
import pangeo_fish
[ ]:
# tag_name corresponds to the name of the biologging tag name (DST identification number),
# which is also a path for storing all the information for the specific fish tagged with tag_name.
tag_name = "A19124"
# tag_root specifies the root URL for tag data used for this computation.
tag_root = "https://data-taos.ifremer.fr/data_tmp/cleaned/tag/"
# ref_url is the path to the reference model
ref_url = "https://data-taos.ifremer.fr/kerchunk/ref-copernicus.yaml"
# scratch_root specifies the root directory for storing output files.
# storage_options specifies options for the filesystem storing output files.
## example for remote storage
scratch_root = "s3://destine-gfts-data-lake/demo"
storage_options = {
"anon": False,
"profile": "gfts",
"client_kwargs": {
"endpoint_url": "https://s3.gra.perf.cloud.ovh.net",
"region_name": "gra",
},
}
## example for using your local file system instead
scratch_root = "."
storage_options = None
# Default chunk value for time dimension. This values depends on the configuration of your dask cluster.
chunk_time = 24
# Either to use a HEALPix grid (["cells"]) or a 2D grid (["x", "y"])
dims = ["cells"]
# bbox, bounding box, defines the latitude and longitude range for the analysis area.
bbox = {"latitude": [46, 51], "longitude": [-8, -1]}
# relative_depth_threshold defines the acceptable fish depth relative to the maximum tag depth.
# It determines whether the fish can be considered to be in a certain location based on depth.
relative_depth_threshold = 0.8
# optional rotation for the HEALPix grid
rot = {"lat": 0, "lon": 0}
# nside defines the resolution of the healpix grid used for regridding.
nside = 4096
# min_vertices sets the minimum number of vertices for a valid transcription for regridding.
min_vertices = 1
# differences_std sets the standard deviation for scipy.stats.norm.pdf.
# It expresses the estimated certainty of the field of difference.
differences_std = 0.75
# recapture_std sets the covariance for recapture event.
# It shows the certainty of the final recapture area if it is known.
recapture_std = 1e-2
# earth_radius defines the radius of the Earth used for distance calculations.
earth_radius = ureg.Quantity(6371, "km")
# maximum_speed sets the maximum allowable speed for the tagged fish.
maximum_speed = ureg.Quantity(60, "km / day")
# adjustment_factor adjusts parameters for a more fuzzy search.
# It will factor the allowed maximum displacement of the fish.
adjustment_factor = 5
# truncate sets the truncating factor for computed maximum allowed sigma for convolution process.
truncate = 4
# receiver_buffer sets the maximum allowed detection distance for acoustic receivers.
receiver_buffer = ureg.Quantity(1000, "m")
# tolerance describes the tolerance level of the search during the fitting/optimization of the geolocation.
# Smaller values will make the optimization iterate more
tolerance = 1e-3 if dims == ["x", "y"] else 1e-6
# track_modes defines the modes for generating fish's trajectories.
track_modes = ["mean", "mode"]
# additional_track_quantities sets quantities to compute for tracks using moving pandas.
additional_track_quantities = ["speed", "distance"]
# time_step defines the time interval between each frame of the visualization
time_step = 3
[ ]:
# Define target root directories for storing analysis results.
target_root = f"{scratch_root}/{tag_name}"
# Defines default chunk size for optimization.
default_chunk = {"time": chunk_time, "lat": -1, "lon": -1}
default_chunk_dims = {"time": chunk_time}
default_chunk_dims.update({d: -1 for d in dims})
[ ]:
# Set up a local cluster for distributed computing.
from distributed import LocalCluster
cluster = LocalCluster()
client = cluster.get_client()
client
Now that everything is set up, we can start by loading the biologging data (or tag)
[ ]:
from pangeo_fish.helpers import load_tag
tag, tag_log, time_slice = load_tag(
tag_root=tag_root, tag_name=tag_name, storage_options=storage_options
)
tag
You can plot the time series of the DST with the function plot_tag():
[ ]:
from pangeo_fish.helpers import plot_tag
plot = plot_tag(
tag=tag,
tag_log=tag_log,
# you can directly save the plot if you want
save_html=True,
storage_options=storage_options,
target_root=target_root,
)
plot
2. Compare the reference data with the DST logs#
In this step, we compare the reference model data with Data Storage Tag information. The process involves reading and cleaning the reference model, aligning time, converting depth units and subtracting the tag data from the model. We also illustrate how to plot and saving the result.
[ ]:
from pangeo_fish.helpers import load_model, compute_diff
reference_model = load_model(
uri=ref_url,
tag_log=tag_log,
time_slice=time_slice,
bbox=(bbox | {"max_depth": tag_log["pressure"].max()}),
chunk_time=chunk_time,
)
diff = compute_diff(
reference_model=reference_model,
tag_log=tag_log,
relative_depth_threshold=relative_depth_threshold,
chunk_time=chunk_time,
)[0]
We can detect abnormal data by looking at the number of non null values for each time step.
[ ]:
diff = diff.compute()
[ ]:
diff["diff"].count(["lat", "lon"]).plot()
diff
[ ]:
diff.to_zarr(f"{target_root}/diff.zarr", mode="w", storage_options=storage_options)
3. HEALPix regridding#
In this step, we regrid the data from above to HEALPix coordinates.
This is a complex process, composed of several steps such as defining the HEALPix grid, creating the target grid and computing interpolation weights
Fortunately though, pangeo-fish embarks high-level functions to do the work for us!
[ ]:
from pangeo_fish.helpers import open_diff_dataset, regrid_dataset
[ ]:
# Open the previous dataset (only necessary if you resume the notebook from here)
diff = open_diff_dataset(target_root=target_root, storage_options=storage_options)
diff
[ ]:
reshaped = regrid_dataset(
ds=diff, nside=nside, min_vertices=min_vertices, rot=rot, dims=dims
)[0]
reshaped
Let’s plot the same chart as before to check that the HEALPix regridding hasn’t changed the data
[ ]:
reshaped["diff"].count(dims).plot()
[ ]:
# Saves the result
reshaped.chunk(default_chunk_dims).to_zarr(
f"{target_root}/diff-regridded.zarr",
mode="w",
consolidated=True,
compute=True,
storage_options=storage_options,
)
4. Compute the emission probability distribution#
In this step, we use the comparison result from the step above to construct the emission probability matrix.
This comparison is essentially he differences between the temperature measured by the tag and the reference sea temperature.
The emission probability matrix represents the likelihood of observing a specific temperature difference given the model parameters and configurations.
[ ]:
from pangeo_fish.helpers import compute_emission_pdf
[ ]:
# Open the previous dataset (only necessary if you resume the notebook from here)
differences = xr.open_dataset(
f"{target_root}/diff-regridded.zarr",
engine="zarr",
chunks={},
storage_options=storage_options,
).pipe(lambda ds: ds.merge(ds[["latitude", "longitude"]].compute()))
# ... and compute the emission matrices
emission_pdf = compute_emission_pdf(
diff_ds=differences,
events_ds=tag["tagging_events"].ds,
differences_std=differences_std,
recapture_std=recapture_std,
dims=dims,
chunk_time=chunk_time,
)[0]
emission_pdf
Whatever the temporal distribution looks like, they must never (i.e, at any time step) sum to 0.
How could we check that visually? You’d have guessed it by now: similarly as before!
[ ]:
emission_pdf = emission_pdf.chunk(default_chunk_dims).persist()
emission_pdf["pdf"].count(dims).plot()
[ ]:
# Save the dataset
emission_pdf.to_zarr(
f"{target_root}/emission.zarr",
mode="w",
consolidated=True,
storage_options=storage_options,
)
5. Compute a 2nd pdf with the acoustic detections#
In this step, the goal is to calculate another emission distribution, this time from the acoustic detections. As such, it requires the tag to include at least one detection.
These additional probabilities will enhance the emission pdf constructed in the previous step by incorporating information from acoustic telemetry.
NB: we will merge and normalize the two pdfs in the next stage of the workflow.
[ ]:
from pangeo_fish.helpers import compute_acoustic_pdf
[ ]:
# Load the previous emission pdf and compute the emission probabilities based on acoustic detections
emission_pdf = xr.open_dataset(
f"{target_root}/emission.zarr",
engine="zarr",
chunks={},
storage_options=storage_options,
) # chunk?
acoustic_pdf = compute_acoustic_pdf(
emission_ds=emission_pdf,
tag=tag,
receiver_buffer=receiver_buffer,
chunk_time=chunk_time,
dims=dims,
)[0]
acoustic_pdf = acoustic_pdf.persist()
If you wonder how this emission matrix looks like, you can plot a combined plot of the detections and the probabilities:
[ ]:
tag["acoustic"]["deployment_id"].hvplot.scatter(c="red", marker="x") * (
acoustic_pdf["acoustic"] != 0
).sum(dim=dims).hvplot()
Explanations#
On the plot above, at detection times the number of counted values drop to a few value (5 in this example).
These numbers correspond to the number of pixels that covers the detection area.
Therefore, such drop is expected, since at those times we know that the fish was detected around the acoustic receivers, and so it can’t be elsewhere.
These sporadic detections will constraint a lot the geolocation model upon optimizing!
The next cell is optional. It will save the acoustic emission distribution. It is not necessary (see the next step).
[ ]:
acoustic_pdf.to_zarr(
f"{target_root}/acoustic.zarr",
mode="w",
consolidated=True,
storage_options=storage_options,
)
6. Combine and normalize the two distributions#
As mentioned before, before fitting the model, we need to merge the emission distribution with the acoustic one and normalize the result.
The normalization ensures that the probabilities sum up to one for each time step.
[ ]:
from pangeo_fish.helpers import combine_pdfs
[ ]:
combined = combine_pdfs(
emission_ds=emission_pdf,
acoustic_ds=acoustic_pdf,
chunks=default_chunk_dims,
dims=dims,
)[0]
combined.to_zarr(
f"{target_root}/combined.zarr",
mode="w",
consolidated=True,
storage_options=storage_options,
)
Let’s perform a last check before fitting the model’s parameters.
Indeed, remind that the pdfs have been combined as their product.
As such, for some time steps, if they don’t overlap, the result will be empty (probability of 0 everywhere!).
We can quickly detect if this happens by plotting the sum of the probabilities over the time dimension:
[ ]:
combined["pdf"].sum(dims).plot(ylim=(0, 2))
The sums should equal to ``1``.
7. Estimate the model’s parameters#
It is now time to determine the parameters of the model based on the normalized emission matrix.
Precisely, is consists of finding the best sigma, which corresponds to the standard deviation of the Brownian motion that models the fish’s movement between the time steps.
To do so, in the following we:
Define the lower and upper bounds for
sigma.Search for the best
sigmawithoptimize_pdf().Save the results of the search (i.e.,
sigma), along with any additional parameters used during the optimization, a human-readable.jsonfile.
[ ]:
from pangeo_fish.helpers import optimize_pdf
[ ]:
# Open the distributions
emission = xr.open_dataset(
f"{target_root}/combined.zarr",
engine="zarr",
chunks=default_chunk_dims,
inline_array=True,
storage_options=storage_options,
)
# Define the parameter's bounds and search for the best value
params = optimize_pdf(
ds=emission,
earth_radius=earth_radius,
adjustment_factor=adjustment_factor,
truncate=truncate,
maximum_speed=maximum_speed,
tolerance=tolerance,
dims=dims,
# the results can be directly saved
save_parameters=True,
storage_options=storage_options,
target_root=target_root,
)
params
8. State probabilities and Trajectories#
In this second to last step, we calculate the spatial probability distribution (based on the sigma found earlier) and further compute trajectories.
NB: the computation precisely relies on ``sigma`` and the combined emission pdf.
[ ]:
from pangeo_fish.helpers import predict_positions
[ ]:
states, trajectories = predict_positions(
target_root=target_root,
storage_options=storage_options,
chunks=default_chunk_dims,
track_modes=track_modes,
additional_track_quantities=additional_track_quantities,
save=True,
)
Let’s quickly check that the positional probability distribution states never sums to 0 for all timesteps!
[ ]:
(
states.sum(dims).hvplot(width=500, ylim=(0, 2), title="Sum of the probabilities")
+ states.count(dims).hvplot(width=500, title="Number of none-zero probabilities")
).opts(shared_axes=False)
9. Visualization#
In this final step, we visualize various aspects of the analysis results to gain insights and interpret the model outcomes.
We plot the emission distribution, which represents the likelihood of observing a specific temperature difference given the model parameters and configurations.
Additionally, we visualize the state probabilities, showing the likelihood of the system (i.e, the fish) being in different states (i.e, positions) at each time step.
We also plot the trajectories decoded before (if you saved them).
They display the possible movement patterns over time.
Finally, we render the emission matrix and state probabilities in a video and store it.
9.1 Plotting the trajectories#
[ ]:
from pangeo_fish.helpers import plot_trajectories
[ ]:
plot = plot_trajectories(
target_root=target_root,
track_modes=track_modes,
storage_options=storage_options,
save_html=True,
)
plot
9.2 Plotting the states and emission distributions#
[ ]:
from pangeo_fish.helpers import open_distributions, render_distributions
[ ]:
data = open_distributions(
target_root=target_root,
storage_options=storage_options,
chunks=default_chunk_dims,
chunk_time=chunk_time,
)
data
The interactive plot above is too large to be stored as a HMTL file (as done earlier with the trajectories).
Fortunately, pangeo-fish can efficiently render images of data and build a video from them!
[ ]:
%pip install imageio[ffmpeg]
[ ]:
video_filename = render_distributions(
data=data,
output_path=f"{target_root}/states",
xlim=bbox["longitude"],
ylim=bbox["latitude"],
time_step=time_step,
extension="mp4",
frames_dir="images",
remove_frames=True,
storage_options=storage_options,
)