# FDE GUIDE
# 🚀 Project Initiation
[Balam Diagram Tables ](https://dbdiagram.io/d/Balam-DB-640a5096296d97641d86cead)
## Side note for flex projects
Flex projects are normally projects with configuration 1 or 2. After creating the project, group, monitor, and token, the flex project's shortname should be added to the `Repository variables` in the pufferfish-client repository Settings, specifically to the variable `REACT_APP_RAW_AUDIO_PROJECTS`. Then launch the GitHub Action to rebuild the pufferfish-webclient (the action can target staging, umojatest.biometrio.earth, and/or production, umoja.biometrio.earth).
## Side note for uploading test data
1. The project should be created in the Balam test DB with a token that allows test data to be uploaded to umojatest.biometrio.earth.
2. The project's shortname should be added to the `REACT_APP_RAW_AUDIO_PROJECTS` variable in the pufferfish-client GitHub repository, and the staging-branch GitHub Action should be run so that every recorder's data is uploaded to the "audio_raw" directory.
3. Once the test data is uploaded, the project directory created under `s3://pufferfish-test/FieldData/<shortname project>` should be moved to `s3://be-upload/<shortname project_test>`, and stakeholders should be told where the test data directory is so they can review it with their AWS Workspace access.
Example: move `s3://pufferfish-test/FieldData/<shortname project>/` to `s3://be-upload/<shortname project_test>/`
## Create Project
*Requirements:*
> - Project name
> - Shortname
> - Project configuration (with nodes, without nodes, or without sites)
> - Country
> - Group
*Tables:*
> - Projects
> - Monitors
> - SessionTokens
*Steps*
The following steps are performed in the test database (balam_test_db) and subsequently copied to the production database (balam_db).
0. Create a group in Balam test [Django Administration staging link](https://balam.biometrio.earth/staging/admin/auth/group/).
Balam prod uses the following link [Django Administration group link](https://balam.biometrio.earth/admin/auth/group/).
1. Create the project in Balam test at the following link
[Django Administration test link](https://balam.biometrio.earth/staging/admin/ProjectManagement/project/add/)
Balam prod uses the following link [Django Administration link](https://balam.biometrio.earth/admin/ProjectManagement/project/add/)
- Project's name
- Project's shortname identifier
- Project configuration
- Countries (Select project location)
- Groups (Admins)
Leave the other fields empty.
2. Create monitor in the following link
[Monitor test link](https://balam.biometrio.earth/staging/admin/ProjectManagement/monitor/add/)
[Monitor prod link](https://balam.biometrio.earth/admin/ProjectManagement/monitor/add/)
Fill in the following fields:
- First name: the monitor's name; it can include part of the shortname to reference the project.
- Role: leader
3. Create token in the following link
[Token test link](https://balam.biometrio.earth/staging/admin/ProjectManagement/sessiontoken/add/)
[Token prod link](https://balam.biometrio.earth/admin/ProjectManagement/sessiontoken/add/)
- Project: Select the project created in the first step
- Monitor: Select the monitor created in the second step
4. Copy data created from test to production
Locate the newly created rows in the Projects, Monitors, and SessionTokens tables in the balam_test_db database and copy them to the balam_db database.
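A hedged way to locate the new rows on balam_test_db before copying them (the Monitors filter column is an assumption; check the actual schema):
```sql
-- Run on balam_test_db; copy the returned rows into the same tables on balam_db.
SELECT * FROM "Projects" WHERE shortname = '<shortname>';
SELECT * FROM "Monitors" WHERE first_name ILIKE '%<monitor name>%';  -- filter column assumed
SELECT * FROM "SessionTokens" WHERE token = '<token>';
```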
## Create QR codes
*Requirement:*
> - The project exists in Balam
> - Number of cameras and recorders to be created
On [Balam Test GraphQL](https://apibalam.biometrio.earth/staging/):
1. Create Cameras
```
mutation {
  createMultipleProjectDevices(
    project: "<project_id>",
    deviceType: CAMERA,
    devicesNumber: 5
  ) {
    devicesCreated
    devices {
      projectSerialNumber
      deviceType
    }
  }
}
```
2. Create Recorders
- Get the configuration of recorders
```
{
  allDeviceConfigs {
    items {
      id
      name
    }
  }
}
```
```
{ "data": {
"allDeviceConfigs": { "items": [
{ "id": "<id_ultrasonic>",
"name": "Audiomoth Ultrasonic" },
{ "id": "<id_audible>",
"name": "Audiomoth Audible"
} ] } }}
```
- Create the recorders
```
mutation {
  createMultipleProjectDevices(
    project: "<project_id>",
    deviceConfig: "<ultrasonic_id>",
    deviceType: RECORDER,
    devicesNumber: 5
  ) {
    devicesCreated
    devices {
      projectSerialNumber
      deviceType
    }
  }
}
```
Copy all devices created in Balam test to the balam_db table "ProjectDevices".
3. [Umoja](https://umoja.biometrio.earth/login)

- Go to the project and select < View Devices >

- Wait and Download

## Mapping Devices
### Case QR exist
*Requirements:*
> - Table for associating serial number, SD card, project serial number and device brand
*Tables:*
> - Devices
> - ProjectDevices
> - Projects
1. Review in ProjectDevices that the project_serial_number exists and is associated with the correct project_id.
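> *Suggestion:* the existing mappings for the project can be reviewed with a query like:
```sql
SELECT project_serial_number, device_type, device_id
FROM "ProjectDevices"
WHERE project_id = '<project_id>';
```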
2. Associate serial_number and project_serial_number with the following queries:
```
insert into "Devices" (
  serial_number,
  brand
) values (
  '<serial_number>',
  '<brand>'
);

update "ProjectDevices"
set device_id = (
  select id from "Devices" where serial_number = '<serial_number>'
)
where project_serial_number = '<project_serial_number>' and project_id = '<project_id>';
```
### Case no QR
*Requirements:*
> - Table for associating SD card, project serial number, and device brand
*Tables:*
> - ProjectDevices
> - Projects
1. Create the project_serial_number in ProjectDevices.
Cameras:
```
INSERT INTO "ProjectDevices" (project_id, project_serial_number, device_type, status)
VALUES (
(SELECT id FROM "Projects" WHERE shortname = '<shortname>'),
'<project_serial_number>',
'camera',
'active'
);
```
Recorders:
```
INSERT INTO "ProjectDevices" (project_id, project_serial_number, device_type, status, device_config_id)
VALUES (
(SELECT id FROM "Projects" WHERE shortname = '<shortname>'),
'<project_serial_number>',
'recorder',
'active',
(SELECT id FROM "DeviceConfigs" WHERE config_type = '<audible>' or '<ultrasonic>')
);
```
For both cameras and recorders:
```
insert into "Devices" (
  serial_number,
  brand
) values (
  '<serial_number>',
  '<brand>'
);

update "ProjectDevices"
set device_id = (
  select id from "Devices" where serial_number = '<serial_number>'
)
where project_serial_number = '<project_serial_number>' and project_id = '<project_id>';
```
(the current brands registered are: SOLARIS/ AUDIOMOTH/ Song Meter Micro/ Camara RECONYX /Go Pro/ Browning)
## Create Geometry (projects with configuration 2 or 3: sites or sampling areas)
*Requirements:*
> - GeoJSON with polygons of every site of the project (this GeoJSON can be obtained through the RS API if it exists, or requested from someone on the RS team)
*Tables used:*
>- Projects
>- Sites
1. Enter to the [RS API login](https://remotesensing.services.biometrio.earth/api/)
2. Enter to the [RS API](https://remotesensing.services.biometrio.earth/api/apidoc/)
3. Use the GET /api/project/{uuid} endpoint:
- Input the UUID of the project, which is the same in the "Projects" table from Balam and RS.
`SELECT id FROM "Projects" WHERE shortname = '<shortname>'`
- Execute the request and download the JSON response.
- Extract the UUIDs of the sites from the project JSON.
4. Use the GET /api/projectsite/{uuid} endpoint:
- Input the UUID of each site and download the JSON response.
- Extract the MULTIPOLYGON data from the site JSON.
5. Go to balam_test_db and run the following query for each site:
```
INSERT INTO "Sites" (
identifier,
geometry,
project_id
) VALUES (
'<SampleSite>',
ST_GeomFromText('MULTIPOLYGON(((<coordinates>)))'),
'<project_id>'
);
```
6. Copy the sites created to balam_db
## Deployments creation
*Tables used:*
> - Sites
> - SamplingAreas
> - SamplingPoints
### Case Tochtli
*Requirements:*
> - Kobo questionnaires with deployment information: date_deployment, coordinates, device configuration
1. Go to [AWS Batch Jobs](https://eu-west-1.console.aws.amazon.com/batch/home?region=eu-west-1#jobs/list) and filter by the job queue (PufferfishKoboJobQueue).
2. Click on **Submit new job** and fill in the following fields:
- **Name**: (Choose a descriptive job name)
- **Job Definition**: Select `PufferfishKoboBatchJobDef`
- **Job Queue**: Select `PufferfishKoboJobQueue`
3. Click **Next**, then:
- Under **Container overrides**, enter:
`["/app/app-kobo.sh"]`
- Under **Environment Variables**, define two variables:
- `ENV_CASE`
- `SHORTNAME`
Example:
`ENV_CASE=test`
`SHORTNAME=221003_VaNa_ZMB_Tondwa`
4. Click **Next**, then **Submit job**.
5. Once the job completes, generate the report (`data_intake`).
#### Troubleshooting & Kobo Validation
If any errors occur, validate Kobo form submissions manually.
Common issues:
❌ SCAN columns are empty or missing JSON
❌ _submitted_by is not biometrio_field
❌ Kobo project name doesn’t match the Balam shortname
#### Reset Execution Tracking (if forms were submitted but not processed)
Check recent executions with:
```sql
-- Check last run of Tochtli for the project
SELECT * FROM lastexecutions
WHERE project = '<shortname>';
-- Check last device collection run
SELECT * FROM collectiondeviceexecutions
WHERE project = '<shortname>';
```
If needed, delete the most recent row (the failed run) to allow reprocessing.
Tochtli only processes Kobo forms submitted after the last registered execution date; if a form was submitted before that timestamp, it will be skipped.
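A hedged example of deleting the most recent (failed) run; the `created_at` column name is an assumption, so adapt it to the actual schema:
```sql
DELETE FROM lastexecutions
WHERE project = '<shortname>'
  AND created_at = (
    SELECT MAX(created_at) FROM lastexecutions WHERE project = '<shortname>'
  );
```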
### Manual Case
*Requirements:*
> - Table with deployment information: date_deployment, node, site, coordinates, device configuration
1. Create nodes in balam_test_db.
In case the project configuration has nodes, use the following query:
```
INSERT INTO "SamplingAreas" (
identifier, -- node name
site_id,
project_id
) VALUES (
'<name of node>',
'<site_id>',
'<project_id>'
);
```
2. Create deployments in balam_test_db using the following query.
Case recorders:
```
insert into "SamplingPoints" (
date_deployment,
deployment_iteration,
metadata,
project_id,
sampling_area_id,
site_id,
location,
device_config,
device_id
) values (
'<date_deployment YYYY-MM-DD>',
'<deployment_iteration>',
'{
"monitor": "",
"sd_card": "<sd_card>",
"uploaded_files": false
}',
(select id from "Projects" where shortname ilike '<shortname>'),
(select id from "SamplingAreas" where identifier = '<node name>'),
(select id from "Sites" where identifier = '<site name>'),
'POINT(<long> <lat>)',
'<ultrasonic/audible>',
(select id from "ProjectDevices" where project_serial_number = '<project_serial_number>' and project_id='<>')
);
```
Case cameras:
```
insert into "SamplingPoints" (
date_deployment,
deployment_iteration,
metadata,
project_id,
sampling_area_id,
site_id,
location,
device_id
) values (
'<date_deployment YYYY-MM-DD>',
'<deployment_iteration>',
'{
"monitor": "",
"sd_card": "<sd_card>",
"uploaded_files": false
}',
(select id from "Projects" where shortname ilike '<shortname>'),
(select id from "SamplingAreas" where identifier = '<node name>'),
(select id from "Sites" where identifier = '<site name>'),
'POINT(<long> <lat>)',
(select id from "ProjectDevices" where project_serial_number = '<project_serial_number>' and project_id = '<>')
);
```
3. Copy the nodes and deployments created in steps 1 and 2 to balam_db.
4. Create the data_intake.
### Create Sampling Points current deployments using master
This SQL script duplicates sampling points from a previous deployment to set up a new one within the same project. It's particularly useful when initializing a new deployment based on an existing configuration ("master").
*Required Variables*
> - `<#new_deployment>` – The deployment iteration number for the new deployment you want to create.
> - `<#old_deployment>` – The deployment iteration number of the existing deployment you want to copy from.
> - `<shortname>` – The project shortname to identify which project's sampling points to clone.
Replace these placeholders in the script before execution.
*Notes*
date_deployment and date_collected are set to **NULL** by default in this script.
If all sampling points in the master deployment have the same dates, you can update date_deployment and date_collected directly in the insert statement.
If the dates differ across sampling points, leave them as NULL during insertion, and manually update them afterward using the Master Excel sheet for the project.
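A minimal sketch of that follow-up update, filtered by project and new deployment (narrow the WHERE clause per sampling point if the dates differ):
```sql
UPDATE "SamplingPoints"
SET date_deployment = '<YYYY-MM-DD>',
    date_collected  = '<YYYY-MM-DD>'
WHERE project_id = (SELECT id FROM "Projects" WHERE shortname = '<shortname>')
  AND deployment_iteration = <#new_deployment>;
```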
Example Usage
Here’s a hypothetical example of how you might call this if:
new deployment = 4
master deployment = 3
shortname = '011223_CO2O_IDN_b.e.complete'
```sql
INSERT INTO "SamplingPoints" (
id,
deployment_iteration,
updated_at,
created_at,
date_deployment,
date_collected,
altitude,
metadata,
project_id,
sampling_area_id,
site_id,
location,
ecosystem_id,
device_id,
additional_identifier,
identifier,
device_config,
description
)
SELECT
gen_random_uuid() as id,
<#new_deployment> as deployment_iteration,
NOW() as updated_at,
NOW() as created_at,
NULL,
NULL,
altitude,
metadata,
project_id,
sampling_area_id,
site_id,
location,
ecosystem_id,
device_id,
additional_identifier,
identifier,
device_config,
description
FROM "SamplingPoints"
WHERE project_id = (
SELECT id FROM "Projects"
WHERE shortname = '<shortname>'
)
AND deployment_iteration = <#old_deployment>;
```
[Tutorial_video](https://biometrioearth.sharepoint.com/:v:/r/sites/data-pro-and-syss-dev-daily-stand-ups-dm/Freigegebene%20Dokumente/dm/references/FDE_operations_manual/Tutorials/Create_copy_deployment_on_sp.mov?csf=1&web=1&e=9me3cP)
⚠️ Important: Don't forget that the metadata must include `"uploaded_files": false` so the data can be uploaded.
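If the cloned sampling points carried over `uploaded_files = true` from the master deployment, the flag can be reset with a query like the following (mirroring the jsonb_set update used in the Reporting section):
```sql
UPDATE "SamplingPoints"
SET metadata = jsonb_set(metadata, '{uploaded_files}', 'false'::jsonb)
WHERE project_id = (SELECT id FROM "Projects" WHERE shortname = '<shortname>')
  AND deployment_iteration = <#new_deployment>;
```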
## Testing Upload
*Requirements:*
> - Data test folders audio and images_videos, [examples](https://biometrioearth.sharepoint.com/sites/data-pro-and-syss-dev-daily-stand-ups-dm/Freigegebene%20Dokumente/Forms/AllItems.aspx?csf=1&web=1&e=MicDTI&ovuser=d2f8dc66%2Db687%2D4ba9%2D953e%2D49a026cac5fb%2Cdanahi%2Eramos%40biometrio%2Eearth&OR=Teams%2DHL&CT=1717456341725&clickparams=eyJBcHBOYW1lIjoiVGVhbXMtRGVza3RvcCIsIkFwcFZlcnNpb24iOiIyOC8yNDA1MDUwMTYwMSIsIkhhc0ZlZGVyYXRlZFVzZXIiOmZhbHNlfQ%3D%3D&CID=ce282fa1%2D704f%2D9000%2D1afb%2Da4a47d50f268&cidOR=SPO&FolderCTID=0x012000AB198F220883F147809B2B5C7E08581F&id=%2Fsites%2Fdata%2Dpro%2Dand%2Dsyss%2Ddev%2Ddaily%2Dstand%2Dups%2Ddm%2FFreigegebene%20Dokumente%2Fdm%2Fpufferfish%2Fwebclient)
> - configuration.txt files for audio folders with serial_number, [example](https://biometrioearth.sharepoint.com/sites/data-pro-and-syss-dev-daily-stand-ups-dm/Freigegebene%20Dokumente/Forms/AllItems.aspx?csf=1&web=1&e=MicDTI&ovuser=d2f8dc66%2Db687%2D4ba9%2D953e%2D49a026cac5fb%2Cdanahi%2Eramos%40biometrio%2Eearth&OR=Teams%2DHL&CT=1717456341725&clickparams=eyJBcHBOYW1lIjoiVGVhbXMtRGVza3RvcCIsIkFwcFZlcnNpb24iOiIyOC8yNDA1MDUwMTYwMSIsIkhhc0ZlZGVyYXRlZFVzZXIiOmZhbHNlfQ%3D%3D&CID=ce282fa1%2D704f%2D9000%2D1afb%2Da4a47d50f268&cidOR=SPO&FolderCTID=0x012000AB198F220883F147809B2B5C7E08581F&id=%2Fsites%2Fdata%2Dpro%2Dand%2Dsyss%2Ddev%2Ddaily%2Dstand%2Dups%2Ddm%2FFreigegebene%20Dokumente%2Fdm%2Fpufferfish%2Fwebclient%2FRE0002%2FCONFIG%2ETXT&viewid=74c167fb%2Ddc78%2D4f34%2D9964%2D096f155c5ff8&parent=%2Fsites%2Fdata%2Dpro%2Dand%2Dsyss%2Ddev%2Ddaily%2Dstand%2Dups%2Ddm%2FFreigegebene%20Dokumente%2Fdm%2Fpufferfish%2Fwebclient%2FRE0002)
> - The monitor token should have an expiration date that allows the uploading in the scheduled date.
*Tables:*
> - SamplingPoints
> - UploadSessions
> - ProjectDevices
1. On [PF-Client test](https://umojatest.biometrio.earth/login), log in using the monitor and token credentials created in step 3 of the "Create Project" section. To check the token expiration date:
```
select * from "SessionTokens" where token = '< token >'
```
2. Upload the data folders from the local computer. [Tutorial video](https://www.youtube.com/watch?si=b45bL8wgNJFlyFxt&v=QtJzW5t-m9E&feature=youtu.be)
3. Review the S3 paths to confirm the information is well organized.
4. Review the SamplingPoints table in balam_test_db to confirm that the audio entries have uploaded_files = true.
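> *Suggestion:* the flag can be checked with a query such as:
```sql
SELECT id, metadata ->> 'uploaded_files' AS uploaded_files
FROM "SamplingPoints"
WHERE project_id = (SELECT id FROM "Projects" WHERE shortname ILIKE '%<shortname>%');
```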
5. Clean up the tests:
- In SamplingPoints, change uploaded_files to false (if the "Mark upload" button was clicked).
- In ProjectDevices, review the cases where the SD cards were used and erase the device_id for those cases.
- In UploadSessions, run this query and erase the resulting rows:
```
select * from "UploadSessions" where project_id = (
select id from "Projects" where shortname ilike '%<shortname>%'
)
```
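A hedged sketch of the cleanup in step 5 (adjust the filters to the devices and sessions actually used in the test before running anything):
```sql
-- Revert the upload flag on the sampling points used in the test
UPDATE "SamplingPoints"
SET metadata = jsonb_set(metadata, '{uploaded_files}', 'false'::jsonb)
WHERE project_id = (SELECT id FROM "Projects" WHERE shortname ILIKE '%<shortname>%');

-- Detach the device whose SD card was used during the test
UPDATE "ProjectDevices"
SET device_id = NULL
WHERE project_serial_number = '<project_serial_number>'
  AND project_id = (SELECT id FROM "Projects" WHERE shortname ILIKE '%<shortname>%');

-- Erase the test upload sessions found with the query above
DELETE FROM "UploadSessions"
WHERE project_id = (SELECT id FROM "Projects" WHERE shortname ILIKE '%<shortname>%');
```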
# 📥 After Data Arrival
## Backup
*Requirements:*
> Data already uploaded by the Pufferfish-Client
> The **shortname** of the project and the **date** from which the backup should start
*Tables:*
> "UploadSessions"
When data is uploaded and ready for processing, we trigger a **backup step** to copy all associated raw files from the primary bucket (`s3://be-upload/`) to a backup bucket (`s3://be-upload-backup/`).
This is done by submitting a **pre-built AWS Batch job**, which:
- Queries the `UploadSessions` table in BalamDB
- Gets all `dir_path` entries that match the given project and date
- Performs `aws s3 sync` for each path
- Stores the backup with `GLACIER_IR` storage class
Follow these steps to manually trigger a backup job via the **AWS Batch Console**:
1. Go to [**AWS Batch > Job definitions**](https://eu-west-1.console.aws.amazon.com/batch/home?region=eu-west-1#job-definition)
- Select: `PufferfishBackupBatchJobDef`
- Click: **Actions → Submit new job**
2. Fill in the Job Details
- **Job name**:
Use a clear, unique name.
Example: backup-011223_CO2O_IDN-9
Use dashes `-` instead of special characters like `.` or `_` to avoid errors
3. Choose Job Queue: `PufferfishBackupBatchJobQueue`
4. Override the Command
In the **"Container overrides"** section:
- Enable **Command override**
- Paste this as the command:
```json
["/app/backup.sh"]
```
5. Add Environment Variables
Click **“Add environment variable”** and add the following two variables:
| Name | Value Example |
|----------|----------------------------------|
| `PROJECT`| `011223_CO2O_IDN_b.e.complete` |
| `DATE` | `2025-03-10` |
> **Note:** The `DATE` should be **a day before** the client started uploading files for this project; it is taken from the `created_at` column in `UploadSessions`.
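> *Suggestion:* the earliest upload session for the project can be queried in BalamDB to find this date, e.g.:
```sql
SELECT MIN(created_at)
FROM "UploadSessions"
WHERE project_id = (SELECT id FROM "Projects" WHERE shortname = '<shortname>');
```
Use the day before the returned date as `DATE`.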
6. Submit the Job
- Click **Next**
- Review all the job configuration
- Click **Submit job**
7. Monitoring the Job
- Go to **AWS Batch > Jobs**
- Click on the submitted job to view:
- `Status`: `RUNNING`, `SUCCEEDED`, `FAILED`
- **CloudWatch Logs**: to track which files were backed up or skipped
[Tutorial Video](https://biometrioearth.sharepoint.com/:v:/r/sites/data-pro-and-syss-dev-daily-stand-ups-dm/Freigegebene%20Dokumente/dm/references/FDE_operations_manual/Tutorials/backup.mp4?csf=1&web=1&e=2yaF0u)
## Data Intake
Data Intake is used across several steps, as it allows us to track all deployments by cycle and view information such as coordinates, file paths, and metadata related to each `SamplingPoint`. It shows what has been processed, how many files were handled, and whether any errors occurred.
This process integrates data from the **Pufferfish**, **Balam**, and **Michi** databases to generate comprehensive reports.
It requires only two parameters:
- `shortname`
- `env_case` — either `"production"` or `"test"`
example payload:
```json
{
"shortname": "011223_CO2O_IDN_b.e.complete",
"env_case": "production"
}
```
Or, in case you need just one specific deployment:
```json
{
"shortname": "011223_CO2O_IDN_b.e.complete",
"env_case": "production",
"deployment": 7
}
```
With this payload, you can access the `data_intake` [Lambda](https://eu-west-1.console.aws.amazon.com/lambda/home?region=eu-west-1#/functions/data_intake?subtab=permissions&tab=code) function and run a test.
You can access the reports at this location [s3://pufferfish-test/report_deployments/](https://eu-west-1.console.aws.amazon.com/s3/buckets/pufferfish-test?prefix=report_deployments/&region=eu-west-1&bucketType=general) under the shortname of the project.
[Tutorial Video](https://biometrioearth.sharepoint.com/:v:/r/sites/data-pro-and-syss-dev-daily-stand-ups-dm/Freigegebene%20Dokumente/dm/references/FDE_operations_manual/Tutorials/data_intake.mov?csf=1&web=1&e=Z5ZArH)
### Missing Information Review
If there is a path name with missing information, request the data and update balam_test_db and balam_db. Typically, this missing information could include the identifier in the "SamplingAreas" table, an incorrect date_deployment in "SamplingPoints," or no data in paths that should contain information.
If there was a change in Balam, move the data from the incorrect path to the correct path in S3. Finally, update the SamplingPoints table in Balam by setting uploaded_files=true
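For example, for a single sampling point (a minimal sketch using the same jsonb_set pattern as in the Reporting section):
```sql
UPDATE "SamplingPoints"
SET metadata = jsonb_set(metadata, '{uploaded_files}', 'true'::jsonb)
WHERE id = '<sampling_point_id>';
```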
## Organizer (Now handles images, videos, and audio data)
*Requirements:*
> List of media paths (images_video_raw or audio_raw) to process
> The previous data_intake cycle must have pf_client_mark = TRUE for the relevant entries
> For audio recorded by an Audiomoth, the CONFIG.TXT file is mandatory
To run the media organizer, you'll need the list of S3 paths for the specific cycle you want to process, under the `images_video_raw` or `audio_raw` directories.
For example:
```
FieldData/011223_CO2O_IDN_b.e.complete/data/images_video_raw/1741947765973
FieldData/011223_CO2O_IDN_b.e.complete/data/audio_raw/1741947765971
FieldData/011223_CO2O_IDN_b.e.complete/data/images_video_raw/1742015326403
FieldData/011223_CO2O_IDN_b.e.complete/data/audio_raw/1742015332277
```
> *Suggestion:* You can fetch this list from **BALAM** using a query like:
>
> ```sql
> SELECT * FROM "UploadSessions"
> WHERE dir_path LIKE '%011223_CO2O_IDN_b.e.complete%raw%'
> AND created_at > '2025-03-10'
> ```
>
> Use the same `created_at` cutoff date as the **backup** job to ensure consistency.
### Note: case where a directory recorded by an Audiomoth doesn't have CONFIG.TXT
The Organizer extracts the Audiomoth serial number from the CONFIG.TXT file. In case this file is not present, the audible and ultrasonic templates below can be created and included in the directories; the Device ID field should be set accordingly. It is recommended to delete this CONFIG.TXT once the Organizer finishes, as it was not uploaded by the customer.
Ultrasonic:
```
Device ID : <this is the field to be changed for the organizer, example: 243B1F0663FBEEEA>
Firmware : AudioMoth-Firmware-Basic (1.8.1)
Time zone : UTC+7
Sample rate (Hz) : 384000
Gain : Medium
Sleep duration (s) : 1170
Recording duration (s) : 30
Active recording periods : 1
Recording period 1 : 11:00 - 23:10 (UTC)
Earliest recording time : ---------- --:--:--
Latest recording time : ---------- --:--:--
Filter : -
Trigger type : -
Threshold setting : -
Minimum trigger duration (s) : -
Enable LED : Yes
Enable low-voltage cut-off : Yes
Enable battery level indication : Yes
Always require acoustic chime : No
Use daily folder for WAV files : No
Disable 48Hz DC blocking filter : No
Enable energy saver mode : No
Enable low gain range : No
Enable magnetic switch : No
Enable GPS time setting : No
```
Audible:
```
Device ID : <this is the field to be changed for the organizer, example: 243B1F0663FBEEEA>
Firmware : AudioMoth-Firmware-Basic (1.8.1)
Time zone : UTC
Sample rate (Hz) : 48000
Gain : Medium
Sleep duration (s) : 540
Recording duration (s) : 60
Active recording periods : 1
Recording period 1 : 00:00 - 24:00 (UTC)
Earliest recording time : ---------- --:--:--
Latest recording time : ---------- --:--:--
Filter : -
Trigger type : -
Threshold setting : -
Minimum trigger duration (s) : -
Enable LED : Yes
Enable low-voltage cut-off : Yes
Enable battery level indication : Yes
Always require acoustic chime : No
Use daily folder for WAV files : No
Disable 48Hz DC blocking filter : No
Enable energy saver mode : No
Enable low gain range : No
Enable magnetic switch : No
Enable GPS time setting : No
```
### Triggering Media Organizer via AWS Lambda
1. **Open the Lambda Function**
Go to the AWS Console and open [lambda](https://eu-west-1.console.aws.amazon.com/lambda/home?region=eu-west-1#/functions/PufferfishApplicationDevStack-StartOrganizerMediaBatchJob?tab=code):
`PufferfishApplicationDevStack-StartOrganizerMediaBatchJob`
2. **Create a New Test Event**
- Click **"Test"** → **"Create new test event"**
- Use the following JSON template:
```json
{
"object_name": [
"FieldData/011223_CO2O_IDN_b.e.complete/data/images_video_raw/1741947765973",
"FieldData/011223_CO2O_IDN_b.e.complete/data/images_video_raw/1742015326403",
"FieldData/011223_CO2O_IDN_b.e.complete/data/images_video_raw/1742015332277"
],
"env_case": "production",
"shortname": "011223_CO2O_IDN_b.e.complete",
"bucket": "be-upload"
}
```
> `object_name` should be the list of full media paths to process.
3. **Invoke the Lambda**
- After filling in the test event, click **"Test"** (or **"Invoke"**)
[Tutorial Video](https://biometrioearth.sharepoint.com/:v:/r/sites/data-pro-and-syss-dev-daily-stand-ups-dm/Freigegebene%20Dokumente/dm/references/FDE_operations_manual/Tutorials/media_organizer.mp4?csf=1&web=1&e=n7vYgh)
### Monitor the Job
- Go to [**AWS Batch > Jobs**](https://eu-west-1.console.aws.amazon.com/batch/home?region=eu-west-1#jobs/list)
- Look under the **Organizer** job queue
- You can check:
- `Status`: `RUNNING`, `SUCCEEDED`, `FAILED`
- **CloudWatch Logs** for processing details
### QA
The Organizer moves files to one of four different destinations:
1. This path is where the files are moved once they are processed. If the correct path cannot be determined for the images and videos, a manual review of the directory content is required to assess which path the camera data should be moved to.
```
FieldData/<project>/images_videos_raw/processed/<serial_number>
```
2. This path is used when there are different serial numbers, or when the serial number cannot be extracted due to camera settings on the label.
```
FieldData/<project>/images_videos_raw/to_review_serialnumber/
```
You can review the files in this folder. For example, the following path might be relevant: `s3://be-upload/FieldData/<shortname>/data/images_video_raw/to_review_serialnumber/1747889529867/serial_information.txt`
The serial_information.txt file might be empty, which means no serial number could be extracted, or it might contain data like the following:
```txt
Folder: data/
File Type: images
Sampled Indices Tried: [0, 67, 135, 202, 269]
Extracted Serial Numbers: ['010001', '010002', '010002', '010002', '010002']
Folder: data/
File Type: videos
Sampled Indices Tried: [0, 33, 67, 101, 134]
Extracted Serial Numbers: ['010001', '010002', '010002', '010002', '010002']
```
This indicates that multiple devices were found in the path.
3. This path is used when a file cannot be found in Balam due to the sampling point not existing for that device, or because there is no new sampling point. This might happen if the path refers to previous sampling points that have already been processed, or if the data has been re-uploaded from the same sampling point.
```
FieldData/<project>/images_videos_raw/to_review_balam/
```
4. to_review_invalid_filetypes/ (Mixed file types)
If, for example, .wav files are found inside an images_video_raw/ folder or .jpg inside an audio_raw/ folder, the batch is aborted and moved to:
```
FieldData/<shortname>/data/<audio_raw|images_video_raw>/to_review_invalid_filetypes/<original_folder>/
```
### Handling Split Media Uploads and Folder Overwrite Conflicts
Sometimes, camera media (images and videos) are uploaded in separate folders at different times. The Organizer detects this and may prevent overwriting if the target folder in S3 already exists.
When this happens, the Organizer moves the folder to:
```
FieldData/<project>/images_videos_raw/to_review_balam/
```
When does this happen?
- The Organizer detects that the target S3 path (based on serial number and sampling point) already exists, and
- the upload folder contains only one type of file (e.g., only .jpg or only .mp4), and
- the existing S3 path contains the opposite type.
This check is used to avoid unintentional overwrites.
✅ Safe case (the Organizer allows it):
- Folder A (already uploaded) contains only .jpg
- Folder B (new folder) contains only .mp4
- Result: both are merged into the same sampling point folder (no overwrite risk)
🚫 Conflict case (the Organizer blocks it):
- Folder A already exists with .jpg
- Folder B also contains .jpg and would go to the same path
- Result: Folder B is not uploaded and the Organizer moves it to to_review_balam/ for manual review
🛠️ What to do if this happens
1. Go to BALAM and find the sampling point associated with the serial number.
2. Set the metadata field uploaded_files = false (see the query after this list).
3. Clone and re-run the Organizer job for the folder in to_review_balam/. This retries the upload now that the sampling point is marked as "not uploaded".
4. Important: this only works if the two folders contain different file types (e.g., .jpg and .mp4). If both contain the same type (e.g., both .jpg), the Organizer cannot merge them; you must do it manually.
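A hedged sketch for step 2, locating the sampling point through its ProjectDevices entry:
```sql
UPDATE "SamplingPoints"
SET metadata = jsonb_set(metadata, '{uploaded_files}', 'false'::jsonb)
WHERE device_id = (
  SELECT id FROM "ProjectDevices"
  WHERE project_serial_number = '<project_serial_number>'
    AND project_id = '<project_id>'
)
AND deployment_iteration = <deployment_iteration>;
```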
🧠 Why this matters
This behavior ensures that:
- Media from the same sampling point can be safely merged when they are complementary
- Conflicts are caught early, before existing data is overwritten
- Manual review is required when the Organizer cannot guarantee safe handling
Also, since the Organizer's internal check only samples a subset of the images and videos, it is strongly suggested to do a manual QA, as shown in the following video clips, by randomly selecting files to visualize and confirming they are in the right paths:
[organizer_qa clip 1](https://biometrioearth.sharepoint.com/:v:/r/sites/data-pro-and-syss-dev-daily-stand-ups-dm/Freigegebene%20Dokumente/dm/references/FDE_operations_manual/Tutorials/organizer_qa/1.mp4?csf=1&web=1&e=bb1RdE)
[organizer_qa clip 2](https://biometrioearth.sharepoint.com/:v:/r/sites/data-pro-and-syss-dev-daily-stand-ups-dm/Freigegebene%20Dokumente/dm/references/FDE_operations_manual/Tutorials/organizer_qa/2.mp4?csf=1&web=1&e=LsiMNh)
## File Ingestion via Beadmex & Tochtli (Balam)
*Requirements:*
> Data must already be in the correct paths where it will be processed
> Data intake up to date (consistency)
> Project sensors should be correctly registered in the Devices and ProjectDevices tables inside BalamDB
> The latest beadmex Docker image is registered in the AWS ECR service
> The latest tochtli GitHub changes have been pulled and its Docker image is registered in the AWS ECR service
*Databases:*
> [Pufferfish tables](https://dbdiagram.io/d/Puferffish-DB-652f8ba7ffbf5169f0ebd243)
> [Michi tables](https://dbdiagram.io/d/Michi-DB-65c23a17ac844320ae910a7f)
### Before Starting
1. Review the consistency of the **data_intake** excel
- All the paths that will be ingested have the columns pf_client_mark = TRUE and exist_path_bucket = TRUE.
If they don't (which is probably the case for the cameras), you can manually update them in BalamDB if you are sure that the customer uploaded that directory and the path exists in S3.
- For audio paths review column audio_type (audible/ultrasonic) and compare it with the one in the s3_path column

If there is a case where they don't match (for example, the audio_type column says `audible` while the s3_path points to an `ultrasonic` directory), the name under the audio_type column should be used. For the example provided it will be `audible`. This change should be reflected in:
* the `SamplingPoints` table in BalamDB, under the column `device_config`, for the sampling point id that is listed as a column in the corresponding sheet of the data_intake excel (a sketch of this update is shown after this list)
* the audio files moved to the correct s3_path; in the example this will be `FieldData/230620_GFAC_CIV_Boss/data/Bossematie/230022/2025-01-21/audios/audible/`
* the data_intake regenerated to reflect the changes
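A minimal sketch of that `SamplingPoints` update, using the sampling point id taken from the data_intake sheet:
```sql
UPDATE "SamplingPoints"
SET device_config = 'audible'  -- the value taken from the audio_type column
WHERE id = '<sampling_point_id>';
```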

2. Select the paths that are ready to process. They are under the `s3_path` column in the data_intake excel.
3. Ensure the sensors used in the cycle are registered in `Devices` and `ProjectDevices` inside BalamDB for the corresponding project.
For the cameras, it is only necessary to have one camera with a non-null `device_id` column in the `ProjectDevices` table, which can be achieved by executing:
```
insert into "Devices" (
serial_number,
brand
) values (
'<could be a serial number generated by biometrio or of the device>',
'<brand>'
)
```
(the current brands registered are: SOLARIS/ AUDIOMOTH/ Song Meter Micro/ Camara RECONYX /Go Pro/ Browning)
followed by:
```
update "ProjectDevices"
set device_id = (
select id from "Devices" where serial_number = '<serial number>'
)
where project_serial_number = '<serial number used by the customer>' and project_id = '<project id>'
```
For the Audiomoths, the last two queries should be executed for every project Audiomoth that is missing a mapping between `project_serial_number` and `serial_number`.
4. Ensure the latest beadmex Docker image version is in the AWS ECR service.
5. Ensure tochtli has its latest changes from its GitHub repo. In case they need to be updated, a Cloud9 instance can be used to push to ECR.
#### Beadmex docker image push to ECR (only needed if there were changes in beadmex docker image)
1. Open the Cloud9 [pufferfish-dev-container](https://eu-west-1.console.aws.amazon.com/cloud9control/home?region=eu-west-1#/environments/cd956259ee6d40a3a94f21251d9bb60c) IDE
2. Execute:
```
cd pufferfish/app/beadmex/
```
3. Update the `Dockerfile` base image to the latest beadmex version and build:
```
FROM biometrioearth/dev:beadmex_1.9   # change this tag to the latest version
```
```
docker build --no-cache -t beadmex-app .
```
4. AWS ECR login, tag, and push. These are common ECR commands, which can be found in the AWS ECR console using the **View push commands** button.
#### Tochtli get latest version (only needed if there were changes in the tochtli Github repository)
1. Open the Cloud9 [pufferfish-dev-container](https://eu-west-1.console.aws.amazon.com/cloud9control/home?region=eu-west-1#/environments/cd956259ee6d40a3a94f21251d9bb60c) IDE
2. Execute:
```
cd pufferfish/app/tochtli/tochtli-origin/
git pull
```
3. Docker build
```
docker build -t tochtli-app .
```
4. AWS ECR login, tag, and push. These are common ECR commands, which can be found in the AWS ECR console using the **View push commands** button.
### Triggering Metadata Ingestion via Lambda
To start the ingestion process into **Balam** (using `beadmex` and `tochtli`), follow these steps:
1. Access the [Lambda Function](https://eu-west-1.console.aws.amazon.com/lambda/home?region=eu-west-1#/functions/PufferfishApplicationDevStack-SendPathsReadyPufferfish?tab=code)
Open the following Lambda function in the AWS Console:
`PufferfishApplicationDevStack-SendPathsReadyPufferfish`
This function sends the list of paths to be processed.
2. Write the List of Files to Process
Prepare a list of S3 paths in the following format (available in the data_intake excel in the corresponding sheet):
```json
[
"FieldData/240321_FiSt_DEU_In-situ_starter/data/Alt_Madlitz/CA0001/2025-01-07/images_videos/",
"FieldData/240321_FiSt_DEU_In-situ_starter/data/Alt_Madlitz/CA0003/2025-01-07/images_videos/",
"FieldData/240321_FiSt_DEU_In-situ_starter/data/Alt_Madlitz/CA0005/2025-01-07/images_videos/",
"FieldData/240321_FiSt_DEU_In-situ_starter/data/Alt_Madlitz/CA0006/2025-01-07/images_videos/",
"FieldData/240321_FiSt_DEU_In-situ_starter/data/Alt_Madlitz/RE0006/2025-01-07/audios/audible/"
]
```
Click on 'Deploy', and once it's saved, click on 'Test' to run the paths. **The next process is automatic and only needs to be visually monitored.**
3. Background Process (Handled Automatically)
After sending the paths:
- Messages are sent to [SQS](https://eu-west-1.console.aws.amazon.com/sqs/v3/home?region=eu-west-1#/queues) queue **`Pufferfish-ReceptionQueue`**.
- A Lambda function reads from that queue and registers the metadata.
- You don’t need to take any action here — just wait.
If something takes too long or fails, messages will appear in the **Dead Letter Queue**:
`Pufferfish-DLQRegister`
4. If Messages Fail (DLQ Handling)
If there are messages in the DLQ:
1. Go to the **`Pufferfish-DLQRegister`** queue.
2. Click **"Start DLQ redrive"** to retry the failed messages, then click the orange **DLQ redrive** button on the far right.
They will be re-sent to the queue and picked up by the Lambda again.
5. Monitoring via [Step Function](https://eu-west-1.console.aws.amazon.com/states/home?region=eu-west-1#/statemachines/view/arn%3Aaws%3Astates%3Aeu-west-1%3A395847341459%3AstateMachine%3APufferfishAppWorkflow?type=standard)
The ingestion process is orchestrated by the Step Function:
`PufferfishAppWorkflow`
Once the data is registered in the **Reception** table, this workflow will automatically start.
You only need to wait for all jobs to complete.
6. Final Step: Re-run Data Intake
Once all jobs are done:
- Re-run the **data intake** process.
- This checks if all files were processed correctly and if anything is missing.
[Tutorial Video](https://biometrioearth.sharepoint.com/:v:/r/sites/data-pro-and-syss-dev-daily-stand-ups-dm/Freigegebene%20Dokumente/dm/references/FDE_operations_manual/Tutorials/files_ingestion.mp4?csf=1&web=1&e=wXfbDH)
The system described above was developed in AWS CDK with the following [CloudFormation](https://eu-west-1.console.aws.amazon.com/composer/canvas?action=update&region=eu-west-1&srcConsole=cloudformation&stackId=arn%3Aaws%3Acloudformation%3Aeu-west-1%3A395847341459%3Astack%2FPufferfishApplicationDevStack%2Fbbc616e0-4c9a-11ee-bfca-0a007ac8db1d#) stack, orchestrating the [beadmex](https://github.com/biometrioearth/beadmex/tree/develop) and [tochtli](https://github.com/biometrioearth/tochtli) processes.
### Checking ingestion process was successful
In the data_intake excel, check the numbers:
* for the cameras, the columns num_files (equivalently, image_files plus video_files) plus corrupted_files plus daysdeployment should match the reception_files column.
* for the recorders, the columns num_files plus corrupted_files plus daysdeployment should match the reception_files column.
If these numbers don't match, it is suggested to check the CloudWatch logs of beadmex and tochtli:
To find the logs, select the execution (in the `PufferfishAppWorkflow` Step Function), then in the graph select the BatchJob step under either SubmitBeadmexJob or SubmitTochtliJob, and finally select the log stream name.
> **Note**: if there are daysdeployment numbers different from 0, see the Reporting section below.
> **Note**: if the previous data_intake numbers (`num_files`, `size_gb`, `number_monitoring_days`, `image_files`, `video_files`, `audio_type`, `corrupted_files`, `daysdeployment`, `reception_files`) need to be recomputed due to changes in the S3 bucket or PufferfishDB, the [Pufferfish-MonitorSQS](https://eu-west-1.console.aws.amazon.com/sqs/v3/home?region=eu-west-1#/queues/https%3A%2F%2Fsqs.eu-west-1.amazonaws.com%2F395847341459%2FPufferfish-MonitorSQS) queue can be used with the following JSON template:
```
{
"object_name": "FieldData/<path>/" ,
"env_case": "<test or production>",
"shortname": "<shortname>",
"bucket": "<bucket>"
}
```
### If a complement for a directory has been uploaded...
It needs to be ingested into BalamDB. For this case here's an example for the cameras. Assume that the directory
```
s3://be-upload/FieldData/011223_CO2O_IDN_b.e.complete/data/Sirukam/jhondri_nofrial/010007/2025-04-08/images_videos/
```
has been ingested into BalamDB but due to an update from the customer, the complete images and videos camera data is in this location
```
s3://be-upload/FieldData/011223_CO2O_IDN_b.e.complete/data/Sirukam/jhondri_nofrial/010007/2025-05-11/images_videos/
```
so the last path contains what has been previously ingested plus a complement.
For this scenario, locate in the PufferfishDB `processedfiles` table which files were ingested. In this example, the last image ingested was number 910 and the last video was number 404.
First, make a backup of the path with all the data:
```
s3://be-upload/FieldData/011223_CO2O_IDN_b.e.complete/data/Sirukam/jhondri_nofrial/010007/2025-05-11/images_videos/
```
Second, copy only the images and videos that were not ingested to the path where they should be. Then, to preserve the first JSON generated by beadmex, rename it inside the path that was already ingested (in the example, a `1st_batch_` prefix was added to the JSON file name).
After the copy is done, launch the ingestion as normal for the directory that was ingested in the first place:
```
s3://be-upload/FieldData/011223_CO2O_IDN_b.e.complete/data/Sirukam/jhondri_nofrial/010007/2025-04-08/images_videos/
```
If ingestion was successful then delete the second path with all the data:
```
s3://be-upload/FieldData/011223_CO2O_IDN_b.e.complete/data/Sirukam/jhondri_nofrial/010007/2025-05-11/images_videos/
```
## Reporting
*Requirements:*
> Data already processed and ingested in Balam
> Data intake is up to date: deployment and collection dates are checked, and
> processing is done for the images-videos paths, the audio paths, and the error/daysdeployment paths
There are 2 reports to be filled here — the templates for projects: [images_videos](https://biometrioearth.sharepoint.com/:x:/r/sites/data-pro-and-syss-dev-daily-stand-ups-dm/Freigegebene%20Dokumente/dm/pufferfish/reports/images_videos_%20shortname_report_template.xlsx?d=w9965220a5d1b477794e82b3c7b6a621e&csf=1&web=1&e=EoKd2n) and [audios](https://biometrioearth.sharepoint.com/:x:/r/sites/data-pro-and-syss-dev-daily-stand-ups-dm/Freigegebene%20Dokumente/dm/pufferfish/reports/audios_%20shortname_report_template.xlsx?d=w55b22571fdc948deb7a5676bcbb29dc4&csf=1&web=1&e=hzspIA).
1. Choose the folder with the **shortname** of the project.
2. Create a copy of the template and fill it with the necessary information from the project's **data_intake**.
3. For each report, create a **new page** with the cycle number of the project.
4. Fill the reports
5. For the summary sheet column "Monitoring cycle", the deployment and collection dates should be correct. To retrieve the correct collection dates, see the section **Data Completion in Balam - Collection date QA**.
#### Audios Report
| Column | Description |
|-----------------|------------------------------------------------------------------|
| `files` | Total number of files in each directory |
| `GB` | Total gigabytes of data in each directory |
| `monitoring days` | Number of monitoring days represented in the directory |
| `corrupted` | Number of corrupted files in each directory |
| `daysdeployment`| Number of files deployed for more than 40 days since the start |
To retrieve the minimum deployment date and the maximum collection date (after those have been checked for correctness), the following query can be executed on BalamDB:
```
select min(date_deployment), max(date_collected) from "SamplingPoints" where project_id = '<id of the project>'
and (device_config='audible' or device_config='ultrasonic')
and deployment_iteration=<number of deployment iteration>
```
#### Images-Videos Report
| Column | Description |
|-------------------|------------------------------------------------------------------|
| `files` | Total number of files in each directory |
| `GB` | Total gigabytes of data in each directory |
| `monitoring days` | Number of monitoring days represented in the directory |
| `images` | Number of images in each directory |
| `videos` | Number of videos in each directory |
| `corrupted` | Number of corrupted files in each directory |
| `daysdeployment` | Number of files deployed for more than 40 days since the start |
To retrieve the same dates for cameras (device_config is null), run:
```
select min(date_deployment), max(date_collected) from "SamplingPoints" where project_id = '<id of the project>'
and device_config is null
and deployment_iteration=<number of deployment iteration>
```
6. In case there are paths with `daysdeployment` files:
They are considered **new paths** and are formatted as:
```
FieldData/<path>/error/daysofdeployment/
```
These paths should contain a subset of the field data. If this is not the case, that is, if these paths hold all the field data for a corresponding delivery (there is no field data in the `FieldData/<path>` directory), then before registering them in BalamDB, check whether the files are in the correct path (for example, by checking the deployment date directory). If they are not, move the files to the correct path and relaunch the ingestion process one more time using the correct `FieldData/<path>`.
These new paths are sent as messages to the
[`Pufferfish-StartStepFunction` SQS queue](https://eu-west-1.console.aws.amazon.com/sqs/v3/home?region=eu-west-1#/queues/https%3A%2F%2Fsqs.eu-west-1.amazonaws.com%2F395847341459%2FPufferfish-StartStepFunction) in the following format:
```json
{
"bucket": "be-upload",
"object_name": "FieldData/<path>/error/daysdeployment/",
"shortname": "<shortname>",
"env_case": "<test or production>"
}
```
Each message triggers a new **Step Function** execution, so the number of activated Step Functions matches the number of messages in the queue.
The process waits until all Step Functions are marked as **`SUCCEEDED`** before continuing.
The columns `num_files`, `size_gb`, and `number_monitoring_days` for these new paths are retrieved from the **`michi`** table `pf_ingestdataconsistency`, specifically from the columns:
- `file_path`
- `balam_files`
- `size_gb`
- `number_monitoring_days`
> **Suggested query on Michi:**
```sql
SELECT
file_path,
balam_files,
size_gb,
number_monitoring_days,
image_files,
video_files
FROM pf_ingestdataconsistency
WHERE shortname = '<shortname>'
AND file_path LIKE '%error%'
AND created_at > '2025-01-01'; -- <process date>
```
These results must then be **manually added** to the `data_intake`.
7. Corrupt Files
Files that were not processed due to errors (e.g., could not be opened or are empty) are moved to the following path:
`path/error/corruptedfiles/`
These can be added to the report page by accessing the S3 location:
```
s3://pufferfish-test/report_deployments/<shortname>/production/errors/
```
The file will be named in the following format:
```
corruptedfiles_<deployment_iteration>_<shortname>.csv
```
You should **copy and paste** the contents of this file into the **"Corrupt Files"** section of the corresponding report page.
Alternatively, the corrupted files can be listed with the following code (either AWS credentials or a role attached to the instance can be used):
```python
import os

from cloudpathlib import CloudPath, S3Client

# Leave the keys empty to rely on the role attached to the instance
AWS_ACCESS_KEY_ID = ""
AWS_SECRET_ACCESS_KEY = ""

client = S3Client(
    aws_access_key_id=AWS_ACCESS_KEY_ID,
    aws_secret_access_key=AWS_SECRET_ACCESS_KEY,
)

PREPEND = "s3://"
BUCKET = ""
PATH = "FieldData/<path>/error/corruptedfiles"

path_s3 = os.path.join(PREPEND, BUCKET, PATH)
root_dir = CloudPath(path_s3, client=client)

# Print every corrupted file found under the error path
for f in root_dir.glob("*.*"):
    print(f)
```
[Tutorial Video ](https://biometrioearth.sharepoint.com/:v:/r/sites/data-pro-and-syss-dev-daily-stand-ups-dm/Freigegebene%20Dokumente/dm/references/FDE_operations_manual/Tutorials/reporting.mp4?csf=1&web=1&e=zeSVpm)
> **Note**: if the previous data_intake numbers (`num_files`, `size_gb`, `number_monitoring_days`, `image_files`, `video_files`, `audio_type`, `corrupted_files`, `daysdeployment`, `reception_files`) need to be recomputed due to changes in the S3 bucket or PufferfishDB, the [Pufferfish-MonitorSQS](https://eu-west-1.console.aws.amazon.com/sqs/v3/home?region=eu-west-1#/queues/https%3A%2F%2Fsqs.eu-west-1.amazonaws.com%2F395847341459%2FPufferfish-MonitorSQS) queue can be used with the following JSON template:
```
{
"object_name": "FieldData/<path>/error/daysdeployment/" ,
"env_case": "<test or production>",
"shortname": "<shortname>",
"bucket": "<bucket>"
}
```
**📌 After Reporting**
Remember: if there is no more data coming for the current cycle, make sure that all the `SamplingPoints.metadata ->> uploaded_files` fields are set to `"true"`.
This ensures that data for the next cycles is not mistakenly uploaded to a previous one.
```sql
UPDATE "SamplingPoints"
SET metadata = jsonb_set(metadata, '{uploaded_files}', 'true'::jsonb)
WHERE project_id = (
SELECT id FROM "Projects"
WHERE shortname = '<shortname>'
)
AND deployment_iteration = <deployment_iteration_number>;
```
# 🧩 Data Completion in Balam
## Nodes
Ensure that the distance between nodes (distance_km) is greater than 1.2 km using the following query on balam_db, as the nodes are sequentially numbered:
```
SELECT
sa1.identifier AS sampling_area_id_1,
sa2.identifier AS sampling_area_id_2,
ST_Distance(
ST_Transform(sa1.center_of_area, 3857),
ST_Transform(sa2.center_of_area, 3857)
) / 1000 AS distance_km
FROM
"SamplingAreas" sa1
JOIN
"SamplingAreas" sa2 ON sa1.id < sa2.id -- Avoids duplicate pairs and self-distance
JOIN
"Projects" p1 ON sa1.project_id = p1.id
WHERE
p1.shortname = '<shortname>'
and
sa2.project_id='<project_id>'
ORDER BY
distance_km;
```
If a pair of nodes is less than 1.2 km apart, merge these nodes by re-pointing their SamplingPoints to the node that is kept (the one with the smaller identifier) and deleting the other SamplingArea from the table.
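A hedged sketch of such a merge (verify both ids before running):
```sql
-- Re-point the sampling points of the node being removed to the node that is kept
UPDATE "SamplingPoints"
SET sampling_area_id = '<kept_sampling_area_id>'
WHERE sampling_area_id = '<removed_sampling_area_id>';

-- Then delete the merged node
DELETE FROM "SamplingAreas"
WHERE id = '<removed_sampling_area_id>';
```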
## Landcover
*Requirements*
> "SamplingPoints" for at least one deployment exist
0. If you're using Cloud9 [**pufferfish-dev-container**](https://eu-west-1.console.aws.amazon.com/cloud9control/home?region=eu-west-1#/), you can access everything at:
```bash
cd pufferfish/app/procedures/
```
**Note:** If the Docker image doesn't exist yet, build it with:
```bash
docker build -t landcover_geojson .
```
Then, continue with the next steps.
**Task 1**: Fetch the GeoJSON file for the project.
- Use the following command to fetch the GeoJSON file from S3:
```bash
docker run --rm -v ~/.aws:/root/.aws landcover_geojson get_geojson <shortname> <case>
```
Replace `<shortname>` with the project shortname and `<case>` with `test` or `production`.
This will upload a file to S3
```
s3://pufferfish-test/landcovers/<shortname>/
```
1. **Notify RS Team**: Inform the RS team that there is a new GeoJSON file located at:
[SharePoint - RS-GIS Collaboration](https://biometrioearth.sharepoint.com/sites/RS-GIS/Freigegebene%20Dokumente/Forms/AllItems.aspx?id=%2Fsites%2FRS%2DGIS%2FFreigegebene%20Dokumente%2FGeneral%2F10%5FCollaboration%2F02%5FData%5FManagement&viewid=c659de1d%2Dde93%2D47af%2D9f16%2Df094e8ae4cbd)
2. **Receive Updated GeoJSON**: The RS team will provide the updated GeoJSON file with the `class` field that refers to the landcover information.
3. **Verify GeoJSON File Name**:
- Ensure the file name includes the word `looking` (e.g., `example_looking.geojson`).
- Upload the updated GeoJSON file to the following S3 location:
```
s3://pufferfish-test/landcovers/<shortname>/
```
**Task 2**: Check for Missing Landcovers in the Database.
- Run the following command to verify if any landcover classes are missing in the `Ecosystems` table:
```bash
docker run --rm -v ~/.aws:/root/.aws landcover_geojson landcover_check_db <shortname> <case>
```
If there are missing classes, add them to the `Ecosystems` table in the `Balam` database.
Example:
`INSERT INTO "Ecosystems" (name) values ('Shrubland')`
**Task 3**: Fill the `environment_id` Field in the `SamplingPoints` Table.
- Use the following command to populate the `environment_id` field for missing entries in the `SamplingPoints` table:
```bash
docker run --rm -v ~/.aws:/root/.aws landcover_geojson fill_missing_landcovers <shortname> <case>
```
**Reference**: Refer to the GitHub repository for detailed instructions:
[GitHub - Pufferfish Procedures](https://github.com/biometrioearth/pufferfish/tree/develop/app/procedures)
## Collection date QA
*Requirements*
> BalamDB credentials in a .balam_db file located where date_collected_qa.py is run
> AWS credentials to read json in be-upload path or role attached to instance can be used
> [pewee_balam_models.py](https://github.com/biometrioearth/sandbox-scripts/blob/main/scripts/pewee_balam_models.py)
For completing or fixing the collection date of sensors according to the last file taken by the devices, use the following script [date_collected_qa.py](https://github.com/biometrioearth/sandbox-scripts/blob/main/scripts/date_collected_qa.py), which requires the parameters:
* Field data path
* Sampling Point ID
and installing the dependencies with `pip`:
```
pip install "cloudpathlib[s3]" peewee tzfpy loguru
```
```
python3 date_collected_qa.py --field_data_path "FieldData/<path>/" --sampling_point_id "<id sampling point>"
```
[Tutorial videos](https://biometrioearth.sharepoint.com/:f:/r/sites/data-pro-and-syss-dev-daily-stand-ups-dm/Freigegebene%20Dokumente/dm/references/FDE_operations_manual/Tutorials/date_collection_qa?csf=1&web=1&e=AKLIqy)
## Fixing file timezone inside file_metadata of Files table
*Requirements*
> BalamDB credentials
> AWS credentials to read json in be-upload path or role attached to instance can be used
File timezones could be stored incorrectly in the file_metadata of the Files table in BalamDB, as neither beadmex (before version 2.0) nor tochtli handled this. Therefore, the script [fix_filedate_not_reading_json.py](https://github.com/biometrioearth/sandbox-scripts/blob/main/scripts/fix_filedate_not_reading_json.py) is needed; it requires the id of the file whose timezone will be fixed.
```
python3 fix_filedate_not_reading_json.py --file_id $file_id
```
The latest beadmex Docker image version can be used, after installing cloudpathlib[s3] with `pip`.
# ⚙️ Configuration Checks
## Audio Configuration Check (Lambda)
[Tutorial Video](https://biometrioearth.sharepoint.com/:v:/r/sites/data-pro-and-syss-dev-daily-stand-ups-dm/Freigegebene%20Dokumente/dm/references/FDE_operations_manual/Tutorials/check_configs/check_configfiles_task1.mov?csf=1&web=1&e=3TJsNg)
*Purpose*
Generate the **config CSV** for recent audio uploads and, once reviewed by the audio expert, **move the checked raw audio** to its final location.
## Requirements
> AWS access to Lambda & S3
> Access to the SharePoint project folder
> Project: `95_KI_Nationalpark`
## Task 1 — Build config CSV for recent uploads
This uses the Lambda function **`check_configfiles`** with action `build_config_csv`.
1. **Open Lambda**
Lambda function: [**check_configfiles**](https://eu-west-1.console.aws.amazon.com/lambda/home?region=eu-west-1#/functions/check_configfiles?tab=code)
2. **Create a new Test Event**
- Click **Test** → **Create new event** → give it a descriptive name (e.g., `build_config_csv_<YYYYMMDD>`).
- Paste the payload below and **update `created_after`** with the last date we created data (see next step):
~~~json
{
"action": "build_config_csv",
"source_bucket": "be-upload",
"project_name": "95_KI_Nationalpark",
"created_after": "2025-09-16 14:08:45.509764+00"
}
~~~
3. **Find the correct `created_after` timestamp**
Go to the [SharePoint testdata](https://biometrioearth.sharepoint.com/sites/b.e/Freigegebene%20Dokumente/Forms/AllItems.aspx?id=%2Fsites%2Fb%2Ee%2FFreigegebene%20Dokumente%2Fbiometrio%2Eearth%2FProjects%2FRegion%20Europe%2FOngoing%2FBMUV%5FKI%5FLeuchtt%5F%20Nat%5FNaturlandschaften%5FNPs%2FWorkspace%2FAudio%2FTestdata&viewid=6e884511%2D434b%2D4e95%2Db733%2D8ca28ebae94f) folder and open the **latest Excel we shared**. Copy the date from that file and use it as `created_after`.
4. **Run the event**
Click **Test** to execute. Confirm **Status: Succeeded**.
5. **Locate the output CSV**
- S3 path (folder will be created if missing):
[s3://pufferfish-test/check_config/95_KI_Nationalpark/](https://eu-west-1.console.aws.amazon.com/s3/buckets/pufferfish-test?region=eu-west-1&bucketType=general&prefix=check_config/95_KI_Nationalpark/&showversions=false)
6. **Upload the CSV to SharePoint**
Upload the generated file to the same SharePoint workspace used by the audio expert (same area as the “Testdata”).
**Expected result**
- A new CSV appears under `check_config/95_KI_Nationalpark/` in the `pufferfish-test` bucket, containing configs for files uploaded **after** `created_after`.
**Notes / Tips**
- If nothing is found, verify:
- `created_after` isn’t too recent (try an earlier timestamp).
- `project_name` exactly matches: `95_KI_Nationalpark`.
- The **source bucket** is correct: `be-upload`.
- Check **CloudWatch logs** from the Lambda for any errors.
---
## Task 2 — Move audio that the expert has approved
[Tutorial Video](https://biometrioearth.sharepoint.com/:v:/r/sites/data-pro-and-syss-dev-daily-stand-ups-dm/Freigegebene%20Dokumente/dm/references/FDE_operations_manual/Tutorials/check_configs/check_configfiles_task2.mov?csf=1&web=1&e=aNe8Go)
After the audio expert reviews the CSV and gives a **cutoff**, run `move_audio_raw`.
1. **Get the cutoff**
The audio expert provides the `cutoff` value (epoch-ms string). Keep it as-is.
2. **Create a new Test Event (dry run first)**
In the same Lambda (`check_configfiles`), create a new event (e.g., `move_audio_raw_dryrun_<YYYYMMDD>`) with:
~~~json
{
"action": "move_audio_raw",
"bucket": "be-upload",
"base_prefix": "FieldData/95_KI_Nationalpark/data",
"cutoff": "1758031663165",
"dry_run": true
}
~~~
3. **Run dry run & verify**
Click **Test**. Confirm in the result/logs that the listed operations look correct (**no changes are actually made** in dry run).
If paths or counts look off, stop and clarify with the audio expert.
4. **Execute the move (real run)**
Duplicate the event, set `"dry_run": false`, and run again:
~~~json
{
"action": "move_audio_raw",
"bucket": "be-upload",
"base_prefix": "FieldData/95_KI_Nationalpark/data",
"cutoff": "1758031663165",
"dry_run": false
}
~~~
5. **Post-checks**
- Confirm success in the Lambda result and **CloudWatch logs**.
- Spot-check a few files at their new location (as defined by the Lambda logic for “moved” paths).
**Safety checks**
- Always run **dry_run = true** once before the real move.
- Never modify the `cutoff`—use exactly what the audio expert provides.
- If the move affects fewer/more files than expected, **pause** and re-confirm the cutoff and `base_prefix`.
---
### FAQ
- **Where does the config CSV go?**
`s3://pufferfish-test/check_config/95_KI_Nationalpark/` (then upload to SharePoint).
- **Where do I get `created_after`?**
From the latest Excel in the project’s SharePoint **Testdata** folder (copy the timestamp used there).
- **What if Lambda says succeeded but I don’t see a file?**
Ensure `created_after` isn’t excluding everything; try an earlier timestamp and re-run.
# Next Steps
1. Pufferfish-Client and Pufferfish AWS connection through API Gateway
2. Each project located in a bucket in the same region as the project:
- Create region buckets
- Add a column to the "Projects" table in Balam that stores the region, to be used as the bucket_name
- Change the lambdas, tochtli, and beadmex references to the be-upload bucket so they take the bucket name from the Balam "Projects" table
3. Determine data movement policy: The discussion on when a project is considered completed can be complex and will depend on several factors. Some points to consider could include:
- Value Delivery:
- Have the project's objectives been met, and has the expected value been delivered to the client?
- Have all tasks and activities agreed upon in the project scope been completed?
- Have the expected results, such as reports, analyses, or final products, been provided to the client?
- Completion of Reports:
- Have all required reports and analyses been completed and delivered to the client?
- Has the client confirmed satisfaction with the results, and are no further changes or revisions required?
- Data Selection:
- Have relevant data been selected and saved for future analyses or reference?
- Have measures been taken to ensure the integrity and availability of stored data?
- When a project has been completed, move the project's data to Glacier or Glacier Deep Archive.
- If the data needs to be accessed occasionally with retrieval times ranging from minutes to a few hours, and cost-effectiveness is a priority, data move to Glacier.
- If the data is rarely accessed and can tolerate longer retrieval times of 12 to 48 hours, and the primary focus is on minimizing storage costs, data move to Glacier Deep Archive.
# Helpful links
* Remote Sensing
* [remote sensing map](https://remotesensing.services.biometrio.earth/)
* [remote sensing api](https://remotesensing.services.biometrio.earth/api/apidoc/)
* Pufferfish repositories
* [pufferfish](https://github.com/biometrioearth/pufferfish/tree/develop)
* [beadmex](https://github.com/biometrioearth/beadmex/tree/develop)
* [tochtli](https://github.com/biometrioearth/tochtli)
* Data Intakes
* [data intakes](https://biometrioearth.sharepoint.com/:f:/r/sites/data-pro-and-syss-dev-daily-stand-ups-dm/Freigegebene%20Dokumente/dm/pufferfish/data?csf=1&web=1&e=ssanGg)
* Rules and policies
* [Ingestion](https://biometrioearth.sharepoint.com/:w:/r/sites/data-pro-and-syss-dev-daily-stand-ups-dm/_layouts/15/Doc2.aspx?action=editNew&sourcedoc=%7B0912e472-ebb0-424a-a547-dd7224dd012e%7D&wdOrigin=TEAMS-ELECTRON.teamsSdk_ns.bim&wdExp=TEAMS-CONTROL&wdhostclicktime=1717753224754&web=1)
* [storage](https://biometrioearth.sharepoint.com/:w:/r/sites/data-pro-and-syss-dev-daily-stand-ups-dm/_layouts/15/Doc2.aspx?action=editNew&sourcedoc=%7Bea7990f9-f09d-453c-9403-9fc3254f88b4%7D&wdOrigin=TEAMS-ELECTRON.teamsSdk_ns.bim&wdExp=TEAMS-CONTROL&wdhostclicktime=1718096855897&web=1)
* Projects
* [projects info](https://biometrioearth.sharepoint.com/:x:/r/sites/data-pro-and-syss-dev-daily-stand-ups-dm/_layouts/15/Doc2.aspx?action=edit&sourcedoc=%7B800cd8ef-8ece-478d-bacf-af2063cb4d7c%7D&wdOrigin=TEAMS-ELECTRON.teamsSdk_ns.bim&wdExp=TEAMS-CONTROL&wdhostclicktime=1718100314159&web=1)