Data Processing

1. Data processing

# Example: load a dataset included in the package
library(data.table)    # provides fread()
library(growthTrendR)

# ring-width measurements
dt.samples <- fread(system.file("extdata", "dt.samples.csv", package = "growthTrendR"))

# format the user's data to conform to the CFS-TRenD data structure
dt.samples_trt <- CFS_format(data = list(dt.samples, 39:68), usage = 1, out.csv = NULL)
class(dt.samples_trt)
#> [1] "cfs_format"

# save it to extdata for further use
# saveRDS(dt.samples_trt, "inst/extdata/dt.samples_trt.rds")

Arguments of the CFS_format() function:

data

All information should be provided in a single file in wide format, with metadata first, followed by the ring-width measurements (in mm).

The column names for the ring-width measurements can indicate the year of measurement in one of two formats: directly as the year (e.g., 1980), or prefixed with a character (e.g., X1980).

It is highly recommended that the ring-width measurement columns are ordered by year and consecutive in the dataset, as the column indices will be used as input for the function CFS_format().

In data = list(dt.samples, 39:68), the first item is the data table and the second item gives the column indices of the ring-width measurements.
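Because the second list item is a range of column positions, it can help to derive those positions from the column names instead of counting by hand. The following is a minimal sketch in base R, assuming a hypothetical table `dt.demo` with two metadata columns followed by year-named ring-width columns. The matching logic is illustrative and is not part of growthTrendR.

```r
# Hypothetical wide-format table: metadata first, then ring widths (mm).
dt.demo <- data.frame(
  site_id = "S1",
  tree_id = "T1",
  X1990 = 1.2, X1991 = 1.1, X1992 = 0.9
)

# Match names that are a 4-digit year, optionally prefixed by one letter.
is_year <- grepl("^[A-Za-z]?[0-9]{4}$", names(dt.demo))
rw_idx <- which(is_year)

# Recover the years and confirm the columns are adjacent and in year order.
rw_years <- as.integer(sub("^[A-Za-z]", "", names(dt.demo)[rw_idx]))
stopifnot(all(diff(rw_idx) == 1), all(diff(rw_years) > 0))

# First and last column positions of the ring-width block.
range(rw_idx)
#> [1] 3 5
```

Under these assumptions, the range `min(rw_idx):max(rw_idx)` could then serve as the second item of the `data` list.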

usage

If users intend to submit their data to the CFS-TRenD online repository, set usage = 1 in the function. This will enable the function to format the data structure and perform detailed checks, including column names, geographic coordinates, species, and other requirements to conform to the CFS-TRenD collection standards.

Otherwise, use usage = 2 to perform a reduced checking procedure, which still builds the CFS-TRenD structure but skips some of the detailed validations.

out.csv

If the user wants to export the processed tables in CSV format, specify the output folder here. The default is NULL.

Note: Running the function CFS_format() is the first and mandatory step before using any other functions in the growthTrendR package. The data provided in this tutorial is already prepared to run the vignette; in practice, users may need to add or modify their own data based on the messages generated by the function.

2. Generate data report

The data report provides an overview of the tree ring data’s quality and characteristics at four levels: project, project-species, project-species-site, and project-species-radii, including the quality assessment at site and radii levels with the default parameters. More details on quality assessment will be presented in the next section.


generate_report(
  robj = dt.samples_trt,
  qa.label_data = "demo-samples ",
  data_report.reports_sel = c(1,2,3,4),
  qa.min_nseries = 5,
  scale.max_dist_km = 200,
  scale.N_nbs = 2
)

Arguments of the generate_report() function:

robj

The input for the data report is the output of the CFS_format() function, which assigns the class “cfs_format” to the resulting object.

qa.label_data

A short description of the input dataset. This text will appear in the report as the data source for the generated figures.

data_report.reports_sel

This argument specifies which levels of data summary to include in the report. Valid values are 1, 2, 3, and 4, each corresponding to one of the four available report types. In this tutorial, all four are selected with c(1,2,3,4).

output_file

This argument allows users to export the HTML-formatted report to a specified location by providing a folder and filename (e.g., “path/to/report.html”). If left as NULL (default), the report will not be saved to disk and will instead open directly in the browser for viewing.

Data summary report

This report provides an overview of the tree ring data’s quality and characteristics at four levels:

1. project: data completeness (assessment of missing or incomplete data across the whole dataset);
2. project-species: data summary (summary statistics and descriptions);
3. project-species-site: data summary tables and series graphing;
4. project-species-site-radii: correlation analysis and quality assessment.

project name: Douglas-fir retrospective monitoring

selected reports: 1, 2, 3, 4

data completeness

This table presents the completeness of each variable of the whole dataset as a percentage. A value of 0 indicates no effective data. Please carefully verify that all required data has been included in the submission.

var	pct
tr1_submission_id	100
tr1_project_name	100
tr1_description	100
tr1_year_range	100
tr1_reference	100
tr1_open_data	100
tr1_contact1	100
tr1_contact2	100
tr2_site_id	100
tr2_latitude	100
tr2_longitude	100
tr2_datasource	100
tr2_investigators	100
tr2_province_iso_code	100
tr3_tree_id	100
tr3_species	100
tr4_meas_no	0
tr4_meas_date	100
tr4_status	100
tr4_dbh_cm	100
tr4_ht_tot_m	100
tr5_sample_id	100
tr5_sample_type	100
tr5_sample_ht_m	0
tr5_sample_diameter_cm	0
tr6_radius_id	100.0
tr6_cofecha_id	0.0
tr6_ring_meas_method	100.0
tr6_crossdating_visual	100.0
tr6_crossdating_validation	100.0
tr6_age_corrected	0.0
tr6_bark_thickness_mm	0.0
tr6_radius_inside_cm	100.0
tr6_dtc_measured_mm	100.0
tr6_dtc_estimated_mm	22.2
tr6_rw_ystart	100.0
tr6_rw_yend	100.0
tr6_comments	0.0
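The completeness percentages above can be thought of as the share of non-missing values in each column. The sketch below shows one way to compute such a table in base R. The table `dt.demo` is hypothetical (column names are borrowed from the report, values are made up), and treating only NA as missing is an assumption; the package may apply additional rules.

```r
# Hypothetical table; column names borrowed from the report, values made up.
dt.demo <- data.frame(
  tr2_site_id = c("a", "b", "c", "d"),
  tr4_meas_no = NA,
  tr6_dtc_estimated_mm = c(1.5, NA, NA, NA)
)

# Share of non-missing values per column, as a percentage.
pct <- round(100 * colMeans(!is.na(dt.demo)), 1)
completeness <- data.frame(var = names(pct), pct = unname(pct))
completeness
```

Here a fully populated column yields 100, an empty column yields 0, and a column with one value out of four yields 25.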

Data Summary (Species)

This section presents key summary statistics, including spatial and temporal ranges, summary of ring width measurements, series length, etc., categorized by species.

In this dataset, there is 1 species: PSEUMEN

*Number of series that passed the test of CFS_qa() on differentiated series

**The values refer to mean ± sd (min, max)

site-level data summary

This section presents site-level data summaries, including a figure showing ring width measurements over time and a table with key statistics.

PSEUMEN on ring width measurement

This table provides a site-level summary, including the series length, raw ring width measurements, the median ring width, the median ring width of its 2 closest neighbors, and the ratio between them. This offers insights into potential outliers caused by scaling issues.

site_id	lon	lat	Nb. trees	len.series**	rw(mm)**	rw.median_tgt	rw.median_nbs	ratio_median
PSEUMEN
X003b	-123.9337	48.96656	3	30 ± 0 ( 30, 30 )	1.5 ± 0.2 ( 0.6, 3.7 )	1.38	2.52	0.55
X005c	-124.7485	49.42990	3	30 ± 0 ( 30, 30 )	4 ± 0.8 ( 1.4, 9 )	3.84	1.80	2.13
X011c	-125.5210	50.04691	3	30 ± 0 ( 30, 30 )	2 ± 0.3 ( 0.9, 3.5 )	1.97	2.15	0.92

*ratio between median of rw of the site and median of rw of its 2 nearest neighbors.

**The values refer to mean ± sd (min, max)
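As a worked example of the ratio column, the sketch below recomputes ratio_median from the two median columns reported in the table. The screening band at the end is a hypothetical illustration of how a scaling issue might be flagged; its bounds are an assumption and are not taken from the package.

```r
# Site medians copied from the site-level table.
site <- data.frame(
  site_id = c("X003b", "X005c", "X011c"),
  rw.median_tgt = c(1.38, 3.84, 1.97),
  rw.median_nbs = c(2.52, 1.80, 2.15)
)

# Ratio between the site median and the median of its nearest neighbours.
site$ratio_median <- round(site$rw.median_tgt / site$rw.median_nbs, 2)

# Hypothetical screening band: ratios far from 1 may point to a unit
# mismatch (for example 0.01 mm recorded as mm). The bounds are assumed.
site$check_scale <- site$ratio_median < 0.1 | site$ratio_median > 10
site
```

The recomputed ratios (0.55, 2.13, 0.92) match the table, and none of the three sites falls outside the assumed band.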

Correlation and quality assessment code

This table provides a summary for each series, including raw ring width measurements, autocorrelation, correlation with chronologies for both raw and differentiated data, and the quality assessment code (qa_code) derived from the CFS_qa() function. The chronologies include all series with qa_code ‘pass’.

Description of qa_code
qa_code	Description
pass	The maximum correlation occurs at lag 0
borderline	The correlation at lag 0 ranks as the second highest, and its difference from the maximum remains within a predefined threshold, categorizing it as a quasi-pass
pm1	The maximum correlation occurs at lag 1 or -1, suggesting slight misalignment.
highpeak	The maximum correlation occurs at a non-zero lag and is more than twice the second-highest value, potentially signaling an issue
fail	All other measurements that do not fit into the aforementioned categories fall under this classification.
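The category descriptions above can be read as a decision rule over the correlations between a series and a reference chronology at several lags. The following is a minimal sketch of that reading, not the CFS_qa() implementation: the function name `classify_lag`, the tolerance `border_tol`, and the order in which the rules are checked are all assumptions made for illustration.

```r
# Hypothetical helper: classify a series from its correlations with a
# reference chronology at a set of lags. Not the CFS_qa() implementation.
classify_lag <- function(cors, lags, border_tol = 0.05) {
  stopifnot(length(cors) == length(lags), 0 %in% lags)
  ord <- order(cors, decreasing = TRUE)
  best <- cors[ord[1]]
  second <- cors[ord[2]]
  best_lag <- lags[ord[1]]
  second_lag <- lags[ord[2]]

  if (best_lag == 0) return("pass")
  # Lag 0 is the runner-up and close to the maximum (assumed tolerance).
  if (second_lag == 0 && (best - second) <= border_tol) return("borderline")
  if (abs(best_lag) == 1) return("pm1")
  # Peak at another lag that is more than twice the runner-up.
  if (best > 2 * second) return("highpeak")
  "fail"
}

lags <- -3:3
classify_lag(c(0.1, 0.2, 0.3, 0.8, 0.3, 0.2, 0.1), lags)    # peak at lag 0
classify_lag(c(0.1, 0.2, 0.62, 0.60, 0.3, 0.2, 0.1), lags)  # lag 0 close second
classify_lag(c(0.1, 0.2, 0.3, 0.4, 0.8, 0.2, 0.1), lags)    # peak at lag 1
```

In this sketch the borderline rule is tested before pm1, so a series whose peak sits at lag 1 or -1 but whose lag-0 correlation is nearly as high is treated as a quasi-pass. The source does not state which rule takes precedence.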

site_id	radius_id	year from	year to	len series	raw AR1*	raw corr_mean*&	trt AR1**	trt corr_mean**&	qa_code**%
PSEUMEN
X003b	X003_101_004	1991	2020	30	0.74	0.83 ( 0 )	-0.28	0.6 ( 0 )	pass
X003b	X003_101_005	1991	2020	30	0.61	0.7 ( 0 )	-0.41	0.56 ( 0 )	pass
X003b	X003_101_008	1991	2020	30	0.83	0.85 ( 0 )	-0.40	0.67 ( 0 )	pass
X005c	X005_101_003	1991	2020	30	0.88	0.87 ( 0 )	0.07	0.72 ( 0 )	pass
X005c	X005_101_004	1991	2020	30	0.78	0.91 ( 0 )	-0.02	0.72 ( 0 )	pass
X005c	X005_101_005	1991	2020	30	0.76	0.93 ( 0 )	-0.38	0.74 ( 0 )	pass
X011c	X011_101_005	1991	2020	30	0.64	0.65 ( 0 )	-0.16	0.55 ( 0 )	pass
X011c	X011_101_007	1991	2020	30	0.66	0.75 ( 0 )	-0.17	0.5 ( 0.01 )	pass
X011c	X011_101_008	1991	2020	30	0.61	0.47 ( 0.01 )	0.00	0.51 ( 0 )	pass

*developed from raw series

**developed from differentiated series

&correlation with chronologies, the value represents correlation (p-value)

%qa_code is identified using the current data as reference dataset
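To make the AR1 and corr_mean columns more concrete, the sketch below computes a lag-1 autocorrelation and a correlation with a leave-one-out mean chronology, for both raw and transformed series. It uses synthetic data and assumes that "differentiated" means first-differenced; the package may use a different transformation and a different way of building the chronology, so this is an illustration only.

```r
set.seed(42)  # synthetic data, for illustration only
n <- 30
common <- cumsum(rnorm(n, sd = 0.3))   # shared, persistent signal
rw <- sapply(1:4, function(i) 2 + common + rnorm(n, sd = 0.2))

# Lag-1 autocorrelation of a single series.
ar1 <- function(x) cor(x[-1], x[-length(x)])

# Correlation of series i with the mean of the remaining series.
# cor.test() could be used instead to also obtain a p-value.
corr_chron <- function(m, i) cor(m[, i], rowMeans(m[, -i, drop = FALSE]))

# Assumed "differentiated" form: first differences of each series.
rw_diff <- apply(rw, 2, diff)

res <- data.frame(
  raw_AR1 = apply(rw, 2, ar1),
  raw_corr = sapply(1:4, function(i) corr_chron(rw, i)),
  trt_AR1 = apply(rw_diff, 2, ar1),
  trt_corr = sapply(1:4, function(i) corr_chron(rw_diff, i))
)
round(res, 2)
```

Differencing removes the slowly varying component shared by all series, so autocorrelation typically drops after the transformation, which is consistent with the lower trt AR1 values in the table.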