NAME
    GHCN::StationTable - collect station objects and weather data

SYNOPSIS
      use GHCN::StationTable;

      my $ghcn = GHCN::StationTable->new;

      my ($opt, @errors) = $ghcn->set_options(
        user_options => {
            country     => 'US',
            state       => 'NY',
            location    => 'New York',
            report      => 'yearly',
        },
      );
      die @errors if @errors;

      $ghcn->load_stations;

      # generate a list of the stations that were selected
      say $ghcn->get_stations( kept => 1 );

      if ($opt->report) {
          say $ghcn->get_header;

          $ghcn->load_data();
          $ghcn->summarize_data;

          say $ghcn->get_summary_data;
          say $ghcn->get_footer;
      }

DESCRIPTION
    The GHCN::StationTable module provides a class that is used to fetch
    station information from the NOAA Global Historical Climatology Network
    database, along with temperature and/or precipitation records from the
    daily historical records.

    For a more comprehensive example than the above Synopsis, see the
    section EXAMPLE PROGRAM.

    Caveat emptor: incompatible interface changes may occur on releases
    prior to v1.00.000. (See VERSIONING and COMPATIBILITY.)

    The module is primarily for use by the GHCN::Fetch module.

FIELD ACCESSORS
    opt_obj
        Returns a reference to the Options object created by set_options.

    opt_href
        Returns a reference to a hash of the Options created by set_options.

    config_file
        Returns the name of the configuration file, if one was passed to
        set_options.

    config_href
        Returns a reference to a hash containing the configuration options
        set by set_options (if any).

    stn_count
        Returns a count of the total number of stations found in the station
        list.

    stn_selected_count
        Returns a count of the number of stations that were selected for
        processing.

    stn_filtered_count
        Returns a count of the number of stations that were selected for
        processing, excluding those rejected due to errors or other
        criteria.

    missing_href
        Returns a reference to a hash of the missing months and days for
        the selected data.

METHODS
  new ()
    Create a new StationTable object.

  export_kml( list => 0 )
    Output the coordinates of the station collection as a KML file, for
    import into Google Earth as placemarks. The active range of each station
    will be included as timespans so that you can view the placemarks across
    time.

    argument: list
        If the argument list contains the 'list' keyword and a true value,
        then export_kml will return a string with the kml output as lines of
        text rather than writing it to the file specified by the kml option.

    option: kml <filespec>
        Write the kml output to the file designated by <filespec>. If
        <filespec> is an empty string, no file is written.

    option: color <str>
        A color name, one of blue, green, azure, purple, red, white or
        yellow. Only the first character is recognized, so 'b' and 'bob'
        both result in blue. All colors are given an opacity of 50 (the
        range is 00 to ff).
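
    The first-character rule for the color option can be pictured with a
    minimal sketch. This is not the module's own code; the helper name
    color_from_option is hypothetical, and treating the match as
    case-insensitive is an assumption.

    ```perl
    use strict;
    use warnings;

    # Hypothetical lookup table: first letter of the option => color name.
    my %COLOR_FOR = (
        b => 'blue',  g => 'green', a => 'azure', p => 'purple',
        r => 'red',   w => 'white', y => 'yellow',
    );

    # Hypothetical helper, not part of GHCN::StationTable.
    sub color_from_option {
        my ($str) = @_;
        return unless defined $str && length $str;
        # Assumption: only the first character matters, in either case.
        return $COLOR_FOR{ lc( substr( $str, 0, 1 ) ) };
    }

    print color_from_option('bob'), "\n";   # blue
    ```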

  flag_counts ()
    The load_stations() and load_data() methods may reject a station or a
    particular data entry due to quality or other issues. These decisions
    are kept in a hash field, and a reference to that hash is returned by
    this method. The caller can then report the values.

  get_flag_statistics ( list => 0, no_header => 0 )
    Gets a header row and summary table of data points that were kept and
    rejected, along with counts of QFLAGS (quality flags). Returns
    tab-separated text, or a list if the list argument is true. A heading
    line is provided unless no_header is true.

    argument: list => <bool>
        If the arguments include the 'list' keyword and a true value, then a
        list is returned rather than tab-separated lines of text. Defaults
        to false.

    argument: no_header => <bool>
        If the arguments include the 'no_header' keyword and a true value,
        then the return value will not include a header line. Default is
        false.

  get_footer( list => 0 )
    Get a footing section with explanatory notes about the output data
    produced by detail and summary reports.

    argument: list => <bool>
        If the arguments include the 'list' keyword and a true value, then a
        list is returned rather than tab-separated lines of text. Defaults
        to false.

  get_hash_stats ( list => 0, no_header => 0 )
    Gets the hash sizes collected during the execution of StationTable
    methods, notably load_stations and load_data, as tab-separated lines of
    text.

    argument: list => <bool>
        If the arguments include the 'list' keyword and a true value, then a
        list is returned rather than tab-separated lines of text. Defaults
        to false.

    argument: no_header => <bool>
        If the arguments include the 'no_header' keyword and a true value,
        then the return value will not include a header line. Default is
        false.

  get_header ( list => 0 )
    The weather data obtained by the load_data() method is essentially a
    table. Which columns are returned depends on various options. For
    example, if report => monthly is given, then the key columns will be
    year and month -- no day. If the precip option is given, then extra
    columns are included for precipitation values.

    This variability makes it difficult for a consumer of these modules to
    emit a heading that matches the underlying columns. The purpose of this
    method is to return a set of column headings that will match the data.
    The value returned is a tab-separated string.

    argument: list => <bool>
        If the arguments include the 'list' keyword and a true value, then a
        list is returned rather than tab-separated lines of text. Defaults
        to false.

  get_missing_data_ranges( list => 0, no_header => 0 )
    Gets a list, by station id and year, of any months or day ranges when
    data was found to be missing. Missing data can lead to incorrect
    interpretation and can cause a station to be rejected if the percent of
    found data does not meet the -quality threshold (normally 90%).

    Returns a heading line followed by lines of tab-separated strings.

    argument: list => <bool>
        If the arguments include the 'list' keyword and a true value, then a
        list of lists (stations containing years) is returned rather than
        tab-separated lines of text. Defaults to false.

    argument: no_header => <bool>
        If the arguments include the 'no_header' keyword and a true value,
        then the return value will not include a header line. Default is
        false.

    option: report <daily|monthly|yearly|id>
        Determines the number and content of heading values.

  datarow_as_hash ( $row_aref )
    This is a convenience method that may be used to convert table rows
    returned by the row_sub callback subroutine of load_data from a perl
    list into a hash. It automatically calls get_header to get the headers
    for the table data. When you pass it a reference to a data row (obtained
    via the row_sub callback routine given to load_data) it combines the
    elements of the data row list with the column headings and returns a
    hash.
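
    The pairing of headings with row elements can be sketched with a hash
    slice. This illustrates the idea only and is not the module's
    implementation; the helper name row_to_hash and the column names and
    values are made up for the example (the real headings come from
    get_header and vary with the report option).

    ```perl
    use strict;
    use warnings;

    # Hypothetical helper: pair column headings with the elements of one row.
    sub row_to_hash {
        my ($headings_aref, $row_aref) = @_;
        my %row;
        @row{ @$headings_aref } = @$row_aref;   # hash slice assignment
        return %row;
    }

    # Made-up headings and values, for illustration only.
    my @headings = qw( Year Month Day TMAX TMIN );
    my @row      = ( 2020, 7, 1, 28.3, 17.1 );

    my %h = row_to_hash( \@headings, \@row );
    print "$h{Year}-$h{Month}-$h{Day}: max $h{TMAX}, min $h{TMIN}\n";
    ```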

  get_missing_rows( list => 0 )
    In support of a -nogaps option, to generate detail output that does not
    have any gaps due to missing data, this method gets a list of rows for
    the months and days that had missing data for a given station id in a
    given year.

    Returns lines of tab-separated strings.

    argument: list => <bool>
        If the arguments include the 'list' keyword and a true value, then a
        list is returned rather than tab-separated lines of text. Defaults
        to false.

    option: nogaps
        Emits extra rows after the detail data rows to make up for missing
        months or days. This is primarily so that if the data is charted by
        date, then the x-axis will have all the dates from start to finish.
        Otherwise, the chart and any trends that are projected on it will be
        distorted by the missing data.

  get_options ( list => 0, no_header => 0 )
    Get text which shows the options that were in effect for this processing
    run, in a Getopt style. Includes a heading and a footing with
    explanatory notes. If argument 'list' is true, returns the lines as a
    list. Line [1] contains the options string.

    argument: list => <bool>
        If the arguments include the 'list' keyword and a true value, then a
        list is returned rather than tab-separated lines of text. Defaults
        to false.

    argument: no_header => <bool>
        If the arguments include the 'no_header' keyword and a true value,
        then the return value will not include a header line or the
        explanatory footing notes. Default is false.

  get_stations ( list => 0, kept => 1, no_header => 0 )
    Return lines of text with tab-separated columns describing each of the
    stations that were found to meet the filtering criteria specified in
    the user options.

    argument: kept => <bool>
        If the argument kept => 0 is specified, and load_data has already
        been invoked, then the stations which were rejected due to quality
        flags or missing data will be returned. If kept => 1 is specified,
        then the stations that were kept will be returned.

    argument: list => <bool>
        If the arguments include the 'list' keyword and a true value, then a
        list is returned rather than tab-separated lines of text. Defaults
        to false.

    argument: no_header => <bool>
        If the arguments include the 'no_header' keyword and a true value,
        then the return value will not include a header line. Default is
        false.

  get_station_note_list ()
    Return a list consisting of tab-separated code/description pairs that
    rejected stations were flagged with; i.e. the reasons for their
    rejection.

  get_summary_data ( list => 0 )
    Gets a list of the summarized temperature or precipitation data by day,
    month or year depending on the report option.

    Returns undef if the report option is 'id'.

    The actual columns that are returned are dictated by the report option
    and by the tavg and precip options provided when the object was
    instantiated by new().

    argument: list => <bool>
        If the arguments include the 'list' keyword and a true value, then a
        list is returned rather than tab-separated lines of text. Defaults
        to false.

    option: report <daily|monthly|yearly>
        Determines the level of summarization.

    option: range <rangelist>
        If the range option is provided, the output rows are restricted to
        those years that are within the specified range(s).

  get_timing_stats ( list => 0 )
    Get a list of the timers, with durations and notes, in alphabetical
    order by timer label.

    argument: list => <bool>
        If the arguments include the 'list' keyword and a true value, then a
        list is returned rather than tab-separated lines of text. Defaults
        to false.

  has_missing_data ()
    Returns true if any missing data was detected amongst the stations that
    were processed. The calling script can use this to decide whether to
    issue a warning to the user. A list of missing data specifics can be
    sent to the output by calling method get_missing_data_ranges.

  load_data ( progress_sub => undef, row_sub => sub { say @_ } )
    Load the daily weather data for each of the stations that were
    loaded into the collection. Print the data if option report id is given.
    Otherwise cache the data so it can be aggregated at a later step.

    argument: progress_sub => undef
        As fetching and parsing each daily data page can take some time, an
        optional callback hook is provided so the caller can emit a progress
        message before each station's data is loaded; e.g. progress_sub =>
        sub { say {*STDERR} @_ }.

    argument: row_sub => sub { say @_ }
        Optional callback hook to allow the caller to provide their own
        subroutine for printing (or collecting in a list, or both) the
        row-level station data that is fetched when the report option is
        'id'. Defaults to printing via the 'say' operator.

    option: report <id|daily|monthly|yearly>
        When report id is specified, the weather data for each station is
        printed immediately (via the row_sub callback hook).

        For all other report options, the data is fetched from each station
        and kept in a cache so that it can be aggregated by invoking
        summarize_data(). The row_sub hook is not invoked.

  load_stations ()
    Read the GHCN stations list and the stations inventory list and create a
    hash of Station objects, keyed on station id, filtered according to the
    options provided in set_options().

    Returns a hash of GHCN::Station objects, keyed on station id.

    option: country <str>
        Selects only those stations that match the 2-character GEC
        (formerly FIPS) country code or that uniquely match the name or
        partial name given in <str>.

    option: state <code>
        Selects only those stations that match a US state or Canadian
        province code.

    option: location <str>
        Selects only those stations with a name that matches the specified
        pattern, which can be either a station id, or a comma-separated list
        of station id's, or a regex. If a regex, then it is anchored on the
        left and whitespace is NOT ignored.

    option: gps <latitude,longitude>
        This option selects stations within a certain radius of the
        designated latitude and longitude, expressed as positive and
        negative numbers (not using N, S, W, E designators).

    option: radius <int>
        In conjunction with the gps option, determines the radius in kilometers
        for the search area. Defaults to 25 km.

    option: gsn
        Select only GCOS Surface Network stations, which is a baseline
        network comprising a subset of about 1000 stations chosen mainly to
        give a fairly uniform spatial coverage from places where there is a
        good length and quality of data record. See
        <https://www.ncdc.noaa.gov/gosic/global-climate-observing-system-gcos/gcos-surface-network-gsn-program-overview>.
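
    The gps and radius options together describe a circular search area.
    How the module measures distance is not stated here; the sketch below
    assumes a great-circle (haversine) distance on a spherical Earth of
    radius 6371 km. The names haversine_km and within_radius are
    hypothetical.

    ```perl
    use strict;
    use warnings;

    use constant PI       => 4 * atan2( 1, 1 );
    use constant EARTH_KM => 6371;    # assumed mean Earth radius in km

    sub deg2rad { return $_[0] * PI / 180 }

    # Great-circle distance in km between two points given in signed
    # decimal degrees (negative for S and W).
    sub haversine_km {
        my ($lat1, $lon1, $lat2, $lon2) = @_;
        my $dlat = deg2rad( $lat2 - $lat1 );
        my $dlon = deg2rad( $lon2 - $lon1 );
        my $h = sin( $dlat / 2 )**2
              + cos( deg2rad($lat1) ) * cos( deg2rad($lat2) )
              * sin( $dlon / 2 )**2;
        return EARTH_KM * 2 * atan2( sqrt($h), sqrt( 1 - $h ) );
    }

    # True if a station lies within $radius_km of a "lat,long" string.
    sub within_radius {
        my ($gps, $radius_km, $stn_lat, $stn_lon) = @_;
        my ($lat, $lon) = split /,/, $gps;
        return haversine_km( $lat, $lon, $stn_lat, $stn_lon ) <= $radius_km;
    }

    print within_radius( '40.7,-74.0', 25, 40.8, -74.1 ) ? "keep\n" : "skip\n";
    ```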

  ($opt, @errors) = set_options ( %args )
    Set various options for this StationTable instance. These options will
    affect the processing and output by subsequent method calls.

    Returns an Option object and a list of errors. It is advised that you
    check @errors after calling set_options and cease processing if it is
    not empty; e.g. *die @errors if @errors*.

    You may want to set up a file-scoped lexical variable to hold the
    options object. That way it is accessible throughout your code. The
    typical calling pattern would look like this:

        my $Opt;  # a file-scope lexical

        sub run {
            my @args = @_;    # typically called as run(@ARGV)
            my $ghcn = GHCN::StationTable->new;

            my @errors;
            ($Opt, @errors) = $ghcn->set_options(...);
            die @errors if @errors;
            ...
        }

    user_options => \%user_options
        This optional argument provides a reference to a hash that contains
        a set of options that will control the filtering, processing and
        output of the GHCN modules. This hash is typically created by the
        caller using Getopt::Long.

        The options provided can be any subset of the supported options. Any
        option not provided will be added with an appropriate default value.
        The resulting combined option collection will be available both as
        a hash reference and as a Hash::Wrap object reference, via the
        instance methods opt_href and opt_obj.

        If empty or undef, a list of all stations in the GHCN database will
        be generated, so it's best to provide at least some country or
        station id filtering. Filtering is absolutely necessary in order
        to produce other output such as daily or monthly weather data (by
        specifying -report).

        See USER OPTIONS for a list of the options available.

    config_file => $config_filespec
        This optional argument specifies a file which will be used to set
        the configuration options. The file must contain YAML specifications
        that describe the hash structure defined in section CONFIGURATION
        OPTIONS.

        This option is an alternative to config_options. (If both options
        are specified, then config_options will take precedence.)

        If config_filespec is an empty string, then the filespec will
        default to $HOME/ghcn_fetch.yaml (%UserProfile%\ghcn_fetch.yaml on
        Windows).

        If config_filespec is undef, then an empty configuration will be
        used; i.e. there will be no cache and no aliases.

    config_options => \%config_options
        This optional argument is a reference to a hash containing
        configuration options as described in section CONFIGURATION OPTIONS.
        Alternatively, config_file can be used to specify a file containing
        the configuration specification in YAML format.

    stnid_filter => \%stnid_filter
        This optional argument should be a reference to a hash whose keys
        are the specific station id's which are to be fetched and processed.
        When this is used, many of the filtering options given in
        user_options (e.g. -country) will be overridden.

    timing_stats => $TimingStats_obj
        This optional argument should point to a TimingStats object that was
        created by the caller and will be used to collect timing statistics.

    hash_stats => \%hash_stats
        This optional argument should be a reference to a hash that was
        created by the caller and will be used to collect performance and
        memory statistics.

    return_list => <bool>
        By default, get methods return a tab-separated string of results. If
        return_list is set to true, then these methods will return a list
        (or list of lists).
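
    When a get method returns its default tab-separated string, a caller
    can split it into rows and columns. The sketch below does this on a
    made-up sample string. It assumes that rows are separated by newlines,
    which this section does not state explicitly; the helper name
    rows_from_text and the column headings and station names are
    hypothetical.

    ```perl
    use strict;
    use warnings;

    # Hypothetical helper: turn tab-separated lines into a list of array refs.
    sub rows_from_text {
        my ($text) = @_;
        # The -1 limit keeps trailing empty columns.
        return map { [ split /\t/, $_, -1 ] } split /\n/, $text;
    }

    # Made-up sample text standing in for the output of a get method.
    my $text = join "\n",
        join( "\t", qw( StationId Name ) ),
        join( "\t", 'CA006106000', 'EXAMPLE STATION A' ),
        join( "\t", 'CA006105976', 'EXAMPLE STATION B' );

    my ($header, @rows) = rows_from_text($text);
    print scalar(@rows), " data rows, first id: $rows[0][0]\n";
    ```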

  summarize_data ()
    Aggregate the daily weather data for the stations that were loaded,
    according to the report option.

    option: report <id|daily|monthly|yearly>
        When the report option is 'id', no summarization is needed and the
        method immediately returns undef.

  tstats ()
    Provides access to the TimingStats object so the caller can start and
    stop script-level timers.

  DOES
    Defined by Object::Pad. Included for Pod::Coverage.

  META
    Defined by Object::Pad. Included for Pod::Coverage.

EXAMPLE PROGRAM
      use v5.24;    # enables say and postfix dereferencing
      use GHCN::StationTable;

      my $ghcn = GHCN::StationTable->new;

      my ($opt, @errors) = $ghcn->set_options(
        user_options => {
            country     => 'US',
            state       => 'NY',
            location    => 'New York',
            active      => '2000-2022',
            report      => 'yearly',
            nonetwork   => -1,      # refresh cache if stale this year
        },
        config_options => {
            cache => {
                root => 'c:/ghcn_cache',
                namespace => 'ghcn',
            },
        },
      );

      die @errors if @errors;

      $ghcn->load_stations;

      my @rows;
      if ($opt->report) {
          say $ghcn->get_header;

          # this also prints detailed station data if $opt->report eq 'id'
          $ghcn->load_data(
            # set a callback routine for printing progress messages
            progress_sub => sub { say {*STDERR} @_ },
            # set a callback routine for capturing data rows when report => 'id'
            row_sub      => sub { push @rows, $_[0] },
          );

          # these only do something when $opt->report ne 'id'
          $ghcn->summarize_data;
          say $ghcn->get_summary_data;

          say '';
          say $ghcn->get_footer;

          say '';
          say $ghcn->get_flag_statistics;
      }

      # print data rows collected by row_sub callback (when report => 'id')
      foreach my $row_aref (@rows) {
          say join "\t", $row_aref->@*;
      }

      say '';
      say $ghcn->get_stations( kept => 1 );

      say '';
      say 'Stations that failed to meet range or quality criteria:';
      say $ghcn->get_stations( kept => 0, no_header => 1 );

      if ( $ghcn->has_missing_data ) {
          warn "*W* some data was missing for the stations and date range processed\n";
          say '';
          say $ghcn->get_missing_data_ranges;
      }

      say $ghcn->get_options;

      say $ghcn->get_timing_stats;

      say $ghcn->get_hash_stats;

      $ghcn->export_kml if $opt->kml;

CONFIGURATION OPTIONS
    StationTable supports two kinds of options: user and configuration. The
    main difference between the two is that configuration options are more
    suited to persistence; i.e. you'll most likely put them in a file that
    is used at every execution of StationTable.

  Cache
    Cache options are used internally by StationTable when it calls
    URI::Fetch to get pages of data from the GHCN web repository.

    root
        This defines a path to a folder which will be used to cache web
        pages. See the nonetwork user option for ways to control caching.

    namespace
        This defines the subfolder of root within which the cache files will
        reside.

  Aliases
    Aliases are a convenience feature that allows you to define mnemonic
    shortcuts for specific stations. GHCN station id's (like CA006106000)
    are difficult to remember and type, as are GHCN station names.
    Frequently-used station id's can be given easier alias names that can be
    used in the -location option for precise and reliable data retrieval.

    The entries within the aliases hash are simply keyword/value pairs that
    represent the mnemonic alias name and the station id (or id's) that are
    to be retrieved when that alias is used in -location.
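
    Conceptually, alias expansion is a hash lookup followed by a split on
    commas. The sketch below illustrates this using the alias entries from
    the YAML example in this section. The helper name expand_location is
    hypothetical, and passing unknown names through unchanged is an
    assumption rather than documented behaviour.

    ```perl
    use strict;
    use warnings;

    my %aliases = (
        yow => 'CA006106000,CA006106001',
        cda => 'CA006105976,CA006105978',
    );

    # Hypothetical helper: return the station ids for an alias, or the
    # original value if it is not a known alias. Call in list context.
    sub expand_location {
        my ($location) = @_;
        return exists $aliases{$location}
            ? split( /,/, $aliases{$location} )
            : ($location);
    }

    print join( ' ', expand_location('yow') ), "\n";
    ```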

  YAML Example
    This is what the YAML content for a typical configuration file would look
    like:

        ---
        cache:
            root: C:/ghcn_cache_new
            namespace: ghcn_new

        aliases:
            yow: CA006106000,CA006106001    # Ottawa airport
            cda: CA006105976,CA006105978    # Ottawa (CDA and CDA RCS)

  Hash Example
    Here's what the typical config file would look like as a perl hash
    structure:

        config_options => {
            cache => {
                root        => 'C:/ghcn_cache_new',
                namespace   => 'ghcn_new',
            },
            aliases => {
                yow => 'CA006106000,CA006106001',    # Ottawa airport
                cda => 'CA006105976,CA006105978',    # Ottawa (CDA and CDA RCS)
            }
        }

USER OPTIONS
    See ghcn_fetch.pl -help for a list of all user options in Getopt::Long
    format. Simply translate to a hash key/value pair. For example, -report
    id becomes report => 'id'.
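
    A minimal sketch of this translation, using Getopt::Long to collect
    command-line style options into a hash that could then be passed as
    user_options. The option specs shown ('country=s' and so on) are
    assumptions made for the example, not the actual specs used by
    ghcn_fetch.pl.

    ```perl
    use strict;
    use warnings;
    use Getopt::Long qw( GetOptionsFromArray );

    # Simulated command line, so the example does not depend on @ARGV.
    my @args = qw( -country US -state NY -report id );

    # Assumed option specs, for illustration only.
    my %user_options;
    GetOptionsFromArray( \@args, \%user_options,
        'country=s', 'state=s', 'report=s' )
        or die "invalid options\n";

    print "$_ => '$user_options{$_}'\n" for sort keys %user_options;
    ```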

VERSIONING and COMPATIBILITY
    The version number scheme used for this module consists of a 3-part
    dot-delimited string such as v0.22.365. This format was chosen for
    compatibility with Dist::Zilla version support, so that all modules in
    GHCN will get the same version number upon release. See also
    <https://metacpan.org/pod/version>.

    The first part of the string is the major release number. With the
    exception of v0 releases, which should be considered experimental
    pre-production versions, the interface is intended to be upward
    compatible within a set of releases sharing the same major release
    number. If an incompatible change becomes necessary, the major release
    number will be incremented.

    The other two parts are essentially the date of the release, in the
    format YY.DDD where YY is the year of the century and DDD is the day
    number within the year.
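
    As an illustration of the scheme, the following sketch splits a version
    string into its three parts. The helper name parse_version is
    hypothetical and is not provided by this module.

    ```perl
    use strict;
    use warnings;

    # Hypothetical helper: split vMAJOR.YY.DDD into named parts.
    # Returns an empty list if the string does not match the scheme.
    sub parse_version {
        my ($v) = @_;
        my ($major, $yy, $ddd) = $v =~ m{ \A v (\d+) [.] (\d+) [.] (\d+) \z }x
            or return;
        return ( major => $major, year => $yy, day_of_year => $ddd );
    }

    my %p = parse_version('v0.22.365');
    print "major $p{major}, year $p{year}, day $p{day_of_year}\n";
    ```

    For v0.22.365 this yields major release 0, year 22 and day 365.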

AUTHOR
    Gary Puckering (jgpuckering@rogers.com)

LICENSE AND COPYRIGHT
    Copyright 2022, Gary Puckering

