Data characterization

Date

Wednesday, January 14, 2026

Notes

In class, we discussed:

  1. The previous meeting’s “in-class” API assignment. As part of this discussion, I presented some “ESS 469/569 best practices” that future submissions must follow:

    • Include a README.md within each project directory. This file should provide a user with background information as well as details about the folder and code structure.
    • Never store data in the repository.
    • Separate code from communication. Think of .md files as presentations.
    • Code must include detailed comments. If desired, use models (e.g., LLM agents) to draft the comments, then verify and edit the output yourself.
    • Figures must have captions.
    • Always provide some metadata (see below).
  2. The need to include basic metadata about the datasets that you use, such as:

    • Filename(s).
    • Original source & when accessed.
    • Data types. Give not just the data structure (e.g., raster, time series) but also the value type (e.g., integer, float, string, boolean).
    • File sizes or checksums.
    • Any attributes that a user should know about (e.g., in the case of georeferenced rasters, you may want to include some details about the file projection).
  3. Basic questions you should answer when first examining a dataset (i.e., characterization):

    • What are the units? Do the values make sense?
    • How complete are the data?
    • Are there data gaps or missing values? What form do those missing values take (e.g., NaN)?
    • Are there any duplicated data?
    • Which attributes are important to the question at hand? (This question can be challenging to answer, especially in the context of AI/ML.)
    • How are your data distributed? What are some of their fundamental statistical properties?
  4. The idea that you should develop your own visual language, in part by copying the parts of figures that you especially like.
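
As a rough illustration of items 2 and 3, the sketch below records basic file metadata (name, size, checksum) and answers a few of the characterization questions (missing values, duplicates, value types, simple statistics) for a one-dimensional list of values. It is a minimal example under stated assumptions, not a required workflow: the function names file_metadata, is_missing, and characterize are hypothetical, only the Python standard library is used, and only None and NaN are treated as missing. Real datasets often use other sentinels (e.g., -9999), which you would need to add yourself.

```python
import hashlib
import math
import os
import statistics
from collections import Counter


def file_metadata(path):
    """Return basic metadata for one file: name, size in bytes, SHA-256 checksum."""
    sha256 = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in 1 MiB chunks so large files need not fit in memory.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            sha256.update(chunk)
    return {
        "filename": os.path.basename(path),
        "size_bytes": os.path.getsize(path),
        "sha256": sha256.hexdigest(),
    }


def is_missing(value):
    """Assumption: only None and float NaN count as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def characterize(values):
    """Summarize a one-dimensional sequence of hashable values."""
    present = [v for v in values if not is_missing(v)]
    counts = Counter(present)
    summary = {
        "n_total": len(values),
        "n_missing": len(values) - len(present),
        # Number of repeat occurrences beyond the first of each value.
        "n_duplicated": sum(c - 1 for c in counts.values()),
        "value_types": sorted({type(v).__name__ for v in present}),
    }
    # Booleans are a subclass of int in Python, so exclude them explicitly.
    numeric = [
        v for v in present
        if isinstance(v, (int, float)) and not isinstance(v, bool)
    ]
    if numeric:
        summary.update(
            minimum=min(numeric),
            maximum=max(numeric),
            mean=statistics.fmean(numeric),
            median=statistics.median(numeric),
        )
    return summary


# Made-up sample values, for illustration only.
sample = [2.5, 3.1, float("nan"), 3.1, None, 4.0]
print(characterize(sample))
```

The summary does not tell you whether the values make sense; you still need to check the units and the range against what you know about the physical quantity. For tabular or gridded data, libraries such as pandas provide comparable summaries (for example, DataFrame.describe and DataFrame.duplicated).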