infochimps.org - help
HOWTO Schema

infochimps.org Stupid Schema Template

  • This is part of the InfiniteMonkeywrench that powers infochimps.org
  • We use Textile to make descriptions that look great in plain text and which turn into styled HTML. It’s really simple to use.
  • The stupid schema is a YAML file, which is very easy to use and work with; take the Five Minute Yaml Tutorial to get all you’ll need to know. It’s also simple to use.

Overview

The infochimps project uses an Infochimps Stupid Schema file to describe each dataset. I know, I know—the world really didn’t need another schema format. But have you looked at the other ones? They’re WAY too hard for our little chimpanzee brains. The format we use is much stupider and therefore much more powerful.

Go ahead and open this full-figured example in a new window:

No wait! Come back! They don’t have to be that big and wordy. All that’s required for us to immediately load in a dataset or collection of datasets is a unique name for the collection and its datasets, and a citation for each contributor. That’s it: every other field is optional.

Not only that, but most of them are free text. There’s eventually going to be some extra structure imposed on things like ‘contributor citations’ and ‘physical units’ for fields and ‘spatial/temporal coverage’, but until we find out what makes sense you should just type in something awesome and descriptive. (We’re figuring out something new here, and we won’t have a good idea of what’s needed until we’ve got a few thousand of these things.) So: follow the structure described below, don’t overthink things, and if there’s something extra, include it as a `note:` field.
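
As a rough sketch of how small a schema can be, the fragment below carries only the required pieces: unique names for the collection and its dataset, plus one cited contributor. The flat layout, the `contributors:` key and every value shown are assumptions made up for illustration; this page does not show how records are wrapped at the top level.

```yaml
# Hypothetical minimal dataset record. Layout is assumed; values are placeholders.
collection:   Example Word Lists        # unique name for the collection
name:         Most Common Names         # free text, in Titlecase
uniqid:       most_common_names         # unique name for the dataset
contributors:
- name:       Example Contributor       # placeholder, not a real source
  cite:       Placeholder citation for the example contributor
```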

name:

Name for This Dataset (free text) – in Titlecase

uniqid:

uniq_name_as_identifier

If no uniqid is specified, the name field will be turned into a unique identifier.

collection:

Is this one part of a collection of datasets, like the tables from the
Statistical Abstract or IMDB? Name the collection here.

Note: the collection, format, size, shape and structure fields may become
one ‘privileged tag’ field; at the moment formats and collection are freetext fields,
while size, shape and structure should just be stuffed into notes.

formats

Some of
  • xls
  • csv
  • yaml
  • xml
  • tsv (tab separated)
  • flat (fixed-width columns)

Not using MIME types but maybe we should. Feel free to qualify your entry or extend this list.

Tags

“...This dataset should appear if I search on ‘...’”: whatever fills in that blank is a tag.

Ex:

tags:  "gamma, ray, radiation, man in the moon, marigolds, botany, nuclear"
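
Putting the header fields together, one dataset record might start like the sketch below. Only the tags line comes from the example above; the name, uniqid, collection and formats values are invented for illustration, and the flat layout is an assumption.

```yaml
# Illustrative header for one dataset; all values except the tags are made up.
name:       Gamma Ray Effects on Marigolds
uniqid:     gamma_ray_effects_on_marigolds
collection: Example Botany Tables
formats:    csv, tsv (tab separated)
tags:       "gamma, ray, radiation, man in the moon, marigolds, botany, nuclear"
```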

Notes—Freeform Descriptive Notes

You can type in anything as a note.

The identifier will be titleized (_ to spaces, words Capitalized) and appear as a header (a <h2></h2>).

Inject quoted YAML as a string if you’re including structured information.

An anchor link is created for each note, so if you’d like to link back to a different note field use "link text":#note_identifier
For example, "(usage overview)":#usage
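
To make the header and anchor rules concrete, here is a sketch of two notes in which one links to the other. The enclosing `notes:` key and the note text are assumptions for illustration; the link syntax and the `#usage` anchor follow the rule above. After titleizing, `usage` would appear under the header “Usage”, and an identifier such as `how_it_was_gathered` would appear as “How It Was Gathered”.

```yaml
# Hypothetical notes block; the `notes:` key and the wording are assumptions.
notes:
- desc: |
    Daily closing prices for a handful of stocks.
    See "(usage overview)":#usage before combining years.
- usage: |
    Prices are not adjusted for splits; adjust them
    before comparing values across years.
```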

Recommended: include these if applicable

  • desc: Describe the dataset, what it’s good for, how it was prepared. A normal person should be able to read this fearlessly. Example:
          - desc: |
              21,986 names (names.txt)
    
              This database contains the most common names
              used in the United States and Great Britain.
              Spelling checkers may want to supplement their
              basic word list with this one.
    
  • usage: Anything non-obvious about how to use or apply the dataset; technical notes on how to use the data and how it was gathered. This is where all the really nerdy stuff goes. Example:

          - usage: |
              To convert dollars of any year 1665 to estimated
              2017 to dollars of the base year (CPI [1982-84],
              2005, or other years), DIVIDE that year's dollar
              amount by the conversion factor (CF) for that
              base year, rounded to no more than four decimal
              places (more cautious: prior to 1913 round to 2
              decimal places; for 1913 and later round to 3
              decimal places).
    

  • see_also: Other related datasets (use their uniqid identifier for now)
  • rights: Any rights statements attached by contributors. Use good manners and respect people’s hard work.

Other suggestions:

You can put in any old note name that suggests itself.

  • implementation_suggestion: How this file might be used or adjusted for a specific application.
  • file_structure: If it’s not YAML/XML/CSV or other standard, describe the file structure.
  • snippet: A representative slice of the raw data. Limit yourself to under a few thousand characters. (filed under notes)
  • structure: YAML-looking text, or something else if it makes more sense. Some ideas: `table: [ rows, cols ]`, `tree: [ width, depth ]`, `graph: [ nodes, edges ]`, ... If the data isn’t in a standard format (XML, YAML, JSON), estimate the size and dimensionality of the file.
  • stats:
    • value: min, max, avg, stdev, %ile
    • frequency: num_distinct, median_freq, mode
    • length: min, max, avg, stdev, %ile
  • coverage: Indicate the coverage, if applicable: spatial (Central Africa? Mars?), temporal (Pleistocene Era? 1970s?) or whatever else would seem usefully descriptive.
  • more: Date Format Type Creator Rights Publisher Identifier Source Relation Language Keywords Coverage Description Contributors URI Fragment
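
For the names list used in the `desc` example, a few of these optional notes might look like the sketch below. The `notes:` wrapper, the one-name-per-line layout, the single-column table shape and the coverage wording are all assumptions for illustration. Structured values are written as quoted YAML strings, as suggested earlier.

```yaml
# Hypothetical optional notes for the names.txt example; values are illustrative.
notes:
- file_structure: One name per line, plain text.
- structure:      "table: [ 21986, 1 ]"     # rows, cols (single column is assumed)
- coverage:       "spatial: United States and Great Britain"
- rights:         Whatever statement the contributor attached goes here.
```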

contributors

List the people/organizations who created or prepared the dataset. Include links and citations wherever possible. This gives credit where it’s deserved, and allows people to trace the provenance of the data.

As always, omit fields that don’t apply.

  • name – freetext name
  • uniqid – identifier style uniq name. Unlike other identifiers, which should be only (letters,numbers,underscore), this should be (if it makes sense to do so) in `domain.name.com/minimal/path` form (then we can do the reversed-domain-plus-path sorting thing).
  • roles
    • collected, converted, distributed, verified, translated_language
  • cite – citation (in Wikipedia citation style), if any
  • desc – Free-form description of the contribution, along with any statement by the contributor. Put information about the dataset’s content in the appropriate note; put information about the contribution, rights, &c. here.

Example:

    - name:       Yahoo Finance
      uniqid:     finance.yahoo.com
      url:        http://ichart.finance.yahoo.com/table.csv?s=AAPL&a=00&b=1&c=1900&d=11&e=31&f=2030&g=v&ignore=.csv
      role:       distributed
  
    - name:       NASDAQ site
      uniqid:     nasdaq.com
      desc:       Company Symbols to Names
      url:        http://www.nasdaq.com/asp/symbols.asp?exchange=q
	  
    - name:       Philip (flip) Kromer
      uniqid:     infochimps.org/flip
      url:        http://infochimps.org/flip
      role:       converted

    - name:       '"The Origin of Chemical Elements," by Bethe and Gamow'
      desc:       The dataset is described in this journal article
      cite:       >
        Alpher, R. A., H. Bethe and G. Gamow.
        "The Origin of Chemical Elements," Physical
        Review, 73 (1948), 803.
      

Fields

Each field is a record with one or more of the following.

As always, omit anything that doesn’t make sense or you don’t feel like filling in.

And remember that a thing can be more than one kind of thing. Buckaroo Banzai was a Rock Star, Physicist, Movie Character and (one assumes) a Taxpayer. Chicken is a type of bird (isa animal…), an agricultural commodity and a recipe ingredient. It’s more important to just simply and quickly describe things as they are than to design some straitjacket schema of perfect crystalline beauty (and corresponding brittleness and inutility).

Data Field uniqid:

UniqueNameAsIdentifier. If no uniqid is specified, the name field will be turned
into a unique identifier.

Data Field name:

Name for This Field (free text) – in Titlecase

A dataset can (and should) be as structured as you like, but right now we
only summarize a flat list of datafields. For hierarchical datasets, do
something reasonable like use a common prefix+underbar:

    fields:
    - name:      Resident population per Square mile of land area
      tags:      country numberdensity:persons-area
      units:     persons / mile^2
      datatype:  float
      uniqid:    resident_population_per_square_mile_of_land_area

Err on the side of longer, rather than shorter, datafield names—we eventually want to identify common patterns (temperature, geographic location, chemical element, etc) and structure them correspondingly.
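
The prefix-plus-underscore idea for hierarchical data could be sketched like this. The `location` grouping and its sub-fields are invented for illustration: each nested value becomes its own flat field record whose uniqid shares the `location_` prefix.

```yaml
# Hypothetical flattening of a nested "location" record into a flat field list.
fields:
- name:      Location Latitude
  uniqid:    location_latitude
  units:     degrees
  datatype:  float
- name:      Location Longitude
  uniqid:    location_longitude
  units:     degrees
  datatype:  float
- name:      Location Place Name
  uniqid:    location_place_name
  datatype:  str
```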

Data Field datatype:

One of the Kwalify datatypes:

  • str
  • int
  • float
  • number (== int or float)
  • text (== str or number)
  • bool
  • date
  • time
  • timestamp
  • scalar (all but seq and map)
  • seq
  • map
  • any (means any data)

Data Field tags:

The concept represented by this datafield, as separate from its representation.

‘distance’, ‘time’, ‘value.money’ are concepts; ‘meters’, ‘year’,
‘currency.usd.2005’ are ways to represent those concepts.

Ex:
  • “Total yearly exports in constant dollars” with “exports rate:value.money country”
  • “Frequency of search terms” with “numberdensity language.phrase internet”
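
Written as field records, the two examples above could look like the sketch below. The names and tags are taken from the examples; the datatypes are assumptions.

```yaml
# Field names and tags come from the examples above; datatypes are assumed.
fields:
- name:      Total yearly exports in constant dollars
  tags:      exports rate:value.money country
  datatype:  float
- name:      Frequency of search terms
  tags:      numberdensity language.phrase internet
  datatype:  int
```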

Data Field units:

Presentation of this concept, as a space-separated string of atomic units.

Use anything that the Frink units library (an extension of the BSD units collection) understands.

  • ‘newtons’ and 'kilogram meters / seconds^2' are the same thing.
  • If something is a percent change, specify as (thing / thing)% or (thing / thing)percent; and if something is a delta change, give the time period of that change. For example, Percent composition by mass of the earth's atmosphere would have units (kg/kg)percent; population percentage change by year would have units of (persons/persons)% / year.
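
As a sketch, the two percent examples above might be recorded as the following field records. The units strings are the ones given in the text; the datatypes are assumptions.

```yaml
# Units are copied from the examples above; datatypes are assumed.
fields:
- name:      Percent Composition by Mass of the Earth's Atmosphere
  units:     (kg/kg)percent
  datatype:  float
- name:      Population Percentage Change by Year
  units:     (persons/persons)% / year
  datatype:  float
```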

Ratings

This is too much text for something so poorly thought out right now. Here it is
anyhoo.

  • Accurate: How well (how precisely and how accurately) does this data characterize its subject? A dataset can be highly accurate but only moderately authoritative: for example, a collection of data from Wikipedia or another crowdsourced source that has been widely tested and found to be of high quality. Or it can be authoritative with poor precision: the 5000 year eclipse table is fully authoritative, but due to a slow drift in the length of a day, eclipse times for 2000 years ago have uncertainties of several hours. If a dataset lacks accuracy but estimates its standard error, count that in its favor.
  • Authoritative: What are the credentials of this dataset’s sources?

    5: Prepared by a researcher and leading expert in this field, associated with a notable institution, publishing peer-reviewed data, with clearly cited sources.

    1: Wikipedia articles that have not passed their review processes.

  • Comprehensive: How completely does this dataset describe its subject?

    5: Exhaustive characterization. Think “US Census” or IMDB.com.

    1: This dataset, though useful, contains an incomplete picture of its subject. For example, at time of writing, we have only about 35 years of stock market data with only US stocks and daily intervals.

  • Interest: How broadly interesting is this dataset?

    5: Any true data nerd will stop in the street to gaze in wonder at the opportunities present in this dataset.

    1: Most people will never find a need for this. The very few that do will be stoked you helped put it here.
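
Since the ratings are still loosely defined, the following is only a guess at how they might be recorded: one score from 1 to 5 per criterion. The `ratings:` key, the lowercase key names and the scores themselves are assumptions, not part of the format described here.

```yaml
# Hypothetical ratings block; key names and layout are assumptions.
ratings:
  accurate:      4
  authoritative: 5
  comprehensive: 5
  interest:      3
```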

Kwalify schema

  • name
  • desc
  • default
  • (class name)
  • units
  • tags
  • type:
    • scalar
      • text
      • string
      • number:
        • int
        • fixed
        • float
        • bool
      • datetime
        • date
        • time
    • seq
    • map
    • (graph)
    • (tree)
    • any
  • constraints:
    • required
    • length min max
    • value min max
    • pattern
    • enum
    • accuracy
    • unique
  • stats:
    • value: min, max, avg, stdev, %ile
    • frequency: num_distinct, median_freq, mode
    • length: min, max, avg, stdev, %ile
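
For orientation, here is a minimal sketch of how a list of field records might be checked with a Kwalify schema file. It uses Kwalify’s standard keywords (`type`, `sequence`, `mapping`, `enum`, `pattern`, `unique`). The choice of which keys to constrain and the identifier pattern are assumptions drawn from the descriptions above, not the project’s actual schema.

```yaml
# Hypothetical Kwalify schema for a list of field records; not the project's real schema.
type: seq
sequence:
  - type: map
    mapping:
      "name":
        type: str
      "uniqid":
        type: str
        pattern: /^[A-Za-z0-9_]+$/   # letters, numbers, underscore
        unique: yes
      "datatype":
        type: str
        enum: [str, int, float, number, text, bool, date, time, timestamp, scalar, seq, map, any]
      "units":
        type: str
      "tags":
        type: str
```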

How to Upload

When you’re ready to upload either data or schema files, visit HOWTO Upload.