The infochimps project uses an Infochimps Stupid Schema file to describe each dataset. I know, I know—the world really didn’t need another schema format. But have you looked at the other ones? They’re WAY too hard for our little chimpanzee brains. The format we use is much stupider and therefore much more powerful.
Go ahead and open this full-figured example in a new window:
No wait! Come back! They don’t have to be that big and wordy; all that’s required for us to immediately load in a dataset or collection of datasets is a unique name for the collection and its datasets, and a citation for each contributor. That’s it: every other field is optional.
Not only that, but most of them are free text. There’s eventually going to be some extra structure imposed on things like ‘contributor citations’, ‘physical units’ for fields, and ‘spatial/temporal coverage’, but until we find out what makes sense you should just type in something awesome and descriptive. (We’re figuring out something new here, and we won’t have a good idea of what’s needed until we’ve got a few thousand of these things.) So: follow the structure described below, don’t overthink things, and if there’s something extra, include it as a `note:` field.
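As a rough illustration of how little is required, here is a minimal sketch for a single dataset. Every value is a placeholder, and the `contributors` key and the placement of `collection` next to the dataset's own name are assumptions, not confirmed parts of the format; `name`, `uniqid` and `cite` follow the examples elsewhere on this page.

```yaml
# Minimal sketch only: all values are made-up placeholders.
name: Example Rainfall Table            # free text, in Titlecase
uniqid: example_rainfall_table          # unique name for this dataset
collection: Example Almanac             # unique name for the collection
contributors:                           # ASSUMED key name for the contributor list
  - name: Example Weather Bureau        # hypothetical contributor
    cite: >
      Placeholder citation text for the
      Example Weather Bureau.
```

Everything else (tags, notes, fields, ratings) can be left out until you feel like adding it.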
name: Name for This Dataset (free text) – in Titlecase
uniqid: uniq_name_as_identifier. If no uniqid is specified, the name will be turned into a unique identifier.
Is this one part of a collection of datasets, like the tables from the
Statistical Abstract or IMDB? Name the collection here.
Note: the collection, format, size, shape and structure fields may become
one ‘privileged tag’ field; at the moment formats and collection are free-text fields,
while size, shape and structure should just be stuffed into notes.
“...This dataset should appear if I search on ‘...’” <- those are tags
Ex:
tags: "gamma, ray, radiation, man in the moon, marigolds, botany, nuclear"
You can type in anything as a note.
The identifier will be titleized (_ to spaces, words Capitalized) and appear as a header (a <h2></h2>).
Inject quoted YAML as a string if you’re including structured information.
An anchor link is created for each note, so if you’d like to link back to a different note field use
"link text":#note_identifier
For example,
"(usage overview)":#usage
- desc: |
    21,986 names (names.txt)
    This database contains the most common names
    used in the United States and Great Britain.
    Spelling checkers may want to supplement their
    basic word list with this one.
Anything non-obvious about how to use or apply the dataset; technical notes on how to use the data and how it was gathered. This is where all the really nerdy stuff goes.
Example:
- usage: |
    To convert dollars of any year 1665 to estimated
    2017 to dollars of the base year (CPI [1982-84],
    2005, or other years), DIVIDE that year's dollar
    amount by the conversion factor (CF) for that
    base year, rounded to no more than four decimal
    places (more cautious: prior to 1913 round to 2
    decimal places; for 1913 and later round to 3
    decimal places).
You can put in any old note name that suggests itself.
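Putting the note conventions above together, a sketch might look like the following. The note text is placeholder content, and `caveats` is a made-up note name chosen only to show that any identifier works; the link in the first note uses the anchor syntax described above to point at the `usage` note.

```yaml
# Sketch only: placeholder text, and "caveats" is a hypothetical note name.
- desc: |
    A short description of what the dataset contains.
    See the "(usage overview)":#usage note for how to apply it.
- usage: |
    The nerdy details: how the data was gathered and
    how to use it.
- caveats: |
    Anything else worth knowing. Going by the titleizing rule,
    this identifier would appear under the header "Caveats".
```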
List the people/organizations who created or prepared the dataset. Include links and citations wherever possible. This gives credit where it’s deserved, and allows people to trace the provenance of the data.
As always, omit fields that don’t apply.
Note any information about the contribution, rights, &c. here.
- name: Yahoo Finance
  uniqid: finance.yahoo.com
  url: http://ichart.finance.yahoo.com/table.csv?s=AAPL&a=00&b=1&c=1900&d=11&e=31&f=2030&g=v&ignore=.csv
  role: distributed
- name: NASDAQ site
  uniqid: nasdaq.com
  desc: Company Symbols to Names
  url: http://www.nasdaq.com/asp/symbols.asp?exchange=q
- name: Philip (flip) Kromer
  uniqid: infochimps.org/flip
  url: http://infochimps.org/flip
  role: converted
- name: '"The Origin of Chemical Elements," by Bethe and Gamow'
  desc: The dataset is described in this journal article
  cite: >
    'Alpher, R. A., H. Bethe and G. Gamow.
    "The Origin of Chemical Elements," Physical
    Review, 73 (1948), 803.'
Each field is a record with one or more of the following.
As always, omit anything that doesn’t make sense or you don’t feel like filling in.
And remember that a thing can be more than one kind of thing. Buckaroo Banzai was a Rock Star, Physicist, Movie Character and (one assumes) a Taxpayer. Chicken is a type of bird (isa animal…), an agricultural commodity and a recipe ingredient. It’s more important to just simply and quickly describe things as they are than to design some straightjacket schema of perfect crystalline beauty (and corresponding brittleness and inutility).
uniqid: UniqueNameAsIdentifier. If no uniqid is specified, the field name will be turned into a unique identifier.
name: Name for This Field (free text) – in Titlecase
A dataset can (and should) be as structured as you like, but right now we
only summarize a flat list of datafields. For hierarchical datasets, do
something reasonable, like using a common prefix+underbar.
Example:
fields:
  - name: Resident population per Square mile of land area
    tags: country numberdensity:persons-area
    units: persons / mile^2
    datatype: float
    uniqid: resident_population_per_square_mile_of_land_area
Err on the side of longer, rather than shorter, datafield names—we eventually want to identify common patterns (temperature, geographic location, chemical element, etc) and structure them correspondingly.
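The common-prefix convention for hierarchical data mentioned above could be sketched like this. The field names are hypothetical: they flatten an imagined nested "mailing address" record into three datafields that share the `mailing_address_` prefix.

```yaml
# Sketch only: hypothetical fields flattened from a nested "mailing address" record.
fields:
  - name: Mailing Address Street
    uniqid: mailing_address_street
    datatype: str
  - name: Mailing Address City
    uniqid: mailing_address_city
    datatype: str
  - name: Mailing Address Postal Code
    uniqid: mailing_address_postal_code
    datatype: str
```

The shared prefix keeps related fields recognizable as a group even though the summary itself is flat.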
One of the kwalify
datatypes:
The concept represented by this datafield, as separate from its representation.
‘distance’, ‘time’, ‘value.money’ are concepts; ‘meters’, ‘year’,
‘currency.usd.2005’ are ways to represent those concepts.
Presentation of this concept, as a space-separated string of atomic units.
Use anything that the Frink units library (an extension of the BSD units collection) understands.
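To make the split between concept and representation concrete, here is a sketch of one datafield. The field itself is hypothetical and the `concept` key name is an assumption; the `units` and `datatype` keys, and the ‘value.money’ and ‘currency.usd.2005’ values, come from the text above.

```yaml
# Sketch only: hypothetical field; "concept" is an ASSUMED key name.
fields:
  - name: Closing Price in 2005 US Dollars
    uniqid: closing_price_in_2005_us_dollars
    datatype: float
    concept: value.money          # what the number means
    units: currency.usd.2005      # how that meaning is represented
```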
This is too much text for something so poorly thought out right now. Here it is
anyhoo.
5: Prepared by a researcher and leading expert in this field, associated with a notable institution, publishing peer-reviewed data, with clearly cited sources.
1: Wikipedia articles that have not passed their review processes.
5: Exhaustive characterization. Think “US Census” or IMDB.com.
1: This dataset, though useful, contains an incomplete picture of its subject. For example, at time of writing, we have only about 35 years of stock market data with only US stocks and daily intervals.
5: Any true data nerd will stop in the street to gaze in wonder at the opportunities present in this dataset.
1: Most people will never find a need for this. The very few that do will be stoked you helped put it here.
When you’re ready to upload either data or schema files, visit HOWTO Upload.