Why DSL Matters

In my last post I gave you an introduction to the SchemaType language. It was meant to raise questions and I got a few. Mostly they were centered around using DSL shortcuts instead of using pure data. In this post I’ll tell more of the SchemaType language story.

Let’s consider this JSON Schema flagship example.

{
    "$schema": "http://json-schema.org/draft-04/schema#",
    "title": "Example Schema",
    "type": "object",
    "properties": {
        "firstName": {
            "type": "string"
        },
        "lastName": {
            "type": "string"
        },
        "age": {
            "description": "Age in years",
            "type": "integer",
            "minimum": 0
        }
    },
    "required": ["firstName", "lastName"]
}

In SchemaType, this would look like:

-spec: schematype.org/v0.0.1
-from: github:schematype/type/#v0.1.2
-desc: Example Schema
firstName: +string
lastName: +string
age?: +int 0.. --Age in years

Two things come to mind. Either you love it because it is so tight, or you hate it because it is so tight! Well don’t worry, SchemaType has you covered.

The DSL syntax used above is just syntactic sugar for the more explicit syntax:

firstName:
  -type: string
lastName:
  -type: string
age:
  -type: int
  -min: 0
  -desc: Age in years
  -opt: true

You can also use a mix like this:

firstName: +string
lastName: +string
age?:
  -: +int 0..
  -desc: Age in years

For the age field we use the DSL for the type and min constraint. These go under the special - key which indicates DSL is in the value. This is needed since the outer value is an object now, instead of a single YAML string value. Then we use the explicit -desc for the description field. You might find that this form looks best in your schema.

Note that in the more explicit example above, there is still some DSL; namely the - in front of the SchemaType keywords. Of course we can work out other ways to turn that into data too, but then we approach looking like JSON Schema.
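To make the relationship between the compact and explicit forms concrete, here is a minimal Python sketch of how one compact field could be expanded. It is not the SchemaType implementation. The function name expand_field and the -max key are hypothetical, used only for illustration, and the sketch handles just the pieces shown in this post: a trailing ? on the key, a +type reference, a range, and a -- description.

```python
def expand_field(key, value):
    """Expand one compact `key: value` pair into an explicit mapping."""
    explicit = {}
    # Everything after `--` is treated as the description.
    spec, _, desc = value.partition("--")
    for token in spec.split():
        if token.startswith("+"):
            # `+int` is a type reference; drop the leading `+`.
            explicit["-type"] = token[1:]
        elif ".." in token:
            # `0..` or `1..100`; an empty side means no bound on that side.
            low, _, high = token.partition("..")
            if low:
                explicit["-min"] = int(low)
            if high:
                explicit["-max"] = int(high)  # `-max` is an assumed key name
    if desc.strip():
        explicit["-desc"] = desc.strip()
    if key.endswith("?"):
        # A trailing `?` marks the field as optional.
        explicit["-opt"] = True
        key = key[:-1]
    return key, explicit


print(expand_field("age?", "+int 0.. --Age in years"))
# ('age', {'-type': 'int', '-min': 0, '-desc': 'Age in years', '-opt': True})
```

The result for age matches the explicit form shown earlier, which is the sense in which the compact syntax is only sugar.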

The point is to make these documents, which can become massive in size, readable and maintainable. What happens when the schema you are defining has an object with keys like “type”, “description” and “required”? Well you have to figure out what level you are at to determine whether they are for your schema or for JSON Schema. With SchemaType, everything is instantly recognizable because of the DSL features.

That’s the main point that I want to convey in this post. For SchemaType to become popular it not only needs to be extremely powerful, it needs to have a flexible syntax so that people actually like working with it. You might have noticed that even though SchemaType is a YAML based format, none of the YAML quoting styles are ever used. This is on purpose. YAML only requires quoting when the first character is a space or YAML syntax character, or when there is a “ #” (space then hash) or “: ” (colon then space) sequence in the string. The SchemaType DSL syntax is careful to avoid these so that you can use unquoted strings for everything.
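The quoting rule can be sketched as a small Python check. This is a rough approximation of the rule as stated in this post, not the full YAML grammar: it ignores cases such as trailing spaces or values like true and 17 that YAML would read as non-strings. The function name needs_quoting and the two character sets are assumptions made for illustration.

```python
# Characters that act as YAML syntax when they start an unquoted value.
ALWAYS_SPECIAL = set(",[]{}#&*!|>'\"%@`")
# These three only act as syntax at the start when followed by a space,
# which is why keys like `-desc` can stay unquoted.
SPECIAL_BEFORE_SPACE = set("-?:")


def needs_quoting(text):
    """Approximate check: would this string need YAML quotes?"""
    if text == "":
        return True
    first = text[0]
    if first == " " or first in ALWAYS_SPECIAL:
        return True
    if first in SPECIAL_BEFORE_SPACE and text[1:2] == " ":
        return True
    # A space-hash starts a comment; a colon-space starts a mapping value.
    return " #" in text or ": " in text


for sample in ["+int 0.. --Age in years", "-desc", "age?", "Age: in years"]:
    print(repr(sample), needs_quoting(sample))
```

Under this approximation, the DSL strings used in the examples above pass unquoted, while a description such as "Age: in years" would not.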

One nice thing about the SchemaType syntax is that you don’t need to decide up front which style to use. The schematype CLI can reformat your schema files for you:

$ schematype format --compact some.stp
$ schematype format --explicit some.stp

In fact, SchemaType schemas get compiled before they are used. All the external references are pulled in and tied together into one big and very explicit data structure that is even more unreadable than JSON Schema. You never need to look at these files unless you are debugging something. When you publish a SchemaType schema, you are encouraged to publish the compiled form:

$ schematype compile --sign some.stp -o some.stx

This will create a digitally signed, compiled schema file that other schemas will use when they reference your hosted schema. .stx is the SchemaType extension for a compiled schema. It is actually stored as compact JSON.

Stay Tuned

I’ve written a few blog posts this week to give you an idea of where SchemaType is headed as a language and as a software development framework. The next things I’ll be working on are:

  • SchemaType Language Documentation
  • schematype CLI tool
    • compile a SchemaType schema file to the canonical immutable form
    • import a SchemaType file from an existing JSON Schema file
    • export a SchemaType file to a JSON Schema file
    • validate a data file against a schema file
    • generate various software components. Examples:
      • Go lang data structs
      • HTML input form
      • SQL schema file
      • Protobuf message definition
      • Swagger v2 schema definitions
  • Start the SchemaType v0.0.1 specification

Please join me if you have ideas to contribute or just want to play along. Cheers!

– Ingy döt Net

SchemaType Basic Language Design

In yesterday’s post I talked in broad terms about the scope, purpose and intention of SchemaType without actually showing what the language looks like. In this post I’ll fix that!

Assume you have this person.yaml file:

name: Ingy döt Net
age: 17
dead: false

The simplest way to define a person.stp (.stp is the SchemaType file extension) file might look like this:

name: string
age: integer
dead: boolean

This is the starting point of SchemaType. Let’s talk about what we have so far. First off the overall input format for SchemaType is YAML. Note that since JSON is a proper subset of YAML, you could also write SchemaType files in JSON.

The definition of an object is simply a mapping of its keys to the type of its value. This is very simple (which is good), but it’s too simple to be a real language. Let’s move forward with a few additions:

-name: /person
-desc: Data about a person
-spec: schematype.org/v0.0.1
-from: github:schematype/type/#v0.1.6

name: +human/name
age: +int 1..100
dead?: +boolean

This is a valid and complete SchemaType definition. You could publish this file for reuse at http://yourdomain.net/person.stp.

The first thing you probably noticed is the unusual punctuation: characters like -, +, ? and %. SchemaType is a YAML based DSL. The intention of the DSL is to keep the definitions concise. At the expense of learning just a few syntax concepts, you can keep simple things simple and massive things manageable.

Keys that begin with a - belong to the SchemaType language. The first 4 keys will be present in almost every SchemaType definition.

The -name field refers to the name of the type being defined in this file. It should match the base name of the file that it is stored in. A SchemaType file can define multiple types. In that case the name should be set to / and the file name should be index.stp. A condensed example of an index.stp file that defines the types type1 and type2 would look like:

-name: /
+type1: ...
+type2: ...

Interestingly, the ... is actually SchemaType syntax. It denotes a type reference of +any, i.e. any type. More about that in a future post.

The -desc field is a short description of the purpose of this schema. The -spec field indicates the version of the SchemaType Language Specification being used.

The most important keyword here is -from. Within a SchemaType definition, types are referenced by names preceded with a +. In reality, each +type1 reference expands to a full URL like https://github.com/schematype/type/blob/f54b0aa/type1. The -from field maps type shorthands to their immutable definition URLs. Again, more about that in future posts.

Back to the meat of our SchemaType example from above:

name: +human/name
age: +int 1..100
dead?: +boolean

If you see a SchemaType definition that uses +string it should be a red flag that the schema author is DoingItWrong™. The +human/name type (defined by schematype/type) inherits from +string but it places much stricter constraints on valid values. For instance, a URL, a GUID and the full text of the US Constitution are all valid strings, but none of them is a valid name of a person.

SchemaType is very big on type inheritance. SchemaType only defines 5 types in its spec: String, Number, Boolean, Null, and Any, but none of these are ever referenced directly in schemas.

Next consider the age field. This is actually creating a new temporary or anonymous type. It inherits from +int but then places a range constraint on it. Another way to do this would be:

name: +human/name
age: +age
dead?: +boolean

+age: +int 1..100

The difference here is that we named the subtype. It is therefore available publicly as person/age in our schema publication.
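The range constraint on +int can be sketched in Python as follows. This is an illustration only, assuming both bounds are inclusive and that an omitted bound (as in 0..) means unbounded on that side. The function names parse_range and in_range are hypothetical.

```python
def parse_range(text):
    """Parse "1..100" or "0.." into (low, high); None means unbounded."""
    low, sep, high = text.partition("..")
    if not sep:
        raise ValueError(f"not a range: {text!r}")
    return (int(low) if low else None, int(high) if high else None)


def in_range(value, text):
    """Check a value against a range, treating both bounds as inclusive."""
    low, high = parse_range(text)
    if low is not None and value < low:
        return False
    if high is not None and value > high:
        return False
    return True


print(in_range(17, "1..100"))   # True
print(in_range(0, "1..100"))    # False
print(in_range(250, "0.."))     # True
```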

Last, but certainly not least, is the ? in dead?. It simply means that the dead key/value pair is optional. This implies that name and age are required. And that’s true. All keys in a definition are required by default. This stands in contrast to JSON Schema where all pairs are optional by default and unspecified keys are allowed. To allow any extra key/value pair in SchemaType you need to declare it like this:

name: +human/name
age: +int 1..100
dead?: +boolean
...: ...

There’s that ... again. This means that any key is allowed to map to any value type.
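The key rules described above (required by default, ? for optional, ... to allow extras) can be sketched as a small Python check. This is only an illustration of those rules, not a SchemaType validator: it looks at keys and ignores value types. The function name check_keys is hypothetical.

```python
def check_keys(schema_keys, data):
    """Return a list of problems with the keys of `data`.

    schema_keys are written as in the schema: "name", "dead?", "...".
    """
    allow_extra = "..." in schema_keys
    optional = {k[:-1] for k in schema_keys if k.endswith("?")}
    required = {
        k for k in schema_keys
        # Skip the wildcard, optional keys, and -keyword / +type entries.
        if k != "..." and not k.endswith("?") and not k.startswith(("-", "+"))
    }
    present = set(data)
    problems = [f"missing required key: {k}" for k in sorted(required - present)]
    if not allow_extra:
        extras = present - required - optional
        problems += [f"unexpected key: {k}" for k in sorted(extras)]
    return problems


person = {"name": "Ingy döt Net", "age": 17, "nick": "ingy"}
print(check_keys(["name", "age", "dead?"], person))
# ['unexpected key: nick']
print(check_keys(["name", "age", "dead?", "..."], person))
# []
```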

That wraps up the explanation of a very basic SchemaType document. For comparison, let’s look at the JSON Schema equivalent of our example:

{
    "$schema": "http://json-schema.org/draft-04/schema#",
    "title": "Person",
    "type": "object",
    "properties": {
        "name": {
            "$ref": "https://example.com/json.schema#human/name"
        },
        "age": {
            "type": "integer",
            "minimum": 1,
            "maximum": 100
        },
        "dead": {
            "type": "boolean"
        }
    },
    "additionalProperties": false,
    "required": [
      "name",
      "age"
    ]
}

This is a rough equivalent. This example is not so bad because you can see the whole thing on one page. In reality, JSON Schema gets unwieldy fast. A primary reason is that things like required are not made known in the place where the key is defined. I’ve seen many files where that information is thousands of lines apart. The other thing is that it is hard to tell which parts are the JSON Schema language words and which parts are the data you are defining. SchemaType tries to make all this easy, compact and less painful.
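The difference in where required lives shows up clearly when you translate one form into the other. Here is a minimal Python sketch that turns explicit per-field definitions (using the -type, -min, -desc and -opt keys) into a JSON Schema style structure. It is not the planned export tool. The function name to_json_schema and the small type table are assumptions for illustration, and only plain types are handled.

```python
import json

# Assumed mapping from the short type names to JSON Schema type names.
TYPE_NAMES = {"string": "string", "int": "integer", "boolean": "boolean"}


def to_json_schema(title, fields):
    """Build a JSON Schema style dict from explicit field definitions."""
    properties, required = {}, []
    for key, spec in fields.items():
        prop = {}
        if "-desc" in spec:
            prop["description"] = spec["-desc"]
        prop["type"] = TYPE_NAMES[spec["-type"]]
        if "-min" in spec:
            prop["minimum"] = spec["-min"]
        properties[key] = prop
        # Per-field optionality has to be gathered into one distant list.
        if not spec.get("-opt", False):
            required.append(key)
    return {
        "title": title,
        "type": "object",
        "properties": properties,
        # Undeclared keys are rejected by default in SchemaType.
        "additionalProperties": False,
        "required": required,
    }


fields = {
    "firstName": {"-type": "string"},
    "lastName": {"-type": "string"},
    "age": {"-type": "int", "-min": 0, "-desc": "Age in years", "-opt": True},
}
print(json.dumps(to_json_schema("Example Schema", fields), indent=4))
```

Note how the optional flag sits next to each field on the input side, but ends up in a single required list on the output side.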

This post is an introduction to the language basics. It was intended to raise more questions than it answers. There is much more to cover, but hopefully that’s enough to start you thinking about SchemaType!

SchemaType! Can you define that for me?

SchemaType is a new definition language for data. More specifically, it targets the data formats of the web that don’t already have definitions: formats like YAML, JSON and CSV. Basically any format where the data is (or can be) represented as “objects”; i.e. mappings of known string keys to typed values. There’s more to the language of course, but defining objects is certainly a core principle.

SchemaType, as the name implies, defines schemas and types. Effectively schemas and types are the same thing: named structural and semantic definitions of data points. The word “type” is usually applied to single values; things like “string”, “integer”, “boolean” and “date”. The term “schema”, on the other hand, is applied to data collection types; things like “address”, which might be composed of “street”, “city”, “state”, and “postal-code”. The SchemaType language is concerned with the definitions of both.

Define the Data Web by 2018

The SchemaType project would like to define all open, undefined data in the next 2 years. This is a lofty goal, but the return on investment is grand, as I’ll try to show. Also, it turns out that schema can be substantially generated from valid data, so the data of the web could be crawled and a massive set of schemas could be started.
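As a rough illustration of generating schema from valid data, here is a minimal Python sketch that infers basic type names from sample records. It is an assumption-laden toy, not the project's crawler: the function names are hypothetical, only flat records are handled, and when records disagree on a type the first one seen wins.

```python
# Exact Python types mapped to basic type names; anything else is "any".
BASIC_TYPES = {bool: "boolean", int: "integer", float: "number", str: "string"}


def infer_schema(record):
    """Infer a key -> type name mapping from one flat record."""
    return {key: BASIC_TYPES.get(type(value), "any") for key, value in record.items()}


def infer_from_records(records):
    """Merge several records; keys missing from some records get a `?`."""
    types, counts = {}, {}
    for record in records:
        for key, type_name in infer_schema(record).items():
            types.setdefault(key, type_name)  # first type seen wins
            counts[key] = counts.get(key, 0) + 1
    return {
        (key if counts[key] == len(records) else key + "?"): type_name
        for key, type_name in types.items()
    }


people = [
    {"name": "Ingy döt Net", "age": 17, "dead": False},
    {"name": "Someone Else", "age": 42},
]
print(infer_from_records(people))
# {'name': 'string', 'age': 'integer', 'dead?': 'boolean'}
```

A generated schema like this is only a starting point; a person still has to tighten the types and add the richer hints.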

How would this be useful though? Glad you asked. It is my belief that software development is moving away from people writing code, and toward having computers write (generate) code from data. Writing and maintaining the same kinds of code for every data source in many programming languages becomes infeasible. There are many types of code that can be written for you, and written better and faster and more continuously than any human possibly could. If you accept this logic, then the only programming left to do is writing the code that generates more code.

I think we are a long way off from replacing software development with code generation, but we can make inroads immediately. What’s missing? Well the biggest missing part is accurately defined data, i.e. Schema. Schema needs a language that can move beyond basic validation, and provide richer semantic info (I like to call them “hints”) on what code should be generated in various contexts. For instance, declaring that an object property is a “string” is fairly useless. Saying that it is a multi-line, printable Unicode text paragraph (which happens to inherit from string), can provide an input UI generator with the right hint to make a good UI.

Here are a few of the kinds of code generation that I see Schema helping make possible:

  • Data validation (obviously)
  • End user input UI
  • Data visualization
  • Test data generation
  • Language specific “typedef” code
  • Translators between formats
  • RDBMS modeling and SQL generation
  • Protobuf definition generators

I’ll talk more about each of these points in upcoming posts.

Why SchemaType?

As one of the creators of YAML, I have known for a decade that Schema is the next big effort. It seems that now, in the middle of 2016, all the stars have aligned, the mass has become critical and the time is ripe for SchemaType.

JSON Schema is the closest effort to compare SchemaType to. It has gained some acceptance in various places but seems to have stalled out in forward progress. Still it is a good starting point to research, as much time has been invested in it. I believe SchemaType will offer at least these advantages:

  • Simpler to author, read and maintain
    • JSON Schema gets too complex too quickly
    • SchemaType docs look closer to the data they are defining
  • Easy to publish and share important schemas
    • Like GitHub, forking schemas is a good thing
  • Leverage ready-made code generation tools
    • SchemaType will have an ever growing set of tools
  • Provide type/schema definition immutability
    • Schemas import other schemas at specific commit points
  • Provide digital signing to published open types
    • Know, verify and trust the sources of your import schemas
  • Promote multiple authorities of trust
    • Anyone can define, host, fork the base schemas
  • International
    • SchemaType will consider I18N issues from the get-go

To get things rolling with SchemaType, early implementations will be able not only to import JSON Schema to SchemaType but also to export SchemaType definitions back to JSON Schema. Therefore if you are already invested in JSON Schema, it will be easy to either make the switch, or simply use the more expressive and elegant SchemaType and then save it back to JSON Schema.

Join The Schema Revolution

Strong type definitions for simple open data sources, combined with code generators for the common parts of software development, are the future; and you can help bring it about sooner. SchemaType will be defined and implemented in the open. If you are interested please join us.

Yours truly, Ingy döt Net.