Basic Data Types

Introduction

OCaml is a statically and strongly typed programming language. It is also an expression-oriented language: everything is a value, and every value has a type. Functions and types are the two foundational principles of OCaml. The OCaml type system is highly expressive, providing many advanced constructs, yet it is easy to use and unobtrusive. Thanks to type inference, programs can be written without type annotations, except for documentation purposes and a few corner cases. The basic types and the type combination operations enable a vast range of possibilities.

This tutorial begins with a section presenting the types which are predefined in OCaml. It starts with atomic types such as integers and booleans, then presents predefined compound types such as strings and lists. The tutorial ends with a section about user-defined types: variants and records.

OCaml provides several other types, but they are all extensions of those presented in this tutorial. The types in the scope of this tutorial are all the basic constructors and the most common predefined types.

Prerequisites and Goals

This is an intermediate-level tutorial. The only prerequisite is to have completed the Get Started series of tutorials.

The goal of this tutorial is to provide the following capabilities:

Handle data of all predefined types using dedicated syntax
Write variant type definitions: simple, recursive and polymorphic
Write record type definitions
Write type aliases
Use pattern matching to define functions

Predefined Types

Integers, Floats, Booleans and Characters

Integers

Here is an integer:

# 42;;
- : int = 42

The int type is the default and basic type of integer numbers in OCaml. It represents platform-dependent signed integers: int does not always have the same number of bits, depending on underlying platform characteristics such as processor architecture or operating system. Operations on int values are provided by the Stdlib and the Int modules.

Usually, int has 31 bits on 32-bit architectures and 63 bits on 64-bit architectures, because one bit is reserved for OCaml's runtime operation. The standard library also provides the Int32 and Int64 modules, which support platform-independent operations on 32-bit and 64-bit signed integers. These modules are not detailed in this tutorial.

There are no dedicated types for unsigned integers in OCaml. Bitwise operations on int treat the sign bit the same as the other bits. Binary operators use standard symbols; the signed remainder is written mod. There is no predefined power operator on integers in OCaml.
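As a rough illustration of the points above, the following sketch prints the width of int on the current platform, shows that mod keeps the sign of the dividend, and defines an integer power function. The name pow is hypothetical; it is not part of the standard library.

```ocaml
(* Hypothetical helper, not in the standard library:
   integer power by repeated squaring. *)
let rec pow base exp =
  if exp < 0 then invalid_arg "pow: negative exponent"
  else if exp = 0 then 1
  else
    let half = pow base (exp / 2) in
    if exp mod 2 = 0 then half * half else base * half * half

let () =
  (* Sys.int_size is the number of bits of int on this platform. *)
  Printf.printf "int has %d bits\n" Sys.int_size;
  (* The remainder has the sign of the dividend. *)
  Printf.printf "-7 mod 2 = %d\n" ((-7) mod 2);
  Printf.printf "2 to the power 10 = %d\n" (pow 2 10)
```

Overflow is not checked in this sketch: like other int arithmetic, results silently wrap around.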

Floats and Type Conversions

Fixed-size floating-point numbers have type float. Operations on float comply with the IEEE 754 standard, with 53 bits of mantissa and an exponent ranging from -1022 to 1023.

OCaml does not perform any implicit type conversion between values. Therefore, arithmetic expressions can't mix integers and floats: parameters are either all int or all float. Arithmetic operators on float are not the same as those on int; they are written with a dot suffix: +., -., *. and /. respectively.

# let pi = 3.14159;;
val pi : float = 3.14159

# let tau = 2.0 *. pi;;
val tau : float = 6.28318

# let tau = 2 *. pi;;
Error: This expression has type int but an expression was expected of type
         float

# let tau = 2 * pi;;
Error: This expression has type float but an expression was expected of type
         int

Operations on float are provided by the Stdlib and the Float modules.
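Since there is no implicit conversion, mixing the two kinds of numbers requires an explicit call. Here is a minimal sketch using the Stdlib functions float_of_int and int_of_float; the names tau and truncated are only illustrative.

```ocaml
let pi = 3.14159

(* Convert the integer explicitly before using a float operator. *)
let tau = float_of_int 2 *. pi

(* int_of_float truncates toward zero. *)
let truncated = int_of_float 6.9

let () = Printf.printf "tau = %f, truncated = %d\n" tau truncated
```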

Booleans

Boolean values are represented by the type bool.

# true;;
- : bool = true

# false;;
- : bool = false

# false < true;;
- : bool = true

Operations on bool are provided by the Stdlib and the Bool modules. Conjunction (“and”) is written && and disjunction (“or”) is written ||. Neither evaluates its right argument if the value of its left argument is sufficient to decide the value of the whole expression.
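Skipping the right argument is what makes guards like the following safe. This is only a sketch; the function name is made up for the example.

```ocaml
(* Hypothetical example: the division is never evaluated when b = 0,
   because the conjunction stops as soon as its left argument is false. *)
let quotient_is_positive a b = b <> 0 && a / b > 0

let () =
  Printf.printf "%b %b\n" (quotient_is_positive 6 3) (quotient_is_positive 6 0)
```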

Characters

Values of type char correspond to the 256 symbols defined in the ISO/IEC 8859-1 standard. Character literals are surrounded by single quotes. Here is an example.

# 'a';;
- : char = 'a'

Operations on char values are provided by the Stdlib and the Char modules.

The module Uchar provides support for Unicode characters.
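A few functions from the Char module, shown here as a small sketch (the variable names are illustrative):

```ocaml
(* Position of 'a' in the character set. *)
let code = Char.code 'a'

(* Character found at position 65. *)
let letter = Char.chr 65

(* Upper case conversion, restricted to ASCII letters. *)
let upper = Char.uppercase_ascii 'a'

let () = Printf.printf "%d %c %c\n" code letter upper
```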

Strings & Byte Sequences

Strings

Strings are finite and fixed-sized sequences of values of type char. Strings are immutable: it is impossible to change the value of a character inside a string. The string concatenation operator has symbol ^.

# "hello" ^ " " ^ "world!";;
- : string = "hello world!"

Indexed access to string characters is possible using the following syntax:

# "buenos dias".[4];;
- : char = 'o'

Operations on string values are provided by the Stdlib and the String modules.
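For instance, here is a short sketch using a few functions of the String module. Since strings are immutable, each call returns a new string; the variable names are illustrative.

```ocaml
let greeting = "buenos dias"

(* Number of characters in the string. *)
let len = String.length greeting

(* Substring of 6 characters starting at index 0. *)
let first_word = String.sub greeting 0 6

(* A new string; greeting itself is unchanged. *)
let shout = String.uppercase_ascii greeting

let () = Printf.printf "%d %s %s\n" len first_word shout
```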

Byte Sequences

Byte sequences are finite and fixed-sized sequences of bytes. Each individual byte is represented by a char value. Byte sequences are mutable: they can't be extended or shortened, but each component byte may be updated. Essentially, a byte sequence is a mutable string that can't be printed. There is no way to write a bytes value literally; it must be produced by a function.

# String.to_bytes "hello";;
- : bytes = Bytes.of_string "hello"

Operations on bytes values are provided by the Stdlib and the Bytes modules. Only the function Bytes.get allows direct access to the characters contained in a byte sequence. There is no direct access operator on byte sequences.
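Here is a minimal sketch of in-place update, using Bytes.set together with Bytes.get and the conversions from and to string (the variable names are illustrative):

```ocaml
let b = Bytes.of_string "hello"

(* Update the first byte in place. *)
let () = Bytes.set b 0 'j'

let first = Bytes.get b 0

(* Copy the current content into an immutable string. *)
let s = Bytes.to_string b

let () = Printf.printf "%c %s\n" first s
```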

Arrays & Lists

Arrays

Arrays are finite and fixed-sized sequences of values of the same type. Here are a couple of examples:

# [| 0; 1; 2; 3; 4; 5 |];;
- : int array = [|0; 1; 2; 3; 4; 5|]

# [| 'a'; 'b'; 'c' |];;
- : char array = [|'a'; 'b'; 'c'|]

# [| "foo"; "bar"; "baz" |];;
- : string array = [|"foo"; "bar"; "baz"|]

Arrays may contain values of any type. Here the arrays are int array, char array and string array, but any type of data can be used in an array. Usually, array is said to be a polymorphic type. Strictly speaking, it is a type operator: it accepts a type as parameter (here int, char and string) to form another type (those inferred here). This is the empty array.

# [||];;
- : 'a array = [||]

Here 'a means “any type”. It is called a type variable and is usually pronounced as if it were the Greek letter α (“alpha”). This is the type parameter, meant to be replaced by another type.

Like strings and byte sequences, arrays support direct access, but the syntax is not the same.

# [| 'a'; 'b'; 'c' |].(2);;
- : char = 'c'

Arrays are mutable: they can't be extended or shortened, but each component value may be updated.

# let a = [| 'a'; 'b'; 'c'; 'd' |];;
val a : char array = [|'a'; 'b'; 'c'; 'd'|]

# a.(2) <- '3';;
- : unit = ()

# a;;
- : char array = [|'a'; 'b'; '3'; 'd'|]

Operations on arrays are provided by the Array module. There is a dedicated tutorial on Arrays.
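As a small sketch of the Array module, the following creates an array of a given length, updates one cell, then maps and folds over it (the variable names are illustrative):

```ocaml
(* An array of five zeros; the length is fixed at creation. *)
let counts = Array.make 5 0

(* Update one cell in place. *)
let () = counts.(2) <- 7

(* Array.map builds a new array and leaves counts untouched. *)
let doubled = Array.map (fun x -> 2 * x) counts

let total = Array.fold_left (+) 0 doubled

let () = Printf.printf "length = %d, total = %d\n" (Array.length counts) total
```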

Lists

As literals, lists are very much like arrays. Here are the same examples as previously, turned into lists.

# [ 0; 1; 2; 3; 4; 5 ];;
- : int list = [0; 1; 2; 3; 4; 5]

# [ 'a'; 'b'; 'c' ];;
- : char list = ['a'; 'b'; 'c']

# [ "foo"; "bar"; "baz" ];;
- : string list = ["foo"; "bar"; "baz"]

Like arrays, lists are finite sequences of values of the same type. They are polymorphic too. However, lists are extensible, immutable and don't support direct access to all the values they contain. Lists play a central role in functional programming; they are the subject of a dedicated tutorial.

Operations on lists are provided by the List module. The List.append function, which concatenates two lists, can also be used as an operator with the symbol @.

Two symbols are of special importance with respect to lists.

The empty list is written [], has type 'a list and is pronounced “nil”
The list constructor operator, written :: and pronounced “cons”, is used to add a value at the head of a list

Together, they are the basic means to build lists and access the data stored in lists. For instance, here is how lists are built by successively applying the cons operator.

# 3 :: [];;
- : int list = [3]

# 2 :: 3 :: [];;
- : int list = [2; 3]

# 1 :: 2 :: 3 :: [];;
- : int list = [1; 2; 3]

Pattern matching provides the basic means to access data stored inside a list.

# match [1; 2; 3] with 
  | x :: u -> x
  | [] -> raise Exit;;
- : int = 1

# match [1; 2; 3] with 
  | x :: y :: u -> y
  | x :: u -> x
  | [] -> raise Exit;;
- : int = 2

In the above expressions, [1; 2; 3] is the value which is matched over. Each expression between the | and -> symbols is a pattern. Patterns are expressions of type list, formed using only [], :: and variable names; they represent the various shapes a list may have. When the pattern is [], it means “if the list is empty”. When the pattern is x :: u, it means “if the list contains data, let x be the first element of the list and u be the rest of the list”. Expressions at the right of the -> symbols are the results returned in each corresponding case.

There is a dedicated tutorial on Lists.
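Pattern matching on [] and :: combines naturally with recursion. The following sketch sums the elements of a list and uses @ to join two lists; the name sum is chosen for the example.

```ocaml
(* Hypothetical example: sum the elements of an int list. *)
let rec sum u =
  match u with
  | [] -> 0
  | x :: rest -> x + sum rest

(* The @ operator concatenates two lists and returns a new one. *)
let joined = [1; 2] @ [3; 4]

let () = Printf.printf "%d\n" (sum joined)
```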

Options & Results

Options

The option type is also a polymorphic type. Option values can store any kind of data, or represent the absence of any such data. Option values can only be constructed in two different ways: either None when no data is available, or Some otherwise.

# None;;
- : 'a option = None

# Some 42;;
- : int option = Some 42

# Some "hello";;
- : string option = Some "hello"

Here is an example of pattern matching on an option value.

# match Some 42 with None -> raise Exit | Some x -> x;;
- : int = 42

Operations on options are provided by the Option module. Options are discussed in the Error Handling guide.
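A typical use of option is a function that may have nothing to return. In this sketch, first_opt is a made-up name, while Option.value comes from the Option module.

```ocaml
(* Hypothetical example: first element of a list, if any. *)
let first_opt u =
  match u with
  | [] -> None
  | x :: _ -> Some x

(* Option.value extracts the data or falls back to a default. *)
let a = Option.value (first_opt [1; 2; 3]) ~default:0
let b = Option.value (first_opt []) ~default:0

let () = Printf.printf "%d %d\n" a b
```

Unlike the match ... raise Exit examples above, this version lets the caller decide what to do with an empty list.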

Results

When it makes sense to mark the outcome of a function as being either a failure or a success, the result type can do it. There are only two ways to build a result value: using either Ok or Error, with the intended meaning. Both constructors can hold any kind of data. The result type is polymorphic, but it has two type parameters: one for Ok values, another for Error values.

# Ok 42;;
- : (int, 'a) result = Ok 42

# Error "Sorry";;
- : ('a, string) result = Error "Sorry"

Operations on results are provided by the Result module. Results are discussed in the Error Handling guide.
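Here is a sketch of a function reporting failure as a value rather than as an exception, and of a caller handling both cases. The names safe_div and describe are hypothetical.

```ocaml
(* Hypothetical example: division that reports failure as a value. *)
let safe_div a b =
  if b = 0 then Error "division by zero" else Ok (a / b)

(* Both outcomes must be handled by the caller. *)
let describe r =
  match r with
  | Ok q -> "quotient: " ^ string_of_int q
  | Error msg -> "error: " ^ msg

let () =
  print_endline (describe (safe_div 84 2));
  print_endline (describe (safe_div 1 0))
```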

Tuples

Here is a tuple, actually a pair.

# (3, 'a');;
- : int * char = (3, 'a')

This is a pair containing the integer 3 and the character 'a'; its type is int * char. The * symbol stands for product type.

This generalizes to tuples with 3 or more components. For instance, (6.28, true, "hello") has type float * bool * string. The types int * char and float * bool * string are called product types.

The predefined function fst returns the first component of a pair, while snd returns the second component of a pair.

# fst (3, 'a');;
- : int = 3

# snd (3, 'a');;
- : char = 'a'

In the standard library, both are defined using pattern matching. Here is how a function extracting the third component of a product of four types can be defined.

# let f x = match x with (a, b, c, d) -> c;;
val f : 'a * 'b * 'c * 'd -> 'c = <fun>

Note that the product type operator * is not associative. The types int * char * bool, int * (char * bool) and (int * char) * bool are not the same, and the values (42, 'a', true), (42, ('a', true)) and ((42, 'a'), true) are not equal.
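The following sketch shows tuple patterns in function parameters and in let bindings, and how a triple and a nested pair require different patterns. All names are made up for the example.

```ocaml
(* Hypothetical example: exchange the components of a pair. *)
let swap (a, b) = (b, a)

(* A let binding can take a tuple apart directly. *)
let (number, letter) = swap ('a', 3)

(* A triple and a nested pair hold the same data but have different
   shapes, so they need different patterns. *)
let third_of_triple (_, _, c) = c
let third_of_nested (_, (_, c)) = c

let () = Printf.printf "%d %c\n" number letter
```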

Functions

The type of functions from type a to type b is written a -> b. Here are a few examples:

# fun x -> x * x;;
- : int -> int = <fun>

# (fun x -> x * x) 9;;
- : int = 81

The first expression is an anonymous function of type int -> int. The type is inferred from the expression x * x, which must be of type int since * is an operator which takes and returns int values. The <fun> printed in place of the value is a token meaning that functions don't have a value to be displayed. This is because, once they have been compiled, their code may not be available.

The second expression is a function application: the function is applied to the argument 9, and the result 81 is returned.

# fun x -> x;;
- : 'a -> 'a = <fun>

# (fun x -> x) 42;;
- : int = 42

# (fun x -> x) "This is really disco!";;
- : string = "This is really disco!"

The first expression is another anonymous function: the identity function, which returns its argument unchanged. This function can be applied to anything, since anything can be returned unchanged. This means the parameter of that function can be of any type, and the result must have the same type. This is called polymorphism: the same code can be applied to data of different types.

This is what is indicated by the 'a in the type (pronounced as the Greek letter α, “alpha”). This is a type variable. It means values of any type can be passed to the function. When that happens, their type is substituted for the type variable. This also expresses that identity has the same input and output type, whatever it may be.

The last two expressions show that the identity function can indeed be applied to arguments of different types.

# let f = fun x -> x * x;;
val f : int -> int = <fun>

# f 9;;
- : int = 81

Defining a function is the same as giving a name to any value. This is what is illustrated in the first expression.

# let g x = x * x;;
val g : int -> int = <fun>

# g 9;;
- : int = 81

When writing in OCaml, a lot of functions are written. The function g is defined here using a shorter, more common and maybe more intuitive syntax.

In OCaml, functions may terminate without returning a value of the expected type by raising an exception. This does not appear in their type. There is no way to know if a function may raise an exception without inspecting its code.

# raise;;
- : exn -> 'a = <fun>

Functions may have several parameters.

# fun a b -> a ^ " " ^ b;;  
- : string -> string -> string = <fun>

# let mean a b = (a + b) / 2;;
val mean : int -> int -> int = <fun>

Like the product type symbol *, the function type symbol -> is not associative. These two types are not the same:

(int -> int) -> int: this is a function taking a function of type int -> int as parameter and returning an int as result
int -> (int -> int): this is a function taking an int as parameter and returning a function of type int -> int as result
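To make the difference concrete, here is a sketch with one function of each type. The names at_zero, add and add_ten are hypothetical, and the type annotations are only there for documentation.

```ocaml
(* (int -> int) -> int: takes a function, returns an int. *)
let at_zero (f : int -> int) : int = f 0

(* int -> (int -> int): takes an int, returns a function. *)
let add (a : int) : int -> int = fun b -> a + b

(* Passing only the first argument yields a new function. *)
let add_ten = add 10

let () = Printf.printf "%d %d\n" (at_zero add_ten) (add_ten 32)
```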

Unit

A single value has type unit. It is written () and pronounced “unit”.

The unit type has several uses. One of its main roles is to serve as a token when a function does not need to be passed data or doesn't have any data to return once it has completed its computation. This happens when functions have side effects such as OS-level I/O. Functions need to be applied to something for their computation to be triggered, and they must also return something. When nothing meaningful can be passed or returned, () should be used.

# read_line;;
- : unit -> string = <fun>

# print_endline;;
- : string -> unit = <fun>

The function read_line reads an end-of-line terminated sequence of characters from standard input and returns it as a string. Reading input begins when () is passed.

The function print_endline prints the string followed by a line ending on standard output. The return of the unit value means the output request has been queued by the operating system.
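User-defined functions follow the same convention. In this sketch, greet is a made-up function which neither receives nor returns data.

```ocaml
(* Hypothetical example: no data to receive, no data to return. *)
let greet () = print_endline "hello"

(* Defining greet prints nothing; printing happens when () is passed. *)
let () = greet ()

(* The result is the unit value and can be compared like any other value. *)
let same = (greet () = ())
```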

User-Defined Types

Variants

Enumerated Data Types

The simplest form of a variant type corresponds to an enumerated type. It is defined by an explicit list of named values. Defined values are called constructors and must be capitalized.

For example, here is how variant data types could be defined to represent Dungeons & Dragons character classes and alignments.

# type character_class =
    | Barbarian
    | Bard
    | Cleric
    | Druid
    | Fighter
    | Monk
    | Paladin
    | Ranger
    | Rogue
    | Sorcerer
    | Warlock
    | Wizard;;
type character_class =
    Barbarian
  | Bard
  | Cleric
  | Druid
  | Fighter
  | Monk
  | Paladin
  | Ranger
  | Rogue
  | Sorcerer
  | Warlock
  | Wizard
  
# type character_alignment =
    | Lawful_good
    | Neutral_good
    | Chaotic_good
    | Lawful_neutral
    | Neutral
    | Chaotic_neutral
    | Lawful_evil
    | Neutral_evil
    | Chaotic_evil;;
type character_alignment =
    Lawful_good
  | Neutral_good
  | Chaotic_good
  | Lawful_neutral
  | Neutral
  | Chaotic_neutral
  | Lawful_evil
  | Neutral_evil
  | Chaotic_evil

Variant types of this kind can also be used to represent week days, cardinal directions or any other fixed-size set of values that can be given names. A total ordering is defined on values, following the definition order (e.g. Druid < Ranger).
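The ordering can be observed with the comparison operators and with compare. The direction type below is a hypothetical example, not one of the types defined above.

```ocaml
(* Hypothetical enumerated type, declared in clockwise order. *)
type direction = North | East | South | West

(* Comparison follows declaration order. *)
let north_first = North < East

(* Positive, because West is declared after South. *)
let order = compare West South

(* Sorting therefore restores the declaration order. *)
let sorted = List.sort compare [West; North; South; East]
```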

Here is how pattern matching can be done on types defined as such.

# let morality = function
    | Lawful_good -> 1
    | Neutral_good -> 1
    | Chaotic_good -> 1
    | Lawful_neutral -> 0
    | Neutral -> 0
    | Chaotic_neutral -> 0
    | Lawful_evil -> -1
    | Neutral_evil -> -1
    | Chaotic_evil -> -1;; 
val morality : character_alignment -> int = <fun>

Note that:

unit is an enumerated type: a variant with a unique constructor, ().
bool is also an enumerated type: a variant with two constructors, true and false.

A pair (x, y) has type a * b where a is the type of x and b is the type of y. Some may find it intriguing that a * b is called a product. Although this is not a complete explanation, here is a remark which may help understanding. Consider the product type character_class * character_alignment. There are 12 classes and 9 alignments. Any pair of values from those types inhabits the product type. Therefore, in the product type, there are 9 × 12 = 108 values, which also is a product.

Constructors With Data

It is possible to wrap data in constructors. The following type has several constructors with data and some without. It represents the different means to refer to a Git commit.

# type commit =
  | Hash of string
  | Tag of string
  | Branch of string
  | Head
  | Fetch_head
  | Orig_head
  | Merge_head;;
type commit =
    Hash of string
  | Tag of string
  | Branch of string
  | Head
  | Fetch_head
  | Orig_head
  | Merge_head

Here is how pattern matching can be used to write a function from commit to string:

# let commit_to_string = function
  | Hash sha -> sha
  | Tag name -> name
  | Branch name -> name
  | Head -> "HEAD"
  | Fetch_head -> "FETCH_HEAD"
  | Orig_head -> "ORIG_HEAD"
  | Merge_head -> "MERGE_HEAD";;
val commit_to_string : commit -> string = <fun>

Here, the function ... construct is used instead of the match ... with ... construct. Previously, example functions had the form let f x = match x with ... and the variable x did not appear after any of the -> symbols. When this is the case, the function ... construct can be used instead. It stands for fun x -> match x with ... and saves finding a name which would be used right after and only once.
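The equivalence can be seen by writing the same function both ways. The shape type and both function names below are hypothetical and only serve this comparison.

```ocaml
(* Hypothetical type used only for this comparison. *)
type shape = Point | Square of float | Rectangle of float * float

(* Written with match: the parameter s is named, then matched at once. *)
let area_match s =
  match s with
  | Point -> 0.0
  | Square side -> side *. side
  | Rectangle (w, h) -> w *. h

(* Written with function: same behaviour, no name needed. *)
let area_function = function
  | Point -> 0.0
  | Square side -> side *. side
  | Rectangle (w, h) -> w *. h
```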

Recursive Variants

A variant definition referring to itself is recursive. A constructor may wrap data from the type being defined.

This is the case of the following definition, which can be used to store JSON values. Here is what it can look like:

# type json =
  | Null
  | Bool of bool
  | Int of int
  | Float of float
  | String of string
  | Array of json list
  | Object of (string * json) list;;

Both constructors Array and Object contain values of type json.

Functions defined using pattern matching on recursive variants are often recursive too. This function checks if a name is present in a whole JSON tree.

# let rec has_field name = function
  | Array u -> 
      List.fold_left (fun b obj -> b || has_field name obj) false u
  | Object u ->
      List.fold_left (fun b (key, obj) -> b || key = name || has_field name obj) false u
  | _ -> false;;

Here, the last pattern is using the symbol _ which catches everything. It allows returning false on all data which is neither Array nor Object.
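Here is another recursive function over the same type, as a sketch: it computes how deeply arrays and objects are nested. The type is repeated so that the example is self-contained, and depth is a name chosen for the example.

```ocaml
type json =
  | Null
  | Bool of bool
  | Int of int
  | Float of float
  | String of string
  | Array of json list
  | Object of (string * json) list

(* Hypothetical helper: nesting depth of a JSON value.
   Atomic values have depth 0; each array or object adds one level. *)
let rec depth = function
  | Array u -> 1 + List.fold_left (fun d v -> max d (depth v)) 0 u
  | Object u -> 1 + List.fold_left (fun d (_, v) -> max d (depth v)) 0 u
  | _ -> 0

let sample = Object [ ("tags", Array [ String "a"; String "b" ]); ("id", Int 1) ]

let () = Printf.printf "%d\n" (depth sample)
```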

Polymorphic Data Types

Revisiting Predefined Types

The predefined type option is defined as a variant type, with two constructors: Some and None. It can contain values of any type, such as Some 42 or Some "hola". The variant option is polymorphic. Here is how it is defined in the standard library:

#show option;;
type 'a option = None | Some of 'a

The predefined type list is also a polymorphic variant with two constructors. Here is how it is defined in the standard library:

#show list;;
type 'a list = [] | (::) of 'a * 'a list

The only bit of magic here is the trick turning constructors into symbols. This is left unexplained in this tutorial. The types bool and unit also are regular variants, with the same magic:

#show unit;;
type unit = ()

#show bool;;
type bool = false | true

Implicitly, product types also behave as variant types. For instance, pairs can be seen as inhabitants of this type:

# type ('a, 'b) pair = Pair of 'a * 'b;;
type ('a, 'b) pair = Pair of 'a * 'b

Where (int, bool) pair would be written int * bool and Pair (42, true) would be written (42, true). From the developer's perspective, everything happens as if such a type were declared for every possible product shape. This is what allows pattern matching on products.

Even integers and floats can be seen as enumerated-like variant types, with many constructors and funky syntactic sugar. This is what allows pattern matching on those types.

In the end, the only type construction which does not reduce to a variant is the function arrow type. There is no pattern matching on functions.

User-Defined Polymorphic Types

Here is an example of a variant type which combines constructors with data and without data, polymorphism and recursion.

# type 'a tree =
  | Leaf
  | Node of 'a * 'a tree * 'a tree;;
type 'a tree = Leaf | Node of 'a * 'a tree * 'a tree

It can be used to represent arbitrary labelled binary trees. Using pattern matching, here is how a map function can be defined on this type:

# let rec map f = function
  | Leaf -> Leaf
  | Node (x, lft, rht) -> Node (f x, map f lft, map f rht);;
val map : ('a -> 'b) -> 'a tree -> 'b tree = <fun>
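As a usage sketch, the following builds a small tree, applies map to it and counts its labels. The definitions of tree and map are repeated to keep the example self-contained; size, t and labels are hypothetical names.

```ocaml
type 'a tree =
  | Leaf
  | Node of 'a * 'a tree * 'a tree

let rec map f = function
  | Leaf -> Leaf
  | Node (x, lft, rht) -> Node (f x, map f lft, map f rht)

(* Hypothetical helper: number of labels stored in the tree. *)
let rec size = function
  | Leaf -> 0
  | Node (_, lft, rht) -> 1 + size lft + size rht

let t = Node (1, Node (2, Leaf, Leaf), Node (3, Leaf, Leaf))

(* The labels change type from int to string, the shape is preserved. *)
let labels = map string_of_int t
```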

Remark: OCaml has something called Polymorphic Variants. Although the types option, list and tree are variants and polymorphic, they aren't polymorphic variants; they are type-parametrized variants. Among the functional programming community, the word “polymorphism” is used loosely, whenever anything can be applied to various types. We stick to this usage and say the variants in this section are polymorphic. OCaml polymorphic variants are covered in another tutorial.

Records

Records are like tuples: several values are bundled together. In a tuple, components are identified by their position in the corresponding product type. They are either first, second, third or at some other position. In a record, each component has a name. That's why record types must be declared before being used.

For instance, here is the definition of a record type meant to partially represent a Dungeons & Dragons character.

# type character = {
  name : string;
  level : int;
  race : string;
  class_type : character_class;
  alignment : character_alignment;
  armor_class : int;
};;
type character = {
  name : string;
  level : int;
  race : string;
  class_type : character_class;
  alignment : character_alignment;
  armor_class : int;
}

This uses the types character_class and character_alignment defined earlier. Values of type character carry the same data as inhabitants of this product: string * int * string * character_class * character_alignment * int.

Access to the fields is done using the dot notation. Here is an example:

# let ghorghor_bey = {
    name = "Ghôrghôr Bey";
    level = 17;
    race = "half-ogre";
    class_type = Fighter;
    alignment = Chaotic_neutral;
    armor_class = -8;
  };;
val ghorghor_bey : character =
  {name = "Ghôrghôr Bey"; level = 17; race = "half-ogre";
   class_type = Fighter; alignment = Chaotic_neutral; armor_class = -8}

# ghorghor_bey.alignment;;
- : character_alignment = Chaotic_neutral

# ghorghor_bey.class_type;;
- : character_class = Fighter

# ghorghor_bey.level;;
- : int = 17
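Two other common ways to work with records are sketched below on a reduced, hypothetical version of the type: pattern matching on fields, and the { h with ... } expression, which builds a copy with some fields changed. Neither is covered in the text above; they are standard OCaml syntax.

```ocaml
(* Hypothetical, reduced record type for the example. *)
type hero = {
  name : string;
  level : int;
}

let ghorghor = { name = "Ghorghor Bey"; level = 17 }

(* Record patterns bind fields to variables of the same name. *)
let describe { name; level } = name ^ ", level " ^ string_of_int level

(* Builds a new record; the original is left unchanged. *)
let level_up h = { h with level = h.level + 1 }

let () = print_endline (describe (level_up ghorghor))
```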

To some extent, records also are variants, with a single constructor carrying all the fields as a tuple. Here is how to alternatively define the character record as a variant.

# type character' = Character of string * int * string * character_class * character_alignment * int;;

# let name (Character (name, _, _, _, _, _)) = name;;
val name : character' -> string = <fun>

# let level (Character (_, level, _, _, _, _)) = level;;
val level : character' -> int = <fun>

# let race (Character (_, _, race, _, _, _)) = race;;
val race : character' -> string = <fun>

# let class_type (Character (_, _, _, class_type, _, _)) = class_type;;
val class_type : character' -> character_class = <fun>

# let alignment (Character (_, _, _, _, alignment, _)) = alignment;;
val alignment : character' -> character_alignment = <fun>

# let armor_class (Character (_, _, _, _, _, armor_class)) = armor_class;;
val armor_class : character' -> int = <fun>

One function is defined for each field, to get the data it contains. This provides the same functionality as the dotted notation.

# let ghorghor_bey' = Character ("Ghôrghôr Bey", 17, "half-ogre", Fighter, Chaotic_neutral, -8);;
val ghorghor_bey' : character' =
  Character ("Ghôrghôr Bey", 17, "half-ogre", Fighter, Chaotic_neutral, -8)
  
# level ghorghor_bey';;
- : int = 17

Writing level ghorghor_bey' is the same as ghorghor_bey.level.

Remarks

To be true to facts, it is not possible to encode all records as variants, since OCaml provides a means to define fields whose value can be updated, which isn't available when defining variant types. This is detailed in the tutorial on imperative programming.
Records SHOULD NOT be defined using this technique. It is only demonstrated here to further illustrate the expressive strength of OCaml variants.
This way of defining records MAY be applied to Generalized Algebraic Data Types, which are the subject of another tutorial.

Type Aliases

Just like values, any type can be given a name.

# type latitude_longitude = float * float;;
type latitude_longitude = float * float

This is mostly useful as a means of documentation or as a means to shorten long type expressions.
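An alias does not create a new type: the alias and the expression it names are interchangeable. In the following sketch, the names paris, latitude and equator_point are made up and the coordinates are approximate.

```ocaml
type latitude_longitude = float * float

(* The annotation documents the intent of the pair. *)
let paris : latitude_longitude = (48.8566, 2.3522)

let latitude ((lat, _) : latitude_longitude) = lat

(* A plain float * float is accepted too: the alias is the same type. *)
let equator_point = latitude (0.0, 10.0)
```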

Conclusion

This tutorial has provided a comprehensive overview of the basic data types in OCaml and their usage. We have explored the built-in types, such as integers, floats, characters, lists, tuples and strings, and user-defined types: records and variant types. Records and tuples are mechanisms for grouping heterogeneous data into cohesive units. Variants are a mechanism for exposing heterogeneous data as coherent alternatives.

From the data point of view, records and tuples are like conjunction (logical “and”), while variants are like disjunction (logical “or”). This analogy goes very deep, with records and tuples on one side as products and variants on the other side as unions. These are true mathematical operations on data types. Records and tuples play the role of multiplication, which is why they are called product types. Variants play the role of addition. Putting it all together, basic OCaml types are said to be algebraic.

Next: Advanced Data Types

Going further, there are several advanced topics related to data types in OCaml that you can explore to deepen your understanding and enhance your programming skills.

The Algebra of Types
Mutually Recursive Variants
Polymorphic Variants
Extensible Variants
Generalised Algebraic Data Types