Data Size Neutrality

A Draft White Paper from the X/Open Base Working Group.

Version 1. Last update Jan 12 1997.

Abstract
This white paper gives a brief outline of the changes made to achieve data size neutrality in the Single UNIX Specification.
This paper is part of a series of brief papers describing new and changed features in the Eastwood specification.

Data Size Neutrality for the Single UNIX Specification

When the UNIX system was first created in 1969, it was developed to run on a 16-bit computer architecture. The integer arithmetic operators and pointer operators used 16-bit quantities. The C language not only supported 16-bit integer and pointer data types but it also supported a 32-bit integer data type that could be emulated on hardware that does not support 32-bit arithmetic operations.

When the 32-bit minicomputer was introduced in the late 1970s, the UNIX system was ported to this new class of machine. The predominant hardware architectures of the 1980s provided 32-bit integer arithmetic operators and 32-bit pointers. The C language for this class of machines developed a data model with a 16-bit short integer type, a 32-bit integer type and a 32-bit pointer. During the 1980s, this was the predominant data model available with UNIX systems that executed on a 32-bit computer architecture.

Although it is easy to assume that this was the dominant data model of the 1980s, this assumption can be dispelled by reviewing other contemporary operating systems. For instance, the Disk Operating System (DOS) started the decade with 16-bit integers and pointers, but it later introduced 32-bit pointers while retaining the 16-bit integer. Long integers were emulated in software as 32-bit quantities on early versions of this architecture.

To describe these two systems in the jargon of today, the UNIX system on 32-bit processors uses an ILP-32 data model because the Integer, Long and Pointer data types are all 32 bits in size. DOS on the Intel architecture started out with an IP-16 data model (Integer and Pointer data types both 16 bits) and later transitioned to an LP-32 data model (Integer 16 bits, Long and Pointer data types 32 bits in size).

The standardization of the UNIX system began with a /usr/group committee in 1983. By 1988, the IEEE POSIX committee and the X/Open "consortium" had developed detailed specifications that were based upon the predominant implementations of the time. These committees were striving to develop architecture neutral definitions that could be implemented on any hardware architecture. Since the standards were based on existing practice and the ILP model did not change during this gestation period, a dependency upon all conforming implementations using the ILP model was inadvertently incorporated into these standards. There were implementations available both as ILP-32 and as ILP-64 by the end of the decade.

Standardization of the ISO C language left the definition of the short integer, the integer, the long integer and the pointer vague to avoid artificially constraining hardware architectures that might benefit from defining these data types independently of one another. It is possible, for instance, to define a short as 16 bits, an integer as 32 bits, a long as 64 bits and a pointer as 128 bits.

The transition from 16-bit to 32-bit hardware architectures happened quite rapidly just before UNIX standardization began. Issues of compatibility and migration were left to the system vendors to solve without the need to consider standards. It was believed that the introduction of 64-bit systems would naturally follow the ILP data model. However, this simplistic view overlooks optimizations that can be obtained by choosing a different model, such as IP-64, LP-64 or IL-64. This paper will not describe the advantages of these data models. For that, the reader is referred to the LP-64 Data Model Paper.

When it was understood that POSIX and the Single UNIX Specification were constraining system implementations that used a model other than ILP, the documents in question were reviewed and recommendations drafted to make these specifications architecture neutral. These recommendations have been incorporated into the Single UNIX Specification, Version 2.

Changes for Data Size Neutrality

The following is a summary of the changes for data size neutrality that have been incorporated into the Single UNIX Specification, Version 2. The changes are identified with respect to the CAE Specifications which make up the Version 1 specification.

System Interfaces and Headers

The changes to the System Interfaces and Headers can be characterized in two common ways:

1. Use of the type int for return values, arguments and structure members

Several interfaces using the type int for return values, arguments or structure members will not be able to represent 64-bit values correctly on architectures implementing an LP64 data model, where the Integer data type is 32 bits and the Long and Pointer data types are both 64 bits in size. This limitation has been taken into account in the next version of the Single UNIX Specification. Where alternate interfaces are available which do not have this limitation, the interfaces are marked Legacy and the alternate interfaces noted in the Application Usage section. Where no alternative interface is available, types have been changed in a data model neutral manner to overcome this limitation.

2. size_t versus ssize_t

Several functions have a parameter declared to be size_t, where the parameter specifies the length of an object to manipulate, and return the portion of that length actually processed as a type ssize_t. The type ssize_t is required so that a negative return value can be used to indicate an error. However, in these routines it is possible for the value to be returned to exceed the range of the type ssize_t (since size_t has a larger range of positive values than ssize_t).

Some routines, such as mq_receive(), msgrcv(), read(), strfmon() and write(), resolve this conflict by restricting the object size in the description section. For example, the description section for the read() routine states:

"If the value of nbyte is greater than {SSIZE_MAX}, the result is implementation-dependent."

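One way an application might stay within the defined behaviour is to cap each request at {SSIZE_MAX} before calling read(). The following is a minimal sketch only; the helper names clamp_to_ssize_max() and read_clamped() are hypothetical and are not part of the specification.

```c
#define _XOPEN_SOURCE 700   /* ask for POSIX/XSI declarations such as SSIZE_MAX */
#include <limits.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: cap a byte count at SSIZE_MAX so that the
 * number of bytes transferred always fits in the ssize_t return value. */
size_t clamp_to_ssize_max(size_t nbyte)
{
    return nbyte > (size_t)SSIZE_MAX ? (size_t)SSIZE_MAX : nbyte;
}

/* Hypothetical wrapper: read() with the request capped as above.
 * Returns the number of bytes read, 0 at end of file, or -1 on error. */
ssize_t read_clamped(int fd, void *buf, size_t nbyte)
{
    return read(fd, buf, clamp_to_ssize_max(nbyte));
}
```

A caller that needs more than {SSIZE_MAX} bytes would call read_clamped() in a loop, advancing the buffer by the value returned each time.
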
Changes to System Interfaces and Headers

The following are the detailed changes made for data size neutrality:

getdtablesize

The getdtablesize() interface returns the size of the file descriptor table. This is equivalent to getrlimit() with the RLIMIT_NOFILE option. However, getrlimit() reports its value as type rlim_t, whereas getdtablesize() returns an int and so may have problems representing appropriate values in the future. A note about this has been added to Application Usage, and the interface is marked Legacy, with the recommendation that applications use the getrlimit() interface instead.
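
The recommended replacement can be sketched as follows. The helper name max_open_files() is hypothetical; it simply wraps getrlimit() and hands back the soft limit in its native rlim_t type.

```c
#define _XOPEN_SOURCE 700   /* ask for POSIX/XSI declarations */
#include <sys/resource.h>

/* Hypothetical helper: store the soft limit on open file descriptors
 * in *out, keeping the rlim_t type so that no value is truncated.
 * Returns 0 on success, or -1 if getrlimit() fails (errno is set). */
int max_open_files(rlim_t *out)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_NOFILE, &rl) != 0)
        return -1;
    *out = rl.rlim_cur;     /* may be RLIM_INFINITY */
    return 0;
}
```

Callers should be prepared for the value RLIM_INFINITY, which an int-returning interface cannot express.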

getpagesize

The getpagesize() function returns the current page size. It is equivalent to sysconf(_SC_PAGE_SIZE) and sysconf(_SC_PAGESIZE). This interface, returning an int, may have problems representing appropriate values in the future. Also, the behaviour is not specified for this interface on systems that support variable size pages. On variable page size systems, a page can be extremely large (theoretically, up to the size of memory). This allows very efficient address translations for large segments of memory that have common page attributes. A note about this has been added to Application Usage, and the interface is marked Legacy, with the recommendation that applications use the sysconf() interface instead.
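
As an illustration of the recommended replacement, the sketch below queries the page size through sysconf(), which returns a long. The wrapper name page_size() is hypothetical.

```c
#define _XOPEN_SOURCE 700   /* ask for POSIX/XSI declarations */
#include <unistd.h>

/* Hypothetical wrapper: page size in bytes as a long,
 * or -1 if sysconf() cannot supply the value. */
long page_size(void)
{
    return sysconf(_SC_PAGESIZE);
}
```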

readlink

The readlink() function returns the size of the information that it reads as a type int, but the size of the buffer area is specified by a size_t. This interface is being specified in the IEEE PASC P1003.1a draft standard, which currently also has a type int for the return type. A ballot objection has been filed and the return value may change in a future edition of the Single UNIX Specification to reflect the final P1003.1a standard.

sbrk

The parameter to the sbrk() function is a type int defining the number of bytes by which to change the break value. This interface may not be able to address the full memory range in the future for certain data models. A new type has been introduced to be used in place of the type int. This is the intptr_t type which is an opaque data type equating to a signed integral type large enough to hold any pointer. This new type is one of a new set of types introduced in a new header <inttypes.h> to address the issues of data sizes for specific types.

inttypes.h

The <inttypes.h> header is a new header in the Single UNIX Specification, Version 2 and includes definitions of at least the following types:

int16_t   16-bit signed integral type.
int32_t   32-bit signed integral type.
int64_t   64-bit signed integral type.
uint16_t  16-bit unsigned integral type.
uint32_t  32-bit unsigned integral type.
uint64_t  64-bit unsigned integral type.
intptr_t  Signed integral type large enough to hold any pointer
uintptr_t Unsigned integral type large enough to hold any pointer
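
A brief sketch of how these types might be used is given below. The function names are hypothetical, and the PRIu64 format macro is taken from the ISO C99 version of <inttypes.h>; it is not listed in this paper and is assumed to be available.

```c
#include <inttypes.h>
#include <stdio.h>

/* Hypothetical check: a pointer converted to intptr_t and back
 * compares equal to the original. Returns 1 if it does. */
int pointer_round_trips(void *p)
{
    intptr_t as_int = (intptr_t)p;

    return (void *)as_int == p;
}

/* Hypothetical helper: format a uint64_t in decimal without assuming
 * whether it is unsigned long or unsigned long long underneath.
 * Returns the number of characters written, as snprintf() does. */
int format_u64(char *buf, size_t len, uint64_t value)
{
    return snprintf(buf, len, "%" PRIu64, value);
}
```

The point of both helpers is that the code never needs to know which of int, long or pointer happens to be 64 bits wide on a given data model.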


sys/shm.h

The element shm_segsz of struct shmid_ds, which specifies the size of a memory segment, was of type int. This has been changed to type size_t.

sys/stat.h and sys/statvfs.h

Changes have been made to the stat and statvfs structs to remove integer values and replace them with new opaque data types, representing file block counts, file system block counts and file system file counts. This change was submitted by the Large File Summit group; a separate paper detailing their submission is also available.

        stat
                blkcnt_t   st_blocks
        statvfs
                fsblkcnt_t f_blocks, f_bfree, f_bavail
                fsfilcnt_t f_files, f_ffree, f_favail
        types.h
                blkcnt_t      used for file block counts,
                        a signed arithmetic type
                fsblkcnt_t    used for file system block counts,
                        an arithmetic type
                fsfilcnt_t    used for file system file counts,
                        an arithmetic type
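
Because these types are opaque, portable code cannot assume a printf() conversion for them. One common approach, sketched below with a hypothetical function name, is to cast each value to a wide standard type before printing. The choice of long long and unsigned long long is an assumption of this sketch, not something the specification mandates.

```c
#define _XOPEN_SOURCE 700   /* ask for POSIX/XSI declarations */
#include <stdio.h>
#include <sys/stat.h>
#include <sys/statvfs.h>

/* Hypothetical helper: print the block and file counts for path.
 * Returns 0 on success, or -1 if stat() or statvfs() fails. */
int print_block_counts(const char *path)
{
    struct stat st;
    struct statvfs vfs;

    if (stat(path, &st) != 0 || statvfs(path, &vfs) != 0)
        return -1;

    /* Cast the opaque types to wide standard types for printing. */
    printf("st_blocks=%lld f_blocks=%llu f_files=%llu\n",
           (long long)st.st_blocks,
           (unsigned long long)vfs.f_blocks,
           (unsigned long long)vfs.f_files);
    return 0;
}
```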



sys/time.h

The tv_usec element of the timeval struct was of type long. This has been changed to use a new opaque data type for signed integral time values, known as suseconds_t. The suseconds_t type is added to <sys/types.h> in the Single UNIX Specification, Version 2.
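
Code that does arithmetic on a timeval can avoid depending on the width of either member by converting explicitly, as in this sketch. The function name timeval_to_usec() is hypothetical, and the use of long long for the result is an assumption of the example.

```c
#define _XOPEN_SOURCE 700   /* ask for POSIX/XSI declarations */
#include <sys/time.h>

/* Hypothetical helper: total microseconds held in a timeval.
 * Both members are widened to long long before any arithmetic,
 * so the result does not depend on the width of time_t or suseconds_t. */
long long timeval_to_usec(const struct timeval *tv)
{
    return (long long)tv->tv_sec * 1000000LL + (long long)tv->tv_usec;
}
```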

msgrcv

In Issue 4 Version 2, msgrcv() returns the size of the message received as an integer value, but the size of the message area is specified by a size_t. On 64-bit systems, where size_t may be a different data type from int, this will cause problems. Issue 5 addresses this problem by changing the type of the return value from int to ssize_t, and adding a warning to the DESCRIPTION about values of msgsz larger than {SSIZE_MAX} (see below).

sysconf and unistd.h

New in the Single UNIX Specification, Version 2, is a way to find out the data model supported by the system. This can be queried at compile time, using the constants defined in <unistd.h>, or at runtime using the sysconf() function.

The following symbolic constants are defined to have the value -1 if the implementation will never provide the feature, and to have a value other than -1 if the implementation always provides the feature. If these are undefined, the sysconf( ) function can be used to determine whether the feature is provided for a particular invocation of the application.

_XBS5_ILP32_OFF32 Implementation provides a C-language compilation environment with 32-bit int, long, pointer and off_t types.

_XBS5_ILP32_OFFBIG Implementation provides a C-language compilation environment with 32-bit int, long and pointer types and an off_t type using at least 64 bits.

_XBS5_LP64_OFF64 Implementation provides a C-language compilation environment with 32-bit int and 64-bit long, pointer and off_t types.

_XBS5_LPBIG_OFFBIG Implementation provides a C-language compilation environment with an int type using at least 32 bits and long, pointer and off_t types using at least 64 bits.
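
The compile-time and runtime checks described above might be combined as in the following sketch. The function name is hypothetical, and the sysconf() name _SC_XBS5_LP64_OFF64 is assumed here rather than taken from this paper; the fallback branch covers systems that define neither name.

```c
#define _XOPEN_SOURCE 700   /* ask for POSIX/XSI declarations */
#include <unistd.h>

/* Hypothetical helper: returns 1 if the implementation reports support
 * for the LP64_OFF64 compilation environment, and 0 otherwise. */
int lp64_off64_supported(void)
{
#if defined(_XBS5_LP64_OFF64)
    /* Compile-time answer: -1 means never provided,
     * any other value means always provided. */
    return _XBS5_LP64_OFF64 != -1;
#elif defined(_SC_XBS5_LP64_OFF64)
    /* Constant undefined: ask at runtime; -1 means not provided. */
    return sysconf(_SC_XBS5_LP64_OFF64) != -1;
#else
    /* Neither name is known to this system. */
    return 0;
#endif
}
```

Note that support for an environment says nothing about which environment the current translation unit was compiled in; that depends on the flags passed to the compiler.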

System Interface Definitions

Chapter 10, Utility Conventions (page 130), describes the argument syntax of the standard utilities and introduces the terminology used throughout the Single UNIX Specification for the arguments that utilities process. It is updated so that the maximum value of a numerical argument is allowed to be greater than a 32-bit value, thus permitting support of 64-bit values.

The following text is added to System Interface Definitions Issue 4 Version 2, page 130, point 6, as a fourth bullet item.

"Ranges greater than those listed here are allowed."

Commands and Utilities

A new section of text is added to the end of the first paragraph of Section 1.9, Utility Description Defaults. This aligns with requirements in ISO-POSIX.2, and restates that integer variables and constants used by utilities are permitted to be 64-bit values:

"Integer variables and constants, including the values of operands and option-arguments, used by the utilities listed in this specification shall be implemented as equivalent to the ISO C standard signed long data type. Conversion between types shall be as described in the ISO C standard. The evaluation of arithmetic expressions shall be equivalent to that described in Section 6.3 of the ISO C standard."

Limitations with the existing archive format capacities are noted. Whilst tar and cpio formats are able to support file sizes up to 8 gigabytes, the pax format is not able to handle arbitrary file sizes greater than two gigabytes. There is currently a proposal in ballot in the IEEE PASC Shell and Utilities working group to address this shortcoming.

Application Usage text is added to the cpio and tar manual pages noting the 8 gigabyte limit to supported file sizes. The Future Directions section of the pax manual page is updated to note the possible future change to the pax format to accommodate larger file sizes.

C89, getconf: Programming Environments

The c89 manual page has some new text describing programming environments. All implementations must support one of the defined programming environments by default. Applications are able to use the sysconf() function or the getconf utility to determine which programming environments the implementation supports.

Programming Environment   Bits in   Bits in   Bits in   Bits in
getconf Name              int       long      pointer   off_t
XBS5_ILP32_OFF32          32        32        32        32
XBS5_ILP32_OFFBIG         32        32        32        ≥64
XBS5_LP64_OFF64           32        64        64        64
XBS5_LPBIG_OFFBIG         ≥32       ≥64       ≥64       ≥64

The c89 manual page also has text describing new support in getconf and sysconf to determine configuration strings for C compiler flags, linker/loader flags and libraries for each supported environment. When an application needs to use a specific programming environment rather than the implementation default programming environment while compiling, the application must first verify that the implementation supports the desired environment. If the desired programming environment is supported, the application must then invoke c89 with the appropriate C compiler flags as the first options for the compile, the appropriate linker/loader flags after any other options but before any operands, and the appropriate libraries at the end of the operands.

Programming Environment   Use                   c89 and cc Arguments
getconf Name
XBS5_ILP32_OFF32          C Compiler Flags      XBS5_ILP32_OFF32_CFLAGS
                          Linker/Loader Flags   XBS5_ILP32_OFF32_LDFLAGS
                          Libraries             XBS5_ILP32_OFF32_LIBS
XBS5_ILP32_OFFBIG         C Compiler Flags      XBS5_ILP32_OFFBIG_CFLAGS
                          Linker/Loader Flags   XBS5_ILP32_OFFBIG_LDFLAGS
                          Libraries             XBS5_ILP32_OFFBIG_LIBS
XBS5_LP64_OFF64           C Compiler Flags      XBS5_LP64_OFF64_CFLAGS
                          Linker/Loader Flags   XBS5_LP64_OFF64_LDFLAGS
                          Libraries             XBS5_LP64_OFF64_LIBS
XBS5_LPBIG_OFFBIG         C Compiler Flags      XBS5_LPBIG_OFFBIG_CFLAGS
                          Linker/Loader Flags   XBS5_LPBIG_OFFBIG_LDFLAGS
                          Libraries             XBS5_LPBIG_OFFBIG_LIBS
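
From within a C program, string-valued configuration items of this kind are normally fetched with confstr(), since sysconf() returns only numeric values. The sketch below assumes that confstr() and names of the form _CS_XBS5_LP64_OFF64_CFLAGS are available; neither is named in this paper, and the helper confstr_dup() is hypothetical.

```c
#define _XOPEN_SOURCE 700   /* ask for POSIX/XSI declarations */
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical helper: return a newly allocated copy of the
 * configuration string for name, for example
 * _CS_XBS5_LP64_OFF64_CFLAGS where the system defines it.
 * Returns NULL if the name has no value or on error.
 * The caller releases the result with free(). */
char *confstr_dup(int name)
{
    /* With a null buffer, confstr() reports the size required,
     * including the terminating null byte, or 0 if there is no value. */
    size_t len = confstr(name, NULL, 0);
    char *buf;

    if (len == 0)
        return NULL;
    buf = malloc(len);
    if (buf != NULL && confstr(name, buf, len) == 0) {
        free(buf);
        return NULL;
    }
    return buf;
}
```

An empty string is a valid result and simply means that no extra flags are needed for that environment.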

Read or download the complete Single UNIX Specification from http://www.UNIX-systems.org/go/unix.

Copyright © 1997-1998 The Open Group

UNIX is a registered trademark of The Open Group.