Data Types & Character Sets

Data Types in Win32 API

Windows does not rely heavily on the standard C/C++ data types. Instead, it uses a collection of typedef'd data types made available through the windows.h header file. A selection of these is listed below.

BOOL – A Boolean value, either TRUE (1) or FALSE (0). It is declared as typedef int BOOL.
BYTE – The same as unsigned char. It is declared as typedef unsigned char BYTE.
DWORD – 32-bit unsigned integer.
INT – 32-bit signed integer. It is declared as typedef int INT.
LONG – A 32-bit signed integer.
UINT – 32-bit unsigned integer. It is declared as typedef unsigned int UINT.
HANDLE – An opaque, pointer-sized value used to identify a resource (32 bits on 32-bit Windows, 64 bits on 64-bit Windows).
HBITMAP – Handle to a bitmap.
HBRUSH – Handle to a brush.
HCURSOR – Handle to a cursor.
HDC – A device context handle.
HFONT – Handle to a font.
HINSTANCE – Handle to the application instance.
HMENU – Handle to a menu.
HPEN – Handle to a pen.
HWND – Handle to a window.
LPCSTR – Pointer to a constant null-terminated string of 8-bit Windows (ANSI) characters.
LPCWSTR – Pointer to a constant null-terminated string of 16-bit Unicode characters.
LPCTSTR – An LPCWSTR if UNICODE is defined, an LPCSTR otherwise.
LPSTR – Pointer to a null-terminated string of 8-bit Windows (ANSI) characters.
LPWSTR – Pointer to a null-terminated string of 16-bit Unicode characters.
LPTSTR – An LPWSTR if UNICODE is defined, an LPSTR otherwise.
TCHAR – A WCHAR if UNICODE is specified, a CHAR otherwise.
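LPARAM – A message parameter; a signed, pointer-sized integer (LONG_PTR).
LRESULT – The signed result of message processing, returned by the window procedure; a pointer-sized integer (LONG_PTR).
WPARAM – A message parameter; an unsigned, pointer-sized integer (UINT_PTR).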
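Some of the declarations quoted above can be checked at compile time. The following is a minimal sketch, assuming a build where windows.h is not available: the typedefs in the `sketch` namespace are stand-ins that restate the declarations from the list, not the real headers, and `HandleBits` is a hypothetical helper. On Windows, similar checks could be written against the real types.

```cpp
#include <climits>
#include <cstddef>
#include <cstdint>

// Stand-ins for a few Windows typedefs (assumption: these restate the
// declarations listed above; the real ones come from windows.h).
namespace sketch {
typedef int           BOOL;    // TRUE (1) or FALSE (0)
typedef unsigned char BYTE;
typedef std::uint32_t DWORD;   // windows.h uses unsigned long, which is 32 bits on Windows
typedef unsigned int  UINT;
typedef void*         HANDLE;  // opaque, pointer-sized
}

// Compile-time checks of the sizes described in the list.
static_assert(sizeof(sketch::BYTE) == 1, "BYTE occupies one byte");
static_assert(sizeof(sketch::DWORD) * CHAR_BIT == 32, "DWORD is 32 bits");
static_assert(sizeof(sketch::HANDLE) == sizeof(void*), "HANDLE is pointer-sized");

// Hypothetical helper: width of a HANDLE in bits on the current platform.
inline std::size_t HandleBits() {
    return sizeof(sketch::HANDLE) * CHAR_BIT;
}
```

Calling `HandleBits()` in a 64-bit build would be expected to give 64, which is why a handle should not be stored in a 32-bit integer.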

For a full list of Windows data types, see:
https://docs.microsoft.com/en-us/windows/win32/winprog/windows-data-types

Identifier Constants

Every Windows program features a large number of identifiers. These are constants used to represent numerical values. They are typically written in uppercase and consist of a two- or three-letter prefix denoting the general category, followed by an underscore and the constant name. A selection of prefixes and example constants is listed below.

Prefix	Description	Example
CS	Class style	CS_HREDRAW | CS_VREDRAW
CW	Create window	CW_USEDEFAULT
DT	Draw text	DT_CENTER DT_LEFT DT_RIGHT
IDI	Icon identifier	IDI_ASTERISK IDI_ERROR IDI_HAND
IDC	Cursor identifier	IDC_ARROW IDC_HAND
MB	Message box options	MB_HELP MB_OK MB_OKCANCEL
SND	Sound option	SND_ASYNC SND_NODEFAULT
WM	Window message	WM_NULL WM_CREATE WM_DESTROY
WS	Window style	WS_OVERLAPPED WS_SYSMENU WS_BORDER
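Many of these constants are bit flags that are combined with the bitwise OR operator and tested with bitwise AND. The sketch below illustrates the idea without windows.h; the two values are restated from the Windows headers as an assumption for illustration, and `MakeRedrawStyle` and `HasStyle` are hypothetical helper names.

```cpp
#include <cstdint>

namespace sketch {
// Assumption: values restated for illustration; real code would take
// them from windows.h rather than redefining them.
constexpr std::uint32_t CS_VREDRAW = 0x0001;
constexpr std::uint32_t CS_HREDRAW = 0x0002;
}

// Combine two class-style flags into one value with bitwise OR.
constexpr std::uint32_t MakeRedrawStyle() {
    return sketch::CS_HREDRAW | sketch::CS_VREDRAW;
}

// Test whether every bit of 'flag' is set in 'style'.
constexpr bool HasStyle(std::uint32_t style, std::uint32_t flag) {
    return (style & flag) == flag;
}
```

Because each flag occupies its own bit, the combined value keeps both settings and either one can be tested independently.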

Naming conventions

Microsoft's Windows code follows a naming convention known as Hungarian notation. A variable name starts with a short lowercase prefix that indicates its data type, followed by a descriptive name that begins with a capital letter, for example dwCount for a DWORD or hwndMain for a window handle. Function names start with a capital letter and carry no type prefix. For further reading on Microsoft coding style conventions, see:

https://docs.microsoft.com/en-us/windows/win32/stg/coding-style-conventions
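As a rough illustration of the convention, the sketch below names the members of a hypothetical structure with common Hungarian prefixes. The `sketch` typedefs are stand-ins for the Windows types, and `WindowInfo` and `MakeWindowInfo` are invented for this example; they are not part of the Windows API.

```cpp
#include <cstdint>

// Stand-ins for Windows typedefs (assumption: windows.h is not included).
namespace sketch {
typedef int           BOOL;
typedef std::uint32_t DWORD;
typedef const char*   LPCSTR;
}

// Hypothetical structure; each member name carries a type prefix.
struct WindowInfo {
    sketch::DWORD  dwStyle;   // dw  = DWORD
    sketch::BOOL   bVisible;  // b   = BOOL
    int            nWidth;    // n   = integer count or size
    sketch::LPCSTR pszTitle;  // psz = pointer to a zero-terminated string
};

// Function names start with a capital letter and carry no type prefix.
inline WindowInfo MakeWindowInfo(sketch::LPCSTR pszTitle, int nWidth) {
    WindowInfo wi;
    wi.dwStyle  = 0;
    wi.bVisible = 1;
    wi.nWidth   = nWidth;
    wi.pszTitle = pszTitle;
    return wi;
}
```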

Character sets

Text and numbers are encoded in a computer as patterns of binary digits known as character codes. For computers to communicate there must be an agreed standard that defines which character code is used for which character. A complete collection of characters is a character set. Two common character sets are ASCII and Unicode.

ASCII

ASCII is a character encoding that can represent 128 characters. Each character needs only 7 bits, so when a character is stored in an 8-bit byte the most significant bit is always 0. The code set comprises 95 printable characters and 33 non-printable control characters.
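The figures above can be verified with a few lines of code. This is a small sketch with hypothetical helper names: it treats a byte as ASCII when its top bit is clear, and counts the printable range 0x20 to 0x7E.

```cpp
// True when the byte fits in 7 bits, i.e. the most significant bit is 0.
constexpr bool IsAscii(unsigned char c) {
    return (c & 0x80) == 0;
}

// True for the printable ASCII range, space (0x20) through tilde (0x7E).
constexpr bool IsPrintableAscii(unsigned char c) {
    return c >= 0x20 && c <= 0x7E;
}

// Counts the printable characters among the 128 ASCII codes.
inline int CountPrintableAscii() {
    int count = 0;
    for (int c = 0; c < 128; ++c) {
        if (IsPrintableAscii(static_cast<unsigned char>(c))) {
            ++count;
        }
    }
    return count;
}
```

The remaining codes, 0x00 to 0x1F plus 0x7F, are the control characters.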

Extended ASCII

Although the 128 characters supported by standard ASCII are enough to represent all the standard English characters, they cannot represent the special characters used in other languages. Extended ASCII encodings use eight bits per character instead of seven. Although this doubles the number of available characters to 256, it is still nowhere near enough to support all languages, so other forms of character encoding such as Unicode are now commonly used.

UNICODE

The Unicode Standard is a universal character-encoding standard that can represent text in any combination of languages by assigning a unique number, known as a code point, to every character and symbol. A Unicode transformation format (UTF) is an algorithmic mapping of every Unicode code point to a unique byte sequence. The two most common encodings of the Unicode standard are UTF-8 and UTF-16.

UTF-8 – A character in UTF-8 is 1 to 4 bytes long. The first 128 Unicode code points are encoded exactly as in ASCII, which makes UTF-8 backward compatible with ASCII. This backward compatibility is useful for older API functions. UTF-8 is the preferred encoding for e-mail and web pages.

UTF-16 – A variable-length encoding in which each character takes one or two 16-bit code units (2 or 4 bytes). UTF-16 is not backward compatible with ASCII. In Windows, strings are either ANSI or UTF-16LE (little-endian).
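To make the two encodings concrete, the following sketch encodes a single code point as UTF-8 and as UTF-16. The function names are hypothetical, and the code assumes its input is a valid Unicode scalar value; it does no other validation.

```cpp
#include <cassert>
#include <string>

// Encodes one code point as UTF-8 (1 to 4 bytes).
// Assumes cp is at most 0x10FFFF and not a surrogate.
inline std::string EncodeUtf8(char32_t cp) {
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    std::string out;
    if (cp < 0x80) {                       // 1 byte, identical to ASCII
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 4 bytes
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Encodes one code point as UTF-16 (one or two 16-bit code units).
inline std::u16string EncodeUtf16(char32_t cp) {
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    std::u16string out;
    if (cp < 0x10000) {                    // single code unit
        out += static_cast<char16_t>(cp);
    } else {                               // surrogate pair
        const char32_t v = cp - 0x10000;
        out += static_cast<char16_t>(0xD800 + (v >> 10));
        out += static_cast<char16_t>(0xDC00 + (v & 0x3FF));
    }
    return out;
}
```

For example, the letter A (U+0041) takes one byte in UTF-8 and one code unit in UTF-16, whereas U+1F600 takes four bytes in UTF-8 and a surrogate pair in UTF-16.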

Unicode in the Windows API

Unicode has been standard in Windows since Windows NT. Windows API functions that take or return a string are generally provided in three forms: an ANSI version (suffix “A”), a wide version (suffix “W”) that handles Unicode, and a generic function name. The generic name is resolved at compile time to one of the other two by appending the single-character suffix to the root function name. For instance, the generic name CreateWindowEx resolves to CreateWindowExW when UNICODE is defined and to CreateWindowExA otherwise.
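The resolution mechanism is an ordinary preprocessor macro. The sketch below imitates it with a hypothetical function pair, `TextLengthA` and `TextLengthW`, which are not Windows API functions; only the pattern of selecting a suffix based on UNICODE reflects how the Windows headers are described above.

```cpp
#include <cstddef>
#include <cstring>
#include <cwchar>

// Hypothetical "A" version: operates on 8-bit character strings.
inline std::size_t TextLengthA(const char* text) {
    return std::strlen(text);
}

// Hypothetical "W" version: operates on wide character strings.
inline std::size_t TextLengthW(const wchar_t* text) {
    return std::wcslen(text);
}

// The generic name maps to one of the two versions at compile time.
#ifdef UNICODE
#define TextLength TextLengthW
#else
#define TextLength TextLengthA
#endif
```

Code that calls `TextLength` then compiles against whichever version matches the build settings, while the suffixed names remain available for explicit use.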

Working with Strings

C++ has several built-in character types, including char, wchar_t, char16_t, and char32_t. C and C++ introduced the fixed-size character types char16_t and char32_t in 2011 (C11 and C++11) to deal with the UTF-16 and UTF-32 formats. Since the width of wchar_t is compiler-specific (16 bits with Microsoft's compiler, typically 32 bits elsewhere), any program that needs to be portable across compilers should avoid using wchar_t for storing Unicode text.

A string literal takes the prefix L, u, or U to indicate a wchar_t, char16_t, or char32_t character string respectively; an unprefixed literal is a char string.

const char *ascii_example = "This is an ASCII string.";
const wchar_t *Unicode_example = L"This is a wide char string.";
const char16_t *char16_example = u"This is a char16_t Unicode string.";
const char32_t *char32_example = U"This is a char32_t Unicode string.";
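The practical difference between the prefixes shows up with characters outside the Basic Multilingual Plane. The sketch below, using hypothetical helper names, counts the code units each literal type needs for U+1F600; the wchar_t result depends on the compiler, which is the portability concern mentioned above.

```cpp
#include <cstddef>
#include <string>

// char16_t literal: UTF-16, so U+1F600 needs a surrogate pair.
inline std::size_t UnitsAsChar16() {
    return std::u16string(u"\U0001F600").size();
}

// char32_t literal: UTF-32, so every code point is one unit.
inline std::size_t UnitsAsChar32() {
    return std::u32string(U"\U0001F600").size();
}

// wchar_t literal: the count depends on the width of wchar_t
// for the compiler in use.
inline std::size_t UnitsAsWchar() {
    return std::wstring(L"\U0001F600").size();
}
```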

TCHAR and the TEXT Macro

To make applications portable between Unicode and non-Unicode systems, Microsoft introduced the TCHAR type. When a developer needs to support both Unicode and earlier non-Unicode operating systems, TCHAR allows the same code to be compiled for either environment by mapping characters to Unicode (WCHAR) or ANSI (CHAR). To complement TCHAR, the TEXT() macro, or _T() from tchar.h, defines a string literal as Unicode or ANSI to match. For example:

const TCHAR *autostring = TEXT("This message can be either ASCII or UNICODE!");
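The way such a type and macro pair can work is sketched below with stand-in names. `SketchTChar`, `SKETCH_TEXT`, and `SketchLength` are hypothetical and only imitate the pattern; the real TCHAR and TEXT definitions live in the Windows headers.

```cpp
#include <cstddef>

// Stand-ins that switch on UNICODE the way TCHAR and TEXT() are
// described above (assumption: simplified for illustration).
#ifdef UNICODE
typedef wchar_t SketchTChar;
#define SKETCH_TEXT(quote) L##quote   // paste the L prefix onto the literal
#else
typedef char SketchTChar;
#define SKETCH_TEXT(quote) quote      // leave the literal unchanged
#endif

// Counts characters up to the terminating null, for either character type.
inline std::size_t SketchLength(const SketchTChar* text) {
    std::size_t n = 0;
    while (text[n] != 0) {
        ++n;
    }
    return n;
}
```

The same call, `SketchLength(SKETCH_TEXT("abc"))`, compiles in both configurations because the literal and the parameter type change together.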

For more detailed reading on character encoding, see:
https://docs.microsoft.com/en-us/windows/win32/learnwin32/working-with-strings

Published: 25 October 2020; Created: 25 October 2020; Last Updated: 23 March 2024
