.\" $OpenBSD: mbrtoc16.3,v 1.1 2023/08/20 15:02:51 schwarze Exp $
.\"
.\" Copyright 2023 Ingo Schwarze <schwarze@openbsd.org>
.\" Copyright 2010 Stefan Sperling <stsp@openbsd.org>
.\"
.\" Permission to use, copy, modify, and distribute this software for any
.\" purpose with or without fee is hereby granted, provided that the above
.\" copyright notice and this permission notice appear in all copies.
.\"
.\" THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES
.\" WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF
.\" MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR
.\" ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
.\" WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN
.\" ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF
.\" OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
.\"
.Dd $Mdocdate: August 20 2023 $
.Dt MBRTOC16 3
.Os
.Sh NAME
.Nm mbrtoc16
.Nd convert one UTF-8 encoded character to UTF-16
.Sh SYNOPSIS
.In uchar.h
.Ft size_t
.Fo mbrtoc16
.Fa "char16_t * restrict pc16"
.Fa "const char * restrict s"
.Fa "size_t n"
.Fa "mbstate_t * restrict mbs"
.Fc
.Sh DESCRIPTION
The
.Fn mbrtoc16
function examines at most
.Fa n
bytes of the multibyte character string pointed to by
.Fa s ,
converts those bytes to a wide character,
and encodes the wide character using UTF-16.
In some cases, it is necessary to call this function
twice to convert a single character.
.Pp
Conversion happens in accordance with the conversion state
.Pf * Fa mbs ,
which must be initialized to zero before the application's first call to
.Fn mbrtoc16 .
For this function,
.Pf * Fa mbs
stores information about both the state of the UTF-8 input encoding
and the state of the UTF-16 output encoding.
If the previous call did not return
.Po Vt size_t Pc Ns \-1 ,
.Fa mbs
can safely be reused without reinitialization.
.Pp
The input encoding that
.Fn mbrtoc16
uses for
.Fa s
is determined by the
.Dv LC_CTYPE
category of the current locale.
If the locale is changed without reinitialization of
.Pf * Fa mbs ,
the behaviour is undefined.
.Pp
Unlike
.Xr mbtowc 3 ,
.Fn mbrtoc16
accepts an incomplete byte sequence pointed to by
.Fa s
which does not form a complete character but is potentially part of
a valid character.
In this case, the function consumes all such bytes.
The conversion state saved in
.Pf * Fa mbs
will be used to restart the suspended conversion during the next call.
.Pp
On systems other than
.Ox
that support state-dependent encodings,
.Fa s
may point to a special sequence of bytes called a
.Dq shift sequence ;
see
.Xr mbrtowc 3
for details.
.Pp
The following arguments cause special processing:
.Bl -tag -width 012345678901
.It Fa pc16 No == Dv NULL
The conversion from a multibyte character to a wide character is performed
and the conversion state may be affected, but the resulting wide character
is discarded.
.It Fa s No == Dv NULL
The arguments
.Fa pc16
and
.Fa n
are ignored and starting or continuing the conversion with an empty string
is attempted, discarding the conversion result.
.It Fa mbs No == Dv NULL
An internal
.Vt mbstate_t
object specific to the
.Fn mbrtoc16
function is used instead of the
.Fa mbs
argument.
This internal object is automatically initialized at program startup
and never changed by any
.Em libc
function except
.Fn mbrtoc16 .
.Pp
If
.Fn mbrtoc16
is called with a
.Dv NULL
.Fa mbs
argument and that call returns
.Po Vt size_t Pc Ns \-1 ,
the internal conversion state of
.Fn mbrtoc16
becomes permanently undefined and there is no way
to reset it to any defined state.
Consequently, after such a mishap, it is not safe to call
.Fn mbrtoc16
with a
.Dv NULL
.Fa mbs
argument ever again until the program is terminated.
.El
.Sh RETURN VALUES
.Bl -tag -width 012345678901
.It 0
The bytes pointed to by
.Fa s
form a terminating NUL character.
If
.Fa pc16
is not
.Dv NULL ,
a NUL wide character has been stored in
.Pf * Fa pc16 .
.It positive
.Fa s
points to a valid character, and the value returned is the number of
bytes completing the character.
If
.Fa pc16
is not
.Dv NULL ,
the first UTF-16 code unit of the corresponding wide character
has been stored in
.Pf * Fa pc16 .
If it is a UTF-16 high surrogate, the function needs to be called
again to retrieve a second UTF-16 code unit, the low surrogate.
On
.Ox ,
this happens if and only if the return value is 4,
but this equivalence does not hold on other operating systems
that support input encodings other than UTF-8.
.It Po Vt size_t Pc Ns \-1
.Fa s
points to an illegal byte sequence which does not form a valid multibyte
character in the current locale, or
.Fa mbs
points to an invalid or uninitialized object.
.Va errno
is set to
.Er EILSEQ
or
.Er EINVAL ,
respectively.
The conversion state object pointed to by
.Fa mbs
is left in an undefined state and must be reinitialized before being
used again.
.It Po Vt size_t Pc Ns \-2
.Fa s
points to an incomplete byte sequence of length
.Fa n
which has been consumed and contains part of a valid multibyte character.
The character may be completed by calling the same function again with
.Fa s
pointing to one or more subsequent bytes of the multibyte character and
.Fa mbs
pointing to the conversion state object used during conversion of the
incomplete byte sequence.
.It Po Vt size_t Pc Ns \-3
The second 16-bit code unit resulting from a previous call
has been stored into
.Pf * Fa pc16 ,
without consuming any additional bytes from
.Fa s .
.El
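.Sh EXAMPLES
The following is a minimal sketch rather than a definitive implementation.
It converts a NUL-terminated multibyte string to UTF-16 by calling
.Fn mbrtoc16
in a loop and handling each of the return values described above.
The helper name
.Fn mb_to_utf16
and its error conventions, including the use of
.Er ERANGE
for a full output buffer, are hypothetical and chosen for illustration only.
.Bd -literal -offset indent
```c
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <uchar.h>

/*
 * Hypothetical helper, not part of libc: convert the NUL-terminated
 * multibyte string src to UTF-16, storing at most outlen code units,
 * including the terminating NUL, in out.  Returns the number of code
 * units stored, not counting the NUL, or (size_t)-1 with errno set.
 */
size_t
mb_to_utf16(char16_t *out, size_t outlen, const char *src)
{
	mbstate_t mbs;
	size_t len, nout, rv;

	memset(&mbs, 0, sizeof(mbs));	/* initial conversion state */
	len = strlen(src) + 1;		/* include the terminating NUL */
	nout = 0;

	while (nout < outlen) {
		rv = mbrtoc16(&out[nout], src, len, &mbs);
		if (rv == 0)		/* NUL has been stored: done */
			return nout;
		if (rv == (size_t)-1)	/* errno is EILSEQ or EINVAL */
			return rv;
		if (rv == (size_t)-2) {
			/* Not expected because the NUL is included. */
			errno = EILSEQ;
			return (size_t)-1;
		}
		nout++;			/* one code unit has been stored */
		if (rv != (size_t)-3) {	/* -3 consumes no input bytes */
			src += rv;
			len -= rv;
		}
	}
	errno = ERANGE;			/* assumed convention: out is full */
	return (size_t)-1;
}
```
.Ed
.Pp
Because a return value of
.Po Vt size_t Pc Ns \-3
consumes no input, the loop advances
.Fa src
only for positive return values.
A caller would typically select a UTF-8 locale with
.Xr setlocale 3
before using such a helper on non-ASCII input.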
.Sh ERRORS
.Fn mbrtoc16
causes an error in the following cases:
.Bl -tag -width Er
.It Bq Er EILSEQ
.Fa s
points to an invalid multibyte character.
.It Bq Er EINVAL
.Fa mbs
points to an invalid or uninitialized
.Vt mbstate_t
object.
.El
.Sh SEE ALSO
.Xr c16rtomb 3 ,
.Xr mbrtowc 3 ,
.Xr setlocale 3
.Sh STANDARDS
.Fn mbrtoc16
conforms to
.St -isoC-2011 .
.Sh HISTORY
.Fn mbrtoc16
has been available since
.Ox 7.4 .
.Sh CAVEATS
On operating systems other than
.Ox
that support input encodings other than UTF-8, inspecting the return value
is insufficient to tell whether the function needs to be called again.
If the return value is positive, inspecting
.Pf * Fa pc16
is also required to make that decision.
Consequently, passing a
.Dv NULL
pointer for the
.Fa pc16
argument is discouraged because it can result
in a well-defined but unknown output encoding state.
The simplest way to recover from such an unknown state is to
reinitialize the object pointed to by
.Fa mbs .
.Pp
The C11 standard only requires the code units stored in
.Pf * Fa pc16
to be encoded according to UTF-16
if the predefined environment macro
.Dv __STDC_UTF_16__
is defined with a value of 1.
On
.Ox ,
.In uchar.h
provides this definition.
Other operating systems which do not define
.Dv __STDC_UTF_16__
could theoretically use a different,
implementation-defined output encoding for
.Fa pc16
instead of UTF-16.
Writing portable code for an arbitrary output encoding is impossible
because the rules for when and how often the function needs to be called
again depend on the output encoding; the rules explained above are
specific to UTF-16.
Using UTF-16 as the output encoding of
.Fn mbrtoc16
becomes mandatory in C23.