1

Suppose I have

uint32_t a(3084);

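I want to create a string that stores the Unicode character U+3084, which means I should take the value of a and use it as the coordinate of the right character in the UTF-8 table/charset.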

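Now, obviously std::to_string() doesn't work for me. The standard has plenty of functions for converting between numeric values and characters, but I can't find one that supports UTF-8 and outputs a std::string.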

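I'd like to ask whether I have to write this function from scratch, or whether there is something in the C++11 standard that can help me. Please note that my compiler (gcc/g++ 4.8.1) does not provide support for codecvt.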

4 Answers

9

Here is some C++ code that would not be hard to convert to C. Adapted from an older answer.

std::string UnicodeToUTF8(unsigned int codepoint)
{
    std::string out;

    if (codepoint <= 0x7f)
        out.append(1, static_cast<char>(codepoint));
    else if (codepoint <= 0x7ff)
    {
        out.append(1, static_cast<char>(0xc0 | ((codepoint >> 6) & 0x1f)));
        out.append(1, static_cast<char>(0x80 | (codepoint & 0x3f)));
    }
    else if (codepoint <= 0xffff)
    {
        out.append(1, static_cast<char>(0xe0 | ((codepoint >> 12) & 0x0f)));
        out.append(1, static_cast<char>(0x80 | ((codepoint >> 6) & 0x3f)));
        out.append(1, static_cast<char>(0x80 | (codepoint & 0x3f)));
    }
    else
    {
        out.append(1, static_cast<char>(0xf0 | ((codepoint >> 18) & 0x07)));
        out.append(1, static_cast<char>(0x80 | ((codepoint >> 12) & 0x3f)));
        out.append(1, static_cast<char>(0x80 | ((codepoint >> 6) & 0x3f)));
        out.append(1, static_cast<char>(0x80 | (codepoint & 0x3f)));
    }
    return out;
}
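The function above encodes whatever value it is given, including surrogates (U+D800 to U+DFFF) and values above U+10FFFF, neither of which is valid in UTF-8. A minimal sketch of a stricter variant is shown here. The name encode_utf8_checked and the choice to throw std::range_error are assumptions made for illustration, not part of the original answer.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical stricter variant of the encoder above: it refuses
// anything that is not a Unicode scalar value before encoding.
std::string encode_utf8_checked(char32_t cp)
{
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        throw std::range_error("not a Unicode scalar value");

    std::string out;
    if (cp <= 0x7F) {                       // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp <= 0x7FF) {               // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0xFFFF) {              // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                                // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

Note that 3084 in the question is a decimal literal, i.e. U+0C0C, which encodes to the bytes e0 b0 8c. U+3084 would be written 0x3084 and encodes to e3 82 84.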
answered 2013-11-14T03:30:29.517
7

std::wstring_convert::to_bytes has a single-character overload that does exactly what you need.

#include <cstdint>   // uint32_t
#include <iostream>
#include <string>
#include <locale>
#include <codecvt>
#include <iomanip>

// utility function for output
void hex_print(const std::string& s)
{
    std::cout << std::hex << std::setfill('0');
    for(unsigned char c : s)
        std::cout << std::setw(2) << static_cast<int>(c) << ' ';
    std::cout << std::dec << '\n';
}

int main()
{
    uint32_t a(3084);

    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv1;
    std::string u8str = conv1.to_bytes(a);
    std::cout << "UTF-8 conversion produced " << u8str.size() << " bytes:\n";
    hex_print(u8str);
}

I get (using libc++)

$ ./test
UTF-8 conversion produced 3 bytes:
e0 b0 8c 
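The same converter also accepts whole strings and can decode in the other direction. The sketch below assumes a standard library that ships <codecvt>, which the question says g++ 4.8.1 does not. The helper names utf32_to_utf8 and utf8_to_utf32 are made up for this example. Keep in mind that std::wstring_convert and std::codecvt_utf8 were deprecated in C++17, so newer compilers may warn about them.

```cpp
#include <codecvt>
#include <locale>
#include <string>

// Hypothetical helper: encode a whole UTF-32 string as UTF-8.
std::string utf32_to_utf8(const std::u32string& text)
{
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
    return conv.to_bytes(text);
}

// Hypothetical helper: decode UTF-8 bytes back to UTF-32.
// from_bytes throws std::range_error on malformed input.
std::u32string utf8_to_utf32(const std::string& bytes)
{
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
    return conv.from_bytes(bytes);
}
```

Calling utf32_to_utf8(std::u32string(1, char32_t(0x3084))) should give the three bytes e3 82 84.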
answered 2013-11-14T03:38:23.333
1

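The C++ standard contains the std::codecvt<char32_t, char, mbstate_t> facet, which converts between UTF-32 and UTF-8 according to 22.4.1.4 [locale.codecvt] paragraph 3. Sadly, the std::codecvt<...> facets are not easy to use. At some point there was discussion of filtering stream buffers that would take care of the code conversion (the standard C++ library needs to implement it anyway for std::basic_filebuf<...>), but I can't see any trace of these.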
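To give an idea of what using the facet directly looks like, here is a rough sketch that encodes a single code point. It assumes the library actually provides this specialization in the classic locale (the standard requires it, but older libraries may not implement it). The function name facet_to_utf8 is hypothetical, and the specialization itself was deprecated in C++20.

```cpp
#include <cwchar>     // std::mbstate_t
#include <locale>
#include <stdexcept>
#include <string>

// Hypothetical helper: encode one code point through the codecvt facet.
std::string facet_to_utf8(char32_t cp)
{
    typedef std::codecvt<char32_t, char, std::mbstate_t> facet_type;
    const facet_type& cvt = std::use_facet<facet_type>(std::locale::classic());

    std::mbstate_t state = std::mbstate_t();   // initial conversion state
    const char32_t* from_next = nullptr;       // set by out(): first unconverted input
    char buf[4];                               // UTF-8 needs at most 4 bytes per code point
    char* to_next = nullptr;                   // set by out(): one past the last byte written

    std::codecvt_base::result r =
        cvt.out(state, &cp, &cp + 1, from_next, buf, buf + sizeof buf, to_next);
    if (r != std::codecvt_base::ok)
        throw std::range_error("code point could not be converted to UTF-8");

    return std::string(buf, to_next);
}
```

The amount of bookkeeping (state object, two "next" pointers, result code) is what makes the facets awkward compared with a hand-written encoder.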

answered 2013-11-14T03:21:58.597
0
auto s = u8"\343\202\204"; // Octal escaped representation of HIRAGANA LETTER YA
std::cout << s << std::endl;

prints

や

for me (using g++ 4.8.1). s has type const char*, as you'd expect, but I don't know if this is implementation-defined. Unfortunately, C++ doesn't have any support for manipulating UTF-8 strings as far as I know; for that you need a library such as Glib::ustring.
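When the code point is known at compile time, a universal character name avoids working out the octal bytes by hand. The sketch below is only an illustration, and the function name hiragana_ya is made up. In C++11 through C++17 a u8 literal is an array of const char. C++20 changed it to const char8_t, which is why the example goes through a reinterpret_cast so that it builds under either rule.

```cpp
#include <string>

// Hypothetical helper: returns U+3084 (HIRAGANA LETTER YA) as UTF-8 bytes.
std::string hiragana_ya()
{
    // u8"\u3084" asks the compiler to encode the code point as UTF-8.
    // The cast is a no-op before C++20 and converts from const char8_t* in C++20.
    return std::string(reinterpret_cast<const char*>(u8"\u3084"));
}
```

This only helps for literals. A value that is computed at run time, like a in the question, still needs a conversion function.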

answered 2013-11-14T03:19:15.500