In the TCP Follow Stream window, can support for CP1252 encoding be added?

answered 2019-10-01 22:38:35 +0000

Guy Harris

If you mean "interpret the characters in the byte stream as CP1252 characters", so that bytes with the 8th bit set are interpreted as being in CP1252, that could probably be added. File an enhancement request on the Wireshark Bugzilla.

Wireshark will, however, continue to use UTF-8 internally, with Qt using, I think, UTF-16 inside its strings. It's not ever going to be using CP1252 internally. If you want to use your font, it'd better be a font that Qt can use.
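
As a rough illustration of what "interpret the bytes as CP1252" would involve, here is a minimal sketch in plain C++. It is not Wireshark or Qt code; the names cp1252ToCodePoint and decodeCp1252 are hypothetical, and mapping the five bytes that CP1252 leaves undefined to U+FFFD is an assumption of this sketch. Only bytes 0x80 through 0x9F need a table; every other byte has the same numeric value as its Unicode code point.

```cpp
#include <string>

// Unicode code points for CP1252 bytes 0x80..0x9F.
// 0x81, 0x8D, 0x8F, 0x90 and 0x9D are undefined in CP1252; this sketch
// maps them to U+FFFD (REPLACEMENT CHARACTER) as an assumption.
static const char32_t kCp1252High[32] = {
    0x20AC, 0xFFFD, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0xFFFD, 0x017D, 0xFFFD,
    0xFFFD, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0xFFFD, 0x017E, 0x0178,
};

// Map one CP1252 byte to a Unicode code point.
char32_t cp1252ToCodePoint(unsigned char byte)
{
    if (byte >= 0x80 && byte <= 0x9F)
        return kCp1252High[byte - 0x80];
    return byte; // 0x00..0x7F and 0xA0..0xFF have the same value in Unicode
}

// Decode a whole byte stream, one code point per input byte.
std::u32string decodeCp1252(const std::string &bytes)
{
    std::u32string out;
    out.reserve(bytes.size());
    for (unsigned char b : bytes)
        out.push_back(cp1252ToCodePoint(b));
    return out;
}
```

For example, the byte 0x80 decodes to U+20AC (the euro sign) and 0xE9 decodes to U+00E9; the resulting code points could then be encoded as UTF-8 or handed to a UTF-16 string type.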

Comments

if you followed the link, you will see how Qt is the main example in my font preview!

Qt supports CP1252 just fine. It's all about an encoding API call.

Dernyn ( 2019-10-16 21:17:43 +0000 )

As it's apparently so simple, please submit your change as detailed on the Wiki page "Submitting patches".

grahamb ( 2019-10-17 10:08:43 +0000 )

if you followed the link,

I DID follow the link. The only thing it shows about Qt is a display inside Qt Creator. None of that indicates what you're asking for.

Your text first speaks of non-Unicode encodings:

This Font works best with character Region English - Western/Latin1 encoding providing single-byte character encoding with CP1252/Windows-1252/IBM819/CP819/iso-ir-100/csISO-Latin1/ibm-5348 and ISO-8859-1 depending on Text editor support.
This Font does not support UTF-8 encoding, Inherently Fails to fully encode in UTF-8 due to it's native lack of single-byte character support by the standard, which is limited to ASCII-7 (128).

but then says it's a Unicode-encoded font:

It is an ISO 10646-1(Unicode,BMP) encoded True-Type Font(TTF)

If it's a Unicode-encoded font, so that any software that handles Unicode can use it to display, at least, the subset of Unicode that includes the characters in ...

Guy Harris ( 2019-10-17 16:26:30 +0000 )

@grahamb thanks for all your help. It seems I have angered you with my comments. I did open the enhancement request as indicated. I may just make the changes in the code myself and provide a patch, but I was hoping to avoid that since it's a new feature. I don't get the pushback or the shaming for my suggestion that it can't be that hard, as if what I am requesting were impossible to do. It's a native API call in the Qt SDK; it can't be that hard, is all I'm indicating.

Qt Creator is no different from Qt as a GTK+ programming interface, with support for all these different encodings within itself. What I'm hinting at is that if it works in Qt Creator, it works in any Qt API call made by other apps. Why does it feel ...

Dernyn ( 2019-10-17 17:32:42 +0000 )

@Dernyn, no problem at all, but as you seem to be the only person requesting this change, it's unlikely to happen unless you step up to the plate and submit the required change. This isn't being awkward or pushing back, just stating reality.

I didn't see any mention in the question or comments that an item has been raised on Bugzilla.

The Wireshark devs are generally welcoming and grateful for all changes submitted and requests made, but as we are all volunteers (except Gerald) we don't have much spare time for investigating and implementing what seems to be an esoteric request.

grahamb ( 2019-10-17 17:51:56 +0000 )

I did open the enhancement request as indicated.

Presumably that's bug 16137.

Guy Harris ( 2019-10-17 18:00:34 +0000 )

So, just because a font is Unicode capable - meaning it can allocate mappings in the Unicode regions (ISO 10646-1 (Unicode, BMP)) - it does not mean it covers UTF-8. UTF-8 is not the de facto Unicode standard or mapping. CP1252/ISO-8859-1 is covered/supported in the Unicode mappings and that's why my font works the way it does.
if you tried to use my font in a UTF-8 encoding, it will not render correctly because it is not a UTF-8/16 Font, although it is a Unicode Font. Unicode does not just mean UTF-8 support; that's just a subset of the available Unicode mapping.
I also can't have UTF-8 and CP1252/ISO-8859-1 in one font; they are two different subset encodings of Unicode, supporting different sections of the Unicode Map.

Unicode assigns "a unique number for every character, no matter what platform, device, application or language."

UTF-8 and UTF-16 are ...

Guy Harris ( 2019-10-17 18:48:55 +0000 )

This is how Qt Creator does it.

QString utf16LineTextInUtf8Buffer(const QByteArray &utf8Buffer, int currentUtf8Offset)
{
    const int lineStartUtf8Offset = currentUtf8Offset
                                        ? (utf8Buffer.lastIndexOf('\n', currentUtf8Offset - 1) + 1)
                                        : 0;
    const int lineEndUtf8Offset = utf8Buffer.indexOf('\n', currentUtf8Offset);
    return QString::fromUtf8(
        utf8Buffer.mid(lineStartUtf8Offset, lineEndUtf8Offset - lineStartUtf8Offset));
}

static bool isByteOfMultiByteCodePoint(unsigned char byte)
{
    return byte & 0x80; // Check if most significant bit is set
}

bool utf8AdvanceCodePoint(const char *&current)
{
    if (Q_UNLIKELY(*current == '\0'))
        return false;

    // Process multi-byte UTF-8 code point (non-latin1)
    if (Q_UNLIKELY(isByteOfMultiByteCodePoint(*current))) {
        unsigned trailingBytesCurrentCodePoint = 1;
        for (unsigned char c = (*current) << 2; isByteOfMultiByteCodePoint(c); c <<= 1)
            ++trailingBytesCurrentCodePoint;
        current += trailingBytesCurrentCodePoint + 1;

    // Process single-byte UTF-8 code point (latin1)
    } else {
        ++current;
    }

    return true;
}

Dernyn ( 2019-10-17 18:58:25 +0000 )

it's supported by its API using

QTextCodec; QByteArray;

Dernyn ( 2019-10-17 19:03:06 +0000 )

What is "it" in "This is how Qt Creator does it."? I.e., what is that supposed to explain?

utf16LineTextInUtf8Buffer():

takes a UTF-8 buffer and an offset in the buffer;
finds the beginning and end of the line that includes the byte at that offset;
converts that line into a QString, which involves converting it from UTF-8 to UTF-16.

isByteOfMultiByteCodePoint() looks at a byte, presumably taken from a UTF-8 string, and determines whether it's the one and only byte of a character encoded in UTF-8 as one byte or is a byte in a multi-byte UTF-8 encoding (by checking whether the high-order bit is set).

utf8AdvanceCodePoint() takes a pointer into a sequence of chars and advances that pointer to point to the beginning of the next UTF-8 character.
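
To make that last point concrete, here is a minimal standalone sketch in plain C++ (not the Qt Creator code quoted above) that uses the same lead-byte idea to count the characters in a UTF-8 string. The names utf8SequenceLength and countCodePoints are hypothetical, and the code assumes well-formed UTF-8; it does no validation.

```cpp
#include <cstddef>
#include <string>

// Number of bytes in the UTF-8 sequence that starts with `lead`.
// Assumes `lead` really is a lead byte of well-formed UTF-8.
std::size_t utf8SequenceLength(unsigned char lead)
{
    if ((lead & 0x80) == 0x00) return 1; // 0xxxxxxx: single byte (ASCII)
    if ((lead & 0xE0) == 0xC0) return 2; // 110xxxxx: one trailing byte
    if ((lead & 0xF0) == 0xE0) return 3; // 1110xxxx: two trailing bytes
    return 4;                            // 11110xxx: three trailing bytes
}

// Count code points by hopping from lead byte to lead byte.
std::size_t countCodePoints(const std::string &utf8)
{
    std::size_t count = 0;
    for (std::size_t i = 0; i < utf8.size();
         i += utf8SequenceLength(static_cast<unsigned char>(utf8[i])))
        ++count;
    return count;
}
```

Note that nothing here knows about CP1252: a byte with the high bit set is always treated as part of a multi-byte UTF-8 sequence, never as a single CP1252 character.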

Guy Harris ( 2019-10-17 19:14:06 +0000 )

It's supported by its API using QTextCodec; QByteArray;

Yes, that's what I mentioned in the enhancement request.

Guy Harris ( 2019-10-17 19:14:49 +0000 )

You are just re-explaining what I'm saying about UTF-8; 1. it does not cover all the Unicode plane, particularly the BMP -- it does not include Plane 0 or the BMP in its 32 plane regions and therefore I can't implement it. 2. My font doesn't just handle the BMP, it can handle all of the Unicode mapping, UTF-8 cannot!

https://en.wikipedia.org/wiki/Plane_(...

Dernyn ( 2019-10-17 19:16:29 +0000 )

You are just re-explaining what I'm saying about UTF-8;

No, I'm explaining that what you're saying about UTF-8 is incorrect.

it does not cover all the Unicode plane, particularly the BMP

Wrong. Unicode has 1,114,112 code points, with numerical values from 0x000000 to 0x10FFFF (not all values in that range are valid code points). See 2.4 "Code Points and Characters" in chapter 2 of the Unicode 12.0 standard.

UTF-8 can encode EVERY SINGLE ONE of those 1,114,112 as a sequence of 1 to 4 bytes. See D92 on page 125 of chapter 3 of the Unicode 12.0 standard.
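
The 1-to-4-byte encoding can be sketched in a few lines of plain C++. This is an illustration rather than Wireshark or Qt code; the function name encodeUtf8 is made up for the example, and it returns an empty string for surrogate values (U+D800 through U+DFFF) and for values above U+10FFFF, which well-formed UTF-8 does not encode.

```cpp
#include <string>

// Encode one Unicode scalar value as UTF-8.
// Returns an empty string for surrogates and for values above U+10FFFF.
std::string encodeUtf8(char32_t cp)
{
    std::string out;
    if ((cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF)
        return out;

    if (cp < 0x80) {                 // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {         // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {       // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                         // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

With this sketch, U+0041 ("A") comes out as the single byte 0x41, U+20AC as the three bytes E2 82 AC, and U+1F61E as the four bytes F0 9F 98 9E.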

it does not include Plane 0 or the BMP

Plane 0 IS the BMP, and UTF-8 most definitely includes the BMP. It definitely includes the letter "A", for example, which is in the BMP as U+0041 and which is encoded in ...

Guy Harris ( 2019-10-17 20:12:25 +0000 )

UTF-8 can encode EVERY SINGLE ONE of those 1,114,112 as a sequence of 1 to 4 bytes. not the way CP-1252 encodes them.

Dernyn ( 2019-10-17 20:17:41 +0000 )

the font does not have a glyph in U+1F61E currently - it only currently has my specifically defined glyphs, but I can place one in it and it will render under cp1252 or UTF-8

Dernyn ( 2019-10-17 20:19:42 +0000 )

if a font is specifically UTF-8 in TTF, it removes/wont use the C0 and C1 regions when rendering due to the encoding

Dernyn ( 2019-10-17 20:21:55 +0000 )

UTF-8 can encode EVERY SINGLE ONE of those 1,114,112 as a sequence of 1 to 4 bytes. not the way CP-1252 encodes them.

Every single glyph in CP 1252 has a Unicode code point - even the glyphs in the "box drawing" extension for DOS. Every single Unicode code point has a UTF-8 encoding, so every single glyph in CP 1252 can be encoded in UTF-8.

The encoding of all CP 1252 code points other than those in the range 0x20 through 0x7f will be different in CP 1252 than in UTF-8, but that's irrelevant. Wireshark doesn't use CP 1252 internally, and it never will use CP 1252 internally - it uses Unicode internally. The "core dissector" engine uses UTF-8-encoded Unicode, as did the old GTK+ GUI part; the Qt GUI part uses QStrings heavily, so mostly uses UTF-16, but anything it gets from the rest ...

Guy Harris ( 2019-10-17 20:33:52 +0000 )

if a font is specifically UTF-8 in TTF, it removes/wont use the C0 and C1 regions when rendering due to the encoding

Then, as I said in bug 16137:

According to the Wikipedia page on code page 1252:
https://en.wikipedia.org/wiki/Windows...
the use of code points 0x00 through 0x1F for special graphics characters, rather than as control characters, is a "rarely used, but useful, graphics extended code page 1252 where codes 0x00 to 0x1f allow for box drawing as used in applications such as MSDOS Edit and Codeview".
Supporting that would mean mapping those code points to the corresponding Unicode characters for the glyphs used; see
https://en.wikipedia.org/wiki/Windows...
for the mapping in question.
I don't know whether Qt's codec for code page 1252 does that; if not, and if the intent is to treat those code points as graphics rather ...

Guy Harris ( 2019-10-17 20:37:19 +0000 )

the font does not have a glyph in U+1F61E currently - it only currently has my specifically defined glyphs, but I can place one in it and it will render under cp1252 or UTF-8

For the font to be fully usable as a primary font for Wireshark, it will need all of Unicode.

Guy Harris ( 2019-10-17 20:38:19 +0000 )

it's about TTF and the use of Unicode, not just the general knowledge of UTF-8

my font supports Unicode with CP1252 encoding, and can map to anything in Unicode. the problem is with UTF-8 it encodes differently with the OSes and TTF

Dernyn ( 2019-10-17 20:38:56 +0000 )

For the font to be fully usable as a primary font for Wireshark, it will need all of Unicode.

it can still be useful for byte representation, I can use it everywhere else in wireshark, except this window.

I only care about 0x00 through 0x1F within the font, UTF-8 has the first c0 but not c1- as they are represented elsewhere in the font mapping.

Dernyn ( 2019-10-17 20:45:01 +0000 )

my font supports Unicode with CP1252 encoding

There is no such thing as "Unicode with CP 1252 encoding". CP 1252 is a single-byte character set that provides encodings for characters that are also in Unicode, but that's only a small subset of Unicode.

TrueType fonts can support multiple character encodings - that's what the 'cmap' tables are about. The 'cmap' tables are seen only by character drawing code in Qt or GTK+ or the macOS GUI code or the Windows GUI code or.... The Qt and GTK+ and macOS and Windows GUI APIs all use Unicode, either with UTF-16 or UTF-8 encodings, and take Unicode characters and find the appropriate glyph in the font. They would use a Unicode 'cmap' table (one of the ones with a platformID of 0, or one of the other ones that uses a Unicode encoding rather than some other encoding).

and can map ...

Guy Harris ( 2019-10-17 20:55:39 +0000 )

For the font to be fully usable as a primary font for Wireshark, it will need all of Unicode.
it can still be useful for byte representation, I can use it everywhere else in wireshark, except this window.

It can only be useful for byte representation for ASCII and CP 1252.

For EBCDIC, it might not be useful if it lacks glyphs for all the relevant characters in the EBCDIC code page being used (or in the "invariant" part of EBCDIC - I'm not sure which one we're using for "EBCDIC" in that case).

For UTF-8 and UTF-16, it could require much more of Unicode, unless you're OK with a lot of "replacement character" glyphs in the window.

That might apply for other character encodings as well.

I only care about 0x00 through 0x1F within the font, UTF-8 has the first c0 but not c1- as they are ...

Guy Harris ( 2019-10-17 21:05:26 +0000 )

I appreciate the help. Can we work on this via some other chat process?

it's different with Fonts https://fontforge.github.io/encodingm...

Dernyn ( 2019-10-17 21:06:15 +0000 )
