tshark displays Hebrew (and other non-ISO 8859/1) characters in SMB file names as question marks

asked 2018-06-11 07:26:00 +0000

This post is a wiki. Anyone with karma >750 is welcome to improve it.

Hi all,

I'm running tshark on ubuntu 16.04 with support for all locales. However, when a field contains hebrew characters for instance, all I see is \u009d characters or question marks instead. Can tshark display these characters ?

Thanks, B

edit retag flag offensive close merge delete

Comments

What Wireshark version are we talking about here? And what field does this concern?

Jaap ( 2018-06-11 11:46:19 +0000 )edit

TShark (Wireshark) 2.6.1 (v2.6.1). fields such as smb.file I enabled all locales and configured tshark with --localedir=/usr/share/locale. thanks

Bill DePing ( 2018-06-11 12:15:11 +0000 )edit

Some more questions: Of these SMB packets (with smb.file), in the SMB header, does Flags2 indicate Unicode strings?

Jaap ( 2018-06-11 12:49:37 +0000 )edit

I see 0x0000c843 or 0x00008001 in smb.flags2 field and in both cases, smb.file is question marks

thanks

Bill DePing ( 2018-06-11 13:09:29 +0000 )edit

Interestingly, when tshark is reading a file I see question marks or the field value as hebrew. When tshark is listening on a NIC I see question marks or unicode escaped chars. both NIC and pcap have the same traffic and again, the escaped unicode chars are translated into gibberish instead of hebrew. Thanks

Bill DePing ( 2018-06-11 13:19:49 +0000 )edit

add a comment

2 Answers

Sort by » oldest newest most voted

answered 2018-06-12 01:21:50 +0000

Guy Harris
19905 ●3 ●672 ●207

The get_unicode_or_ascii_string() routine, used in several places in the SMB and SMB2/SMB3 dissectors and in some other places, doesn't just fetch UTF-16 when fetching a Unicode string - it turns everything that doesn't look like a character from ISO 8859-1 into a question mark.

That may be fine for Western European languages, but it's completely broken for everything else.

That routine probably long antedates Wireshark's ability to handle UTF-16 in its full glory, and needs to be fixed or just replaced.

So, yes, please file a bug, with a sample capture; we definitely need to fix or replace get_unicode_or_ascii_string() and, having done that, we may also need to change the display format for the fields mentioned in the other answer.

edit flag offensive delete link

add a comment

answered 2018-06-11 15:32:08 +0000

JeffMorriss

6396 ●30 ●78

updated 2018-06-11 15:48:53 +0000

grahamb

flag of United Kingdom of Great Britain and Northern Ireland

23850 ●4 ●989 ●227 https://www.wireshark.org

smb.file is an FT_STRING with BASE_NONE. Based on that Wireshark thinks the contents of those fields should not be Unicode. I don't know much about SMB but I'd think that it can be Unicode.

Changing smb.file and smb2.filename to use STR_UNICODE would probably fix the issue.

I'd suggest opening a bug report; please be sure to include a sample capture.

edit flag offensive delete link

Comments

SMB version 1 supports both local-code-page and UTF-16-encoded (formerly UCS-2-encoded) Unicode strings. We currently treat the local-code-page strings as ASCII; supporting local code pages would require 1) a preference to indicate which local code page is being used and 2) support for that local code page in the "translate a string in a packet to UTF-8" code. We should currently support the UTF-16-encoded Unicode strings, although we may not display them correctly with BASE_NONE.

The 0x8000 bit in flags2 is the "strings are Unicode" (UTF-16-encoded), so the strings are Unicode in those packets.

Almost all strings in SMB2/SMB3 are Unicode.

Guy Harris ( 2018-06-12 01:04:04 +0000 )edit

add a comment