# tshark displays Hebrew (and other non-ISO 8859/1) characters in SMB file names as question marks

Hi all,

I'm running tshark on Ubuntu 16.04 with support for all locales. However, when a field contains Hebrew characters, for instance, all I see is \u009d characters or question marks instead. Can tshark display these characters?

Thanks, B

What Wireshark version are we talking about here? And what field does this concern?

( 2018-06-11 11:46:19 +0000 )

TShark (Wireshark) 2.6.1 (v2.6.1). This concerns fields such as smb.file. I enabled all locales and configured tshark with --localedir=/usr/share/locale. Thanks.

( 2018-06-11 12:15:11 +0000 )

Some more questions: Of these SMB packets (with smb.file), in the SMB header, does Flags2 indicate Unicode strings?

( 2018-06-11 12:49:37 +0000 )

I see 0x0000c843 or 0x00008001 in the smb.flags2 field, and in both cases smb.file is shown as question marks.

Thanks.

( 2018-06-11 13:09:29 +0000 )

Interestingly, when tshark reads from a file, I see either question marks or the field value in Hebrew. When tshark listens on a NIC, I see question marks or Unicode-escaped characters. Both the NIC and the pcap carry the same traffic and, again, the escaped Unicode characters are rendered as gibberish instead of Hebrew. Thanks.

( 2018-06-11 13:19:49 +0000 )

The get_unicode_or_ascii_string() routine, used in several places in the SMB and SMB2/SMB3 dissectors and in some other places, doesn't just fetch UTF-16 when fetching a Unicode string - it turns everything that doesn't look like a character from ISO 8859-1 into a question mark.

That may be fine for Western European languages, but it's completely broken for everything else.
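To make the failure mode concrete, here is a minimal Python sketch of the behaviour described above. It is not the Wireshark C code: the function name `lossy_latin1_only` and the sample file name are made up for illustration, and it assumes the string arrives as little-endian UTF-16, as SMB sends it on the wire.

```python
def lossy_latin1_only(raw: bytes) -> str:
    """Illustrative only: decode UTF-16LE, then replace every code point
    outside the ISO 8859-1 range (above U+00FF) with a question mark."""
    text = raw.decode("utf-16-le")
    return "".join(ch if ord(ch) <= 0xFF else "?" for ch in text)


def full_utf16(raw: bytes) -> str:
    """What a proper UTF-16 decode yields for the same bytes."""
    return raw.decode("utf-16-le")


# Hypothetical file name: four Hebrew letters followed by ".txt".
SAMPLE_NAME = "\u05e9\u05dc\u05d5\u05dd.txt".encode("utf-16-le")
```

With these definitions, `lossy_latin1_only(SAMPLE_NAME)` collapses the four Hebrew letters to `????.txt`, while `full_utf16(SAMPLE_NAME)` returns the original name. An accented Western European name would survive the lossy path, which is probably why the problem went unnoticed for so long.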

That routine probably long antedates Wireshark's ability to handle UTF-16 in its full glory, and needs to be fixed or just replaced.

So, yes, please file a bug, with a sample capture; we definitely need to fix or replace get_unicode_or_ascii_string() and, having done that, we may also need to change the display format for the fields mentioned in the other answer.


smb.file is an FT_STRING with BASE_NONE. Based on that, Wireshark assumes the contents of those fields are not Unicode. I don't know much about SMB, but I'd think that they can be Unicode.

Changing smb.file and smb2.filename to use STR_UNICODE would probably fix the issue.

I'd suggest opening a bug report; please be sure to include a sample capture.


SMB version 1 supports both local-code-page and UTF-16-encoded (formerly UCS-2-encoded) Unicode strings. We currently treat the local-code-page strings as ASCII; supporting local code pages would require 1) a preference to indicate which local code page is being used and 2) support for that local code page in the "translate a string in a packet to UTF-8" code. We should currently support the UTF-16-encoded Unicode strings, although we may not display them correctly with BASE_NONE.
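As a rough illustration of what such a code-page preference would have to do, here is a small Python sketch. It is an assumption-laden model, not Wireshark code: `decode_smb1_string`, its `code_page` parameter and the choice of `cp1255` (the Windows Hebrew code page) in the usage note are all hypothetical.

```python
def decode_smb1_string(raw: bytes, is_unicode: bool, code_page: str = "ascii") -> str:
    """Illustrative only: pick the decoder for an SMB1 string.

    is_unicode -- True when the packet says its strings are UTF-16.
    code_page  -- hypothetical user preference naming the local code page;
                  the default models "treat it as ASCII".
    """
    if is_unicode:
        # SMB carries Unicode strings as little-endian UTF-16.
        return raw.decode("utf-16-le", errors="replace")
    # Bytes that the chosen code page cannot map become U+FFFD.
    return raw.decode(code_page, errors="replace")
```

For example, the bytes produced by `"\u05e9\u05dc\u05d5\u05dd".encode("cp1255")` decode back to the Hebrew word when `code_page="cp1255"` is passed, but every byte turns into a replacement character under the default `"ascii"` setting, because none of them is a valid ASCII byte.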

The 0x8000 bit in flags2 is the "strings are Unicode" (UTF-16-encoded) flag, so the strings are Unicode in those packets.
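A quick way to check this for the two values reported in the question is a bit test. The Python sketch below is only a worked example; the constant and function names are chosen for illustration.

```python
# Bit in the SMB1 Flags2 field that marks strings as UTF-16 ("Unicode").
SMB_FLAGS2_UNICODE = 0x8000  # illustrative constant name


def strings_are_unicode(flags2: int) -> bool:
    """Return True when the Unicode bit is set in a Flags2 value."""
    return bool(flags2 & SMB_FLAGS2_UNICODE)


# The two values quoted earlier in this thread.
REPORTED_FLAGS2 = (0x0000C843, 0x00008001)
```

Both reported values have the bit set, so `strings_are_unicode` returns `True` for each of them.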

Almost all strings in SMB2/SMB3 are Unicode.

( 2018-06-12 01:04:04 +0000 )