tshark displays Hebrew (and other non-ISO 8859/1) characters in SMB file names as question marks

asked 2018-06-11 07:26:00 +0000

Hi all,

I'm running tshark on Ubuntu 16.04 with support for all locales. However, when a field contains Hebrew characters, for instance, all I see is \u009d characters or question marks instead. Can tshark display these characters?

Thanks, B

Comments

What Wireshark version are we talking about here? And what field does this concern?

Jaap ( 2018-06-11 11:46:19 +0000 )

TShark (Wireshark) 2.6.1 (v2.6.1). It concerns fields such as smb.file. I enabled all locales and configured tshark with --localedir=/usr/share/locale. Thanks

Bill DePing ( 2018-06-11 12:15:11 +0000 )

One more question: for these SMB packets (the ones with smb.file), does Flags2 in the SMB header indicate Unicode strings?

Jaap ( 2018-06-11 12:49:37 +0000 )

I see 0x0000c843 or 0x00008001 in the smb.flags2 field, and in both cases smb.file is shown as question marks.

thanks

Bill DePing ( 2018-06-11 13:09:29 +0000 )

Interestingly, when tshark is reading a file I see either question marks or the field value in Hebrew. When tshark is listening on a NIC I see question marks or Unicode-escaped characters. Both the NIC and the pcap carry the same traffic and, again, the escaped Unicode characters are translated into gibberish instead of Hebrew. Thanks

Bill DePing ( 2018-06-11 13:19:49 +0000 )

2 Answers

answered 2018-06-12 01:21:50 +0000

Guy Harris

The get_unicode_or_ascii_string() routine, used in several places in the SMB and SMB2/SMB3 dissectors and in some other places, doesn't just fetch UTF-16 when fetching a Unicode string - it turns everything that doesn't look like a character from ISO 8859-1 into a question mark.

That may be fine for Western European languages, but it's completely broken for everything else.
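To make the effect concrete, here is a minimal Python sketch that emulates the behavior described above. It is not the actual C routine from the dissector; the function name `lossy_latin1_from_utf16le` and the exact cutoff at 0xFF are assumptions chosen for illustration.

```python
def lossy_latin1_from_utf16le(raw: bytes) -> str:
    """Emulation only (hypothetical helper): decode UTF-16LE, then replace
    every character outside the ISO 8859-1 range (> 0xFF) with '?'."""
    text = raw.decode("utf-16-le")
    return "".join(ch if ord(ch) <= 0xFF else "?" for ch in text)


# Hebrew "shalom" (shin, lamed, vav, final mem) followed by ".txt"
hebrew_name = "\u05e9\u05dc\u05d5\u05dd.txt"
# A Western European name that fits entirely in ISO 8859-1
french_name = "caf\u00e9.txt"

lossy_hebrew = lossy_latin1_from_utf16le(hebrew_name.encode("utf-16-le"))
lossy_french = lossy_latin1_from_utf16le(french_name.encode("utf-16-le"))
full_hebrew = hebrew_name.encode("utf-16-le").decode("utf-16-le")

print(lossy_hebrew)  # Hebrew letters are lost
print(lossy_french)  # survives unchanged
print(full_hebrew)   # a full UTF-16 decode keeps the Hebrew
```

The French name survives because every character is at or below U+00FF, while each Hebrew letter is replaced, which matches the question marks reported in the question.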

That routine probably long antedates Wireshark's ability to handle UTF-16 in its full glory, and needs to be fixed or just replaced.

So, yes, please file a bug, with a sample capture; we definitely need to fix or replace get_unicode_or_ascii_string() and, having done that, we may also need to change the display format for the fields mentioned in the other answer.

answered 2018-06-11 15:32:08 +0000

JeffMorriss

updated 2018-06-11 15:48:53 +0000

grahamb

smb.file is an FT_STRING with BASE_NONE. Based on that, Wireshark assumes the contents of those fields are not Unicode. I don't know much about SMB, but I'd think the field can contain Unicode.

Changing smb.file and smb2.filename to use STR_UNICODE would probably fix the issue.

I'd suggest opening a bug report; please be sure to include a sample capture.

Comments

SMB version 1 supports both local-code-page and UTF-16-encoded (formerly UCS-2-encoded) Unicode strings. We currently treat the local-code-page strings as ASCII; supporting local code pages would require 1) a preference to indicate which local code page is being used and 2) support for that local code page in the "translate a string in a packet to UTF-8" code. We should currently support the UTF-16-encoded Unicode strings, although we may not display them correctly with BASE_NONE.
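The two requirements listed above can be sketched roughly as follows in Python. This is only an illustration of the idea, not Wireshark code: the helper name `smb1_string_to_utf8`, the `code_page` parameter standing in for a preference, and the choice of `cp1255` (Windows Hebrew) as its default are all assumptions.

```python
def smb1_string_to_utf8(raw: bytes, is_unicode: bool,
                        code_page: str = "cp1255") -> bytes:
    """Hypothetical helper: translate an SMB1 string from a packet to UTF-8.

    is_unicode -- True when the packet says its strings are UTF-16 (LE)
    code_page  -- stand-in for a user preference naming the local code page
    """
    if is_unicode:
        text = raw.decode("utf-16-le", errors="replace")
    else:
        text = raw.decode(code_page, errors="replace")
    return text.encode("utf-8")


name = "\u05e9\u05dc\u05d5\u05dd.txt"  # Hebrew "shalom" + ".txt"

from_utf16 = smb1_string_to_utf8(name.encode("utf-16-le"), is_unicode=True)
from_cp1255 = smb1_string_to_utf8(name.encode("cp1255"), is_unicode=False)

# Treating the code-page bytes as plain ASCII loses the Hebrew letters
as_ascii = name.encode("cp1255").decode("ascii", errors="replace")
```

Both paths yield the same UTF-8 bytes once the right decoder is chosen, while the ASCII treatment turns each Hebrew letter into a replacement character.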

The 0x8000 bit in flags2 is the "strings are Unicode" (UTF-16-encoded) flag, so the strings are Unicode in those packets.
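As a quick check of the two flags2 values reported earlier in the thread, a small Python sketch (the constant and function names are made up for illustration):

```python
SMB_FLAGS2_UNICODE = 0x8000  # the "strings are Unicode" bit in flags2


def strings_are_unicode(flags2: int) -> bool:
    """Return True when the Unicode-strings bit is set in flags2."""
    return bool(flags2 & SMB_FLAGS2_UNICODE)


# Values quoted in the question's comments
for value in (0x0000C843, 0x00008001):
    print(f"{value:#010x}: unicode={strings_are_unicode(value)}")
```

Both values have the bit set, so the strings in those packets are UTF-16-encoded.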

Almost all strings in SMB2/SMB3 are Unicode.

Guy Harris ( 2018-06-12 01:04:04 +0000 )

Stats

Asked: 2018-06-11 07:26:00 +0000

Seen: 948 times

Last updated: Jun 12 '18