This is expected: the vlan expression adjusts subsequent offsets. From the man page you mentioned:

Note that the first vlan keyword encountered in expression changes the decoding offsets for the remainder of expression on the assumption that the packet is a VLAN packet.
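
To see why the offsets have to move, here is a rough sketch in Python (not how libpcap itself is implemented) that builds an untagged and an 802.1Q-tagged Ethernet frame by hand. The helper names `build_frame` and `network_offset`, the MAC addresses, and the VLAN ID are made up for illustration. The 4-byte tag pushes the real EtherType and everything after it back by 4 bytes, so anything that decodes at the untagged offsets reads the wrong bytes.

```python
import struct

ETH_HEADER_LEN = 14  # dst MAC (6) + src MAC (6) + EtherType (2)
VLAN_TAG_LEN = 4     # TPID 0x8100 (2) + TCI (2)


def build_frame(vlan_id=None):
    """Build a minimal Ethernet frame carrying a dummy 20-byte IPv4 header."""
    dst = b"\x00\x11\x22\x33\x44\x55"
    src = b"\x66\x77\x88\x99\xaa\xbb"
    ip_header = bytes([0x45]) + bytes(19)  # version 4, IHL 5, rest zeroed
    ethertype = struct.pack("!H", 0x0800)  # IPv4
    if vlan_id is None:
        return dst + src + ethertype + ip_header
    # 802.1Q tag sits between the source MAC and the real EtherType.
    tag = struct.pack("!HH", 0x8100, vlan_id & 0x0FFF)
    return dst + src + tag + ethertype + ip_header


def network_offset(frame):
    """Return the offset of the network-layer header, allowing for one VLAN tag."""
    (tpid,) = struct.unpack_from("!H", frame, 12)
    if tpid == 0x8100:
        return ETH_HEADER_LEN + VLAN_TAG_LEN
    return ETH_HEADER_LEN


untagged = build_frame()
tagged = build_frame(vlan_id=100)

# Untagged: the IPv4 header starts at byte 14.
print(network_offset(untagged), hex(untagged[14]))
# Tagged: byte 14 is part of the TCI; the IPv4 header starts at byte 18.
print(network_offset(tagged), hex(tagged[14]), hex(tagged[18]))
```

This is the shift the man page is describing: once a vlan keyword appears, the rest of the expression is decoded as if every packet carried that extra tag.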