USB software sniffers (here: USBPcap) capture URBs, not raw USB packets. A URB is a USB Request Block: it is what the device driver (here: the USB Audio driver) submits to the host controller driver (here: most likely xHCI). Wireshark always shows each URB as two "packets":
- "submit" one always from host to endpoint, regardless of endpoint (i.e. actual data flow) direction
- "complete" one always from endpoint to host, regardless of endpoint (i.e. actual data flow) direction
Expand the "USB URB" in Protocol Tree to find the matching "submit" for the visible "complete" packet.
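If you export the capture to a script, the same submit/complete matching can be sketched as below. This is a minimal sketch: the `(frame, urb_id, kind)` record layout and the `pair_urbs` name are hypothetical, and how you get the records out of Wireshark is up to you. It assumes the id of a URB can be reused once that URB has completed, which is what you would expect when the driver resubmits it.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical record layout: (frame_number, urb_id, kind),
# where kind is either "submit" or "complete".
Record = Tuple[int, str, str]


def pair_urbs(records: List[Record]) -> List[Tuple[int, Optional[int]]]:
    """Return (submit_frame, complete_frame) pairs matched on the URB id.

    A submit that has no complete yet is reported with None.
    """
    pending: Dict[str, int] = {}  # urb_id -> index of its open entry in pairs
    pairs: List[Tuple[int, Optional[int]]] = []
    for frame, urb_id, kind in records:
        if kind == "submit":
            pending[urb_id] = len(pairs)
            pairs.append((frame, None))
        elif kind == "complete" and urb_id in pending:
            idx = pending.pop(urb_id)
            pairs[idx] = (pairs[idx][0], frame)
    return pairs


example = [
    (1, "a", "submit"),
    (2, "b", "submit"),
    (3, "a", "complete"),
    (4, "a", "submit"),   # same id reused after completion
    (5, "b", "complete"),
]
print(pair_urbs(example))
```

The last submit in the example stays unmatched, which is what you would see for a URB that is still pending when the capture ends.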
For streaming to work, the driver really needs to keep multiple URBs in flight; otherwise host OS latencies would be clearly audible. In the screenshot, "packet" 49063 is most likely a resubmission of the URB that just completed, but there are likely at least 2 "IN" URBs involved. You can check how many URBs are involved by looking at usb.irp_id (USBPcap) or usb.urb_id (usbmon) in the "packets" addressed to the endpoint in question.
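Counting the distinct ids per endpoint can be sketched like this. It is only an approximation under stated assumptions: the `(endpoint_address, urb_id)` tuples are a hypothetical export of the usb.irp_id or usb.urb_id values, and the count is only meaningful if the driver keeps reusing the same set of URBs.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def urbs_per_endpoint(records: Iterable[Tuple[int, str]]) -> Dict[int, int]:
    """Count distinct URB ids seen for each endpoint address."""
    ids: Dict[int, Set[str]] = defaultdict(set)
    for endpoint, urb_id in records:
        ids[endpoint].add(urb_id)
    return {endpoint: len(seen) for endpoint, seen in ids.items()}


# Hypothetical export: the isochronous IN endpoint 0x81 cycles through two URBs.
sample = [(0x81, "A"), (0x81, "B"), (0x81, "A"), (0x02, "C")]
print(urbs_per_endpoint(sample))
```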
Queuing multiple URBs alone is not really enough to cope with USB Audio timing. For a Full-Speed device we are talking about an IN DATA packet every 1 ms, and for High-Speed an IN DATA packet every 125 us (= 1/8 ms; not to mention high-bandwidth endpoints, where there can be up to three DATA packets every microframe). In your capture the USB Audio driver is grouping the transfers into chunks of 10 frames. Doing so reduces the per-URB CPU overhead roughly 10 times, because the "top level handling" happens only once for every 10 actual packets. This, combined with multiple URBs, is what makes it possible for the host to cope with the USB Audio timing.
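To put rough numbers on this, here is a small sketch of the arithmetic. The function names are made up for illustration, and the "2 URBs queued" figure is only the guess from above, not something read from your capture.

```python
def urb_completions_per_second(packet_interval_us: float, packets_per_urb: int) -> float:
    """How often the driver's completion handling runs for one endpoint."""
    packets_per_second = 1_000_000 / packet_interval_us
    return packets_per_second / packets_per_urb


def queued_time_ms(urbs_queued: int, packets_per_urb: int, packet_interval_us: float) -> float:
    """How much bus time the queued URBs cover before the host must resubmit."""
    return urbs_queued * packets_per_urb * packet_interval_us / 1000


# Full-Speed, one packet every 1 ms:
print(urb_completions_per_second(1000, 1))    # one packet per URB
print(urb_completions_per_second(1000, 10))   # 10 packets per URB, as in the capture
# High-Speed, one packet every 125 us, 10 packets per URB:
print(urb_completions_per_second(125, 10))
# Two URBs of 10 packets each on Full-Speed:
print(queued_time_ms(2, 10, 1000))
```

With 10 packets per URB, the Full-Speed case drops from 1000 to 100 completions per second, and two such URBs give the host about 20 ms of slack.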
You noted that an individual packet is 96 bytes. This is not quite accurate. What you mean is that there are 96 bytes of payload every bInterval frames. The actual DATA0 packet you would see on the bus if you captured in hardware would be 1 (DATA0 PID) + 96 (actual payload) + 2 (CRC16) bytes long. You don't see the SOFs at all when capturing in software, but each "USB isochronous packet" shown in the screenshot is one bInterval apart from the previous one. If a packet failed (e.g. CRC error), you would see an ISO USBD status other than USBD_STATUS_SUCCESS.
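The size arithmetic above, written out as a tiny sketch. The constant and function names are made up for illustration, and the count covers only the PID, payload and CRC16 bytes, so SYNC and EOP signalling are left out.

```python
PID_BYTES = 1     # the DATA0 packet identifier
CRC16_BYTES = 2   # CRC16 appended to the data field


def data_packet_bytes(payload_len: int) -> int:
    """Length of a DATA packet as a hardware sniffer would typically show it."""
    return PID_BYTES + payload_len + CRC16_BYTES


print(data_packet_bytes(96))
```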
A similar story applies to the bulk endpoint. You see a 16407-byte bulk IN, but that is definitely more than the bulk endpoint wMaxPacketSize (which is 8, 16, 32 or 64 on a Full-Speed device and 512 on a High-Speed device). If you looked at the data with a hardware sniffer, you would see alternating DATA0/DATA1 packets with wMaxPacketSize bytes of payload. The reassembly is handled by the host controller, and USBPcap (or usbmon if on Linux) never really sees the individual DATA packets.
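As an illustration of what the host controller hides from you, the sketch below splits one transfer into the DATA packets a hardware sniffer would show. It is a simplification under stated assumptions: it takes 16407 as the transfer length, starts the data toggle at DATA0 (on a real bus the toggle carries over from the previous transfer), and ignores zero-length packets. `split_bulk_transfer` is a made-up name.

```python
from typing import List, Tuple


def split_bulk_transfer(total_len: int, max_packet_size: int) -> List[Tuple[str, int]]:
    """Split a transfer into (PID name, payload length) pairs.

    Assumes the data toggle starts at DATA0 for this transfer.
    """
    packets: List[Tuple[str, int]] = []
    offset = 0
    toggle = 0
    while offset < total_len:
        size = min(max_packet_size, total_len - offset)
        packets.append((f"DATA{toggle}", size))
        offset += size
        toggle ^= 1  # DATA0 <-> DATA1
    return packets


high_speed = split_bulk_transfer(16407, 512)
full_speed = split_bulk_transfer(16407, 64)
print(len(high_speed), high_speed[0], high_speed[-1])
print(len(full_speed), full_speed[-1])
```

Under these assumptions the single URB corresponds to 33 packets on a High-Speed bus or 257 on a Full-Speed one, with a short 23-byte packet ending the transfer in both cases.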
If you want to see which actual USB packets are transmitted on the bus and how they compare to URBs, check out the USB Link Layer Sample Captures, which contain multiple pcap files showing exactly the same traffic captured by different sniffers.