You are on your way to implementing an RTP receiving endpoint, with all the niceties of jitter buffering, wander, codec conversion, etc. This can be a complex beast. Since you already have some bits and pieces, let's focus on the handling of silence suppression. For this to work you need to keep your own sample playout time and track that timebase against the timestamps of the incoming packets. If there is no packet for the current playout time, you'll have to insert the appropriate amount of silence yourself (based on the latest comfort noise parameters received). If there is one, it needs to be interpreted, either as an update of the comfort noise info or as a packet full of audio samples.
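
To make the idea concrete, here is a minimal sketch in Python of a playout clock driven by your own timebase. It is not a full jitter buffer. The names `Packet`, `Playout`, `push` and `next_frame` are made up for illustration, and it assumes 20 ms frames at 8000 Hz, packets whose timestamps land exactly on frame boundaries, and comfort noise packets on payload type 13 whose first byte carries the noise level in -dBov (as in RFC 3389). Reordering, late packets and loss concealment are left out.

```python
import random
from dataclasses import dataclass

CN_PAYLOAD_TYPE = 13      # static payload type for comfort noise (RFC 3389)
SAMPLES_PER_FRAME = 160   # assumption: 20 ms frames at 8000 Hz


@dataclass
class Packet:
    timestamp: int        # RTP timestamp, in sample units
    payload_type: int
    payload: bytes


class Playout:
    """Hypothetical playout clock; illustrative only."""

    def __init__(self, start_timestamp):
        self.play_ts = start_timestamp   # our own playout timebase
        self.noise_level = 127           # -dBov; 127 is the quietest level
        self.buffer = {}                 # RTP timestamp -> Packet

    def push(self, packet):
        # Called whenever a packet arrives from the network.
        self.buffer[packet.timestamp] = packet

    def comfort_noise(self):
        # White noise scaled to the last received level.
        # Any spectral parameters in the CN payload are ignored here.
        amplitude = 32767 * 10 ** (-self.noise_level / 20)
        return [int(random.uniform(-amplitude, amplitude))
                for _ in range(SAMPLES_PER_FRAME)]

    def next_frame(self, decode):
        # Called once per frame period by the audio output side.
        packet = self.buffer.pop(self.play_ts, None)
        # Advance our own clock whether or not a packet was there.
        self.play_ts = (self.play_ts + SAMPLES_PER_FRAME) & 0xFFFFFFFF
        if packet is None:
            # Nothing for this slot: fill the gap ourselves.
            return "noise", self.comfort_noise()
        if packet.payload_type == CN_PAYLOAD_TYPE:
            # Comfort noise update: remember the new level, then fill.
            if packet.payload:
                self.noise_level = packet.payload[0] & 0x7F
            return "noise", self.comfort_noise()
        # Ordinary audio packet: hand it to the codec.
        return "audio", decode(packet.payload)
```

The important design point is that `next_frame` advances `play_ts` on every tick regardless of what arrived, so the gap left by suppressed silence is measured by your clock and not by the arrival of packets. In a real receiver `decode` would be your codec's decoder, and the lookup would have to tolerate timestamps that do not fall exactly on your frame grid.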

Timekeeping at your end is probably the crucial part here.