Lessons from Testing Contextual Retrieval on Multiple Datasets