Blog (2)
Large language models (LLMs) have exhibited impressive in-context learning abilities [1]. Inspired by these successes, recent studies [2-5] have extended LLM applications to text-to-speech (TTS) systems by representing speech through discrete acoustic codes.
Zero-shot sketch-based image retrieval (ZS-SBIR) is central to sketch understanding [6]. This paper aims to tackle all of the key settings in the current ZS-SBIR landscape: category-level (standard) [4], fine-grained [1], and cross-dataset [3] retrieval.
Publications (21)
Distilling Knowledge from Text-to-Image Generative Models Improves Visio-Linguistic Reasoning in CLIP
Author: Shell Xu Hu
Published: Conference on Empirical Methods in Natural Language Processing (EMNLP)
Date: 2024-11-13
CLIP-DPO: Vision-Language Models as a Source of Preference for Fixing Hallucinations in LVLMs
Authors: Yassine Ouali, Adrian Bulat, Brais Martinez, Georgios Tzimiropoulos
Published: European Conference on Computer Vision (ECCV)
Date: 2024-09-30
Modularized Multilingual NMT with Fine-grained Interlingua
Authors: Sungjun Lim, Yoonjung Choi, Sangha Kim
Published: North American Chapter of the Association for Computational Linguistics (NAACL)
Date: 2024-06-20
News (6)
Voice cloning, especially zero-shot speech synthesis, has become one of the most exciting frontiers in speech technology.
Recently, personalized AI systems have gained significant attention. In the TTS field, zero-shot text-to-speech (ZS-TTS) systems [1-7] let users build personalized TTS systems that replicate their voice from just one utterance, without any further training.