PREPROCESS DATA BY TEXT SIMILARITY