preprocess.py
-
class ucas_dm.preprocess.preprocess.PreProcessor(source_data_path)
Bases: object
-
classmethod build_tf_idf(id_tokens)
Builds TF-IDF vectors for news articles.
Parameters: id_tokens – A pandas.DataFrame containing each news id and its tokens (column 1: news_id, column 2: tokens).
Returns: A dict with two keys:
"id_tfvec": a pandas.DataFrame containing each news id and its TF-IDF vector (column 1: news_id, column 2: tf_vec).
"gensim_pack": a dict with the keys "word2dict" and "corpus", both of which are needed if the gensim package is used for further processing.
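To illustrate the kind of computation build_tf_idf describes, here is a minimal standard-library sketch of TF-IDF weighting. It is not the ucas_dm implementation and does not use gensim or pandas: the function name build_tf_idf_sketch, the list of (news_id, tokens) pairs standing in for the DataFrame, the "word2id" key, and the exact weighting formula (relative term frequency times log(N / document frequency)) are all assumptions made for this example.

```python
import math
from collections import Counter


def build_tf_idf_sketch(id_tokens):
    """Hypothetical helper: id_tokens is a list of (news_id, tokens) pairs."""
    n_docs = len(id_tokens)

    # Document frequency: in how many documents each token occurs.
    doc_freq = Counter()
    for _, tokens in id_tokens:
        doc_freq.update(set(tokens))

    # Assign each token an integer id (sorted so the mapping is deterministic).
    word2id = {tok: i for i, tok in enumerate(sorted(doc_freq))}

    id_tfvec = []
    for news_id, tokens in id_tokens:
        counts = Counter(tokens)
        total = len(tokens)
        vec = {}
        for tok, count in counts.items():
            tf = count / total                       # relative term frequency
            idf = math.log(n_docs / doc_freq[tok])   # rarer tokens weigh more
            weight = tf * idf
            if weight > 0:                           # keep the vector sparse
                vec[word2id[tok]] = weight
        id_tfvec.append((news_id, vec))

    return {"id_tfvec": id_tfvec, "word2id": word2id}


# Made-up example rows, mirroring the news_id / tokens columns.
rows = [
    (1, ["stock", "market", "rises"]),
    (2, ["stock", "market", "falls"]),
    (3, ["football", "match", "tonight"]),
]
result = build_tf_idf_sketch(rows)
```

In this sketch a token that occurs in only one article ("rises") receives a higher weight than one shared by several articles ("stock"), which is the property that makes TF-IDF vectors useful for comparing news items.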
-
extract_news()
Extracts news from the data and saves it to a CSV file.
Returns: A pandas.DataFrame with two columns: news_id and content.
-
classmethod