preprocess.py

class ucas_dm.preprocess.preprocess.PreProcessor(source_data_path)

Bases: object

classmethod build_tf_idf(id_tokens)

This method builds TF-IDF vectors for news items.

Parameters: id_tokens – A pandas.DataFrame containing news ids and their tokens. |column1: news_id|column2: tokens|
Returns: A dict with two keys:
"id_tfvec": a pandas.DataFrame containing news ids and their TF-IDF vectors |column1: news_id|column2: tf_vec|
"gensim_pack": a dict {"word2dict": ..., "corpus": ...}; both entries are needed if the gensim package is used for further processing
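
To make the shape of the input and output concrete, the following is a minimal pure-Python sketch of TF-IDF weighting. It is not the implementation behind build_tf_idf: the function name build_tf_idf_sketch, the use of plain (news_id, tokens) pairs instead of a pandas.DataFrame, and the weighting (relative term frequency times the natural log of inverse document frequency) are all assumptions made for illustration. The actual method may weight and normalize differently.

```python
import math
from collections import Counter


def build_tf_idf_sketch(id_tokens):
    """Hypothetical helper: map each news_id to a {token: weight} dict.

    id_tokens is a list of (news_id, tokens) pairs, standing in for the
    two-column DataFrame described above.
    """
    n_docs = len(id_tokens)

    # Document frequency: in how many documents each token appears.
    doc_freq = Counter()
    for _, tokens in id_tokens:
        doc_freq.update(set(tokens))

    vectors = {}
    for news_id, tokens in id_tokens:
        term_counts = Counter(tokens)
        # Weight = relative term frequency * log(inverse document frequency).
        vectors[news_id] = {
            token: (count / len(tokens)) * math.log(n_docs / doc_freq[token])
            for token, count in term_counts.items()
        }
    return vectors


# Toy input, invented for this example.
rows = [
    (1, ["stock", "market", "stock"]),
    (2, ["market", "weather"]),
]
vectors = build_tf_idf_sketch(rows)
```

In this toy input, "market" occurs in every document, so its weight is zero in both vectors, while "stock" and "weather" each occur in only one document and receive positive weights.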
extract_news()

This method extracts news from the source data and saves it to a CSV file.

Returns: A pandas.DataFrame with two columns: news_id and content
extract_view_log()

This method extracts the user view log from the source data and saves it to a CSV file.

Returns: A pandas.DataFrame with three columns: user_id, news_id, view_time
classmethod generate_tokens(id_content)

This method generates tokens for news items.

Parameters: id_content – A pandas.DataFrame of news ids (integer) and their content (string) |column1: news_id|column2: content|
Returns: A pandas.DataFrame of news ids and their tokens
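
The mapping from content to tokens can be sketched as follows. This is an illustration only: the tokenizer that generate_tokens actually uses is not stated here, and the name generate_tokens_sketch, the regular-expression word split, and the use of (news_id, content) pairs in place of a pandas.DataFrame are assumptions. Text in a language that does not separate words with whitespace would need a dedicated word segmenter instead.

```python
import re


def generate_tokens_sketch(id_content):
    """Hypothetical helper: turn (news_id, content) pairs into (news_id, tokens) pairs."""
    return [
        # Lowercase the text, then keep runs of word characters as tokens.
        (news_id, re.findall(r"\w+", content.lower()))
        for news_id, content in id_content
    ]


# Toy input, invented for this example.
id_tokens = generate_tokens_sketch([(1, "Stocks rise, markets cheer!")])
```

The resulting (news_id, tokens) pairs have the same two-field layout that build_tf_idf expects as its id_tokens argument.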