topic_based_algo.py

class ucas_dm.prediction_algorithms.topic_based_algo.InitialParams(**kwargs)[source]

Bases: object

This class contains some necessary data for the initialization of class TopicBasedAlgo.

__init__(**kwargs)[source]

Initialize self. See help(type(self)) for accurate signature.

classmethod load(fname)[source]

Load a previously saved object from a file.

Parameters: fname – file path
Returns: the object loaded from the file
save(fname)[source]

Save the initial parameters to a file.

Parameters: fname – file path
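
The save/load pair above can be pictured as a simple round trip. The sketch below is only an illustration: ParamsSketch and round_trip are hypothetical names, and JSON is an assumed storage format chosen for readability, since the format InitialParams actually uses is not stated here.

```python
import json
import os
import tempfile


class ParamsSketch:
    """Hypothetical stand-in for InitialParams: keeps whatever keyword arguments it is given."""

    def __init__(self, **kwargs):
        self.__dict__.update(kwargs)

    def save(self, fname):
        # Assumption: JSON is used only to keep the sketch transparent.
        with open(fname, "w", encoding="utf-8") as f:
            json.dump(self.__dict__, f)

    @classmethod
    def load(cls, fname):
        # Rebuild an instance from the attributes stored by save().
        with open(fname, "r", encoding="utf-8") as f:
            return cls(**json.load(f))


def round_trip(params):
    """Save params to a temporary file, load them back, and return the loaded copy."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "params.json")
        params.save(path)
        return ParamsSketch.load(path)
```

Because load is a classmethod, it is called on the class itself rather than on an existing instance, which matches the signatures listed above.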
class ucas_dm.prediction_algorithms.topic_based_algo.TopicBasedAlgo(initial_params, topic_n=100, chunksize=100, topic_type='lda', power_iters=2, extra_samples=100, passes=1)[source]

Bases: ucas_dm.prediction_algorithms.base_algo.BaseAlgo

Content-based algorithm that uses a topic model (LSI or LDA). It follows a delegation strategy.

__init__(initial_params, topic_n=100, chunksize=100, topic_type='lda', power_iters=2, extra_samples=100, passes=1)[source]
Parameters:
  • initial_params – An instance of InitialParams generated by preprocess
  • topic_n – The number of requested latent topics to be extracted from the training corpus.
  • chunksize – Number of documents to be used in each training chunk.
  • topic_type – The topic model to use: ‘lsi’ or ‘lda’.
  • power_iters – (LSI parameter) Number of power iteration steps to be used. Increasing the number of power iterations improves accuracy but slows training.
  • extra_samples – (LSI parameter) Extra samples to be used besides the rank k. Can improve accuracy.
  • passes – (LDA parameter) Number of passes through the corpus during training.
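
The parameter list above marks some settings as LSI-only and others as LDA-only. The following sketch shows one way to read that split; select_model_kwargs is a hypothetical helper, not part of the package, and whether TopicBasedAlgo validates or silently ignores the settings that do not apply is an assumption left open here.

```python
def select_model_kwargs(topic_type, topic_n=100, chunksize=100,
                        power_iters=2, extra_samples=100, passes=1):
    """Return only the settings that apply to the chosen topic model (hypothetical helper)."""
    # topic_n and chunksize are documented without an LSI/LDA qualifier,
    # so they are treated as common to both models.
    common = {"topic_n": topic_n, "chunksize": chunksize}
    if topic_type == "lsi":
        return {**common, "power_iters": power_iters, "extra_samples": extra_samples}
    if topic_type == "lda":
        return {**common, "passes": passes}
    # Assumption: any other value is treated as an error in this sketch.
    raise ValueError("topic_type must be 'lsi' or 'lda', got %r" % (topic_type,))
```

With the defaults, select_model_kwargs("lda") keeps passes and drops power_iters and extra_samples, while select_model_kwargs("lsi") does the opposite.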
_generate_item_vector()[source]

Use the LDA or LSI algorithm to transform the TF-IDF vectors into new item vectors.

Returns: A DataFrame containing each item id and its new vector
static _rebuild_vector(partial_vector, dim)[source]
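
No description is given for _rebuild_vector. One plausible reading of its signature, and it is only an assumption, is that it expands a sparse list of (index, value) pairs into a dense vector of length dim. The sketch below illustrates that reading with a hypothetical function, rebuild_vector, and should not be taken as the package's implementation.

```python
def rebuild_vector(partial_vector, dim):
    """Expand sparse (index, value) pairs into a dense list of length dim.

    Assumption: partial_vector lists only the non-zero topic weights,
    and every index is in the range 0 .. dim - 1.
    """
    dense = [0.0] * dim
    for index, value in partial_vector:
        dense[index] = float(value)
    return dense
```

Under this reading, topics that are absent from the sparse input simply keep a weight of zero, so every item ends up with a vector of the same length.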
classmethod load(fname)[source]

Load a previously saved object from a file.

Parameters: fname – file path
Returns: the object loaded from the file
classmethod preprocess(raw_data)[source]

Call this method to process the raw data, which contains item ids and their content, before initializing a TopicBasedAlgo instance.

Parameters: raw_data – A pandas.DataFrame containing item ids and content, with columns | id | content |
Returns: An InitialParams instance, which is a required argument when initializing TopicBasedAlgo.
save(fname, *args)[source]

Save an object to a file.

Parameters:
  • fname – file path
  • ignore – a set of attributes that shouldn’t be saved by the superclass, but that a subclass may have to handle itself.
to_dict()[source]

See BaseAlgo.to_dict for more details.

top_k_recommend(u_id, k)[source]

Calculate the top-k recommended items.

Parameters:
  • u_id – the user’s id
  • k – the number of items the recommender should return
Returns:

(v, id), where v is a list of predicted ratings or distances and id is a list of the top-k highest-rated or nearest items
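
The return value is a pair of parallel lists. The sketch below shows that shape with a hypothetical function, top_k_sketch, which ranks items by cosine similarity to a user vector. Both the similarity measure and the existence of a per-user vector are assumptions made for illustration; how TopicBasedAlgo actually scores items is not specified here.

```python
import heapq
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_sketch(user_vector, item_vectors, k):
    """Return (v, ids): the k best scores and the matching item ids, best first.

    item_vectors maps item id -> dense vector (hypothetical input format).
    """
    scored = [(cosine_similarity(user_vector, vec), item_id)
              for item_id, vec in item_vectors.items()]
    # Rank by score only, so item ids never need to be comparable.
    best = heapq.nlargest(k, scored, key=lambda pair: pair[0])
    v = [score for score, _ in best]
    ids = [item_id for _, item_id in best]
    return v, ids
```

The i-th entry of v belongs to the i-th entry of ids, so callers can zip the two lists to pair each recommended item with its score.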

train(train_set)[source]

Do train-set-dependent work here, for example calculating similarities between users or items.

Parameters: train_set – A pandas.DataFrame with two attributes, user_id and item_id, which represent the users’ view records over a period of time.
Returns: A model that is ready to give recommendations
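
Putting the entries on this page together, a typical call sequence might look like the sketch below. The wrapper recommend_for_user is hypothetical, and the column names in the comments are taken from the parameter descriptions above. The import is done lazily inside the function so the sketch can be read without the package installed; actual behaviour depends on the installed version of ucas_dm.

```python
def recommend_for_user(raw_data, train_set, u_id, k=10, topic_type="lda"):
    """Run preprocess -> init -> train -> top_k_recommend (hypothetical wrapper).

    raw_data:  pandas.DataFrame with columns id and content.
    train_set: pandas.DataFrame with columns user_id and item_id.
    """
    # Lazy import: only needed when the function is actually called.
    from ucas_dm.prediction_algorithms.topic_based_algo import TopicBasedAlgo

    # Step 1: build the InitialParams instance from the raw item content.
    initial_params = TopicBasedAlgo.preprocess(raw_data)
    # Step 2: create the algorithm with the chosen topic model.
    algo = TopicBasedAlgo(initial_params, topic_n=100, topic_type=topic_type)
    # Step 3: train on the view records; train is documented as returning a ready model.
    model = algo.train(train_set)
    # Step 4: ask for the top-k items as a (v, id) pair of lists.
    return model.top_k_recommend(u_id, k)
```

Note that preprocess is called on the class before any instance exists, because its result is needed to construct the instance.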