Unable to replicate performance #4

Open
deklanw opened this issue Dec 19, 2020 · 3 comments

@deklanw

deklanw commented Dec 19, 2020

I've attempted a PyTorch reimplementation for the recsys framework RecBole (see RUCAIBox/RecBole#594), so that it's convenient to compare against other algorithms, etc.

I replicated your experiment almost exactly, afaict: MovieLens100k, 70-20-10 split, early stopping on Recall@20. The only difference I see is that I didn't remove users with few interactions, as the paper says to:

> For this dataset, we maintain users with at least 5 interactions.
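
For concreteness, that filtering is just a per-user interaction-count threshold. A minimal pandas sketch (the file path and column names are placeholders, not RecBole's actual field names):

```python
# Illustrative sketch of the "at least 5 interactions per user" filtering.
# File path and column names are placeholders, not RecBole's field names.
import pandas as pd

inter = pd.read_csv("ml-100k.inter", sep="\t")  # columns assumed: user_id, item_id, ...
user_counts = inter.groupby("user_id")["item_id"].transform("size")
inter_5core = inter[user_counts >= 5].reset_index(drop=True)  # keep users with >= 5 interactions
```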

I used HyperOpt to do a search over the hyperparameter ranges specified in the paper (with an added option for dropout probability between 0.1 and 0.5), limited to 50 trials.
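
The search itself was a standard Hyperopt `fmin`, roughly along these lines (the ranges shown are illustrative stand-ins for the paper's, and `train_and_eval` is an illustrative helper, not the exact RecBole hook I used):

```python
# Sketch of the Hyperopt search. train_and_eval(params) is an illustrative
# helper that trains the model once (with early stopping) and returns
# validation Recall@20; the ranges below are stand-ins, not the paper's exact ones.
from hyperopt import fmin, tpe, hp, Trials

space = {
    "learning_rate": hp.loguniform("learning_rate", -7, -3),
    "reg_weight": hp.loguniform("reg_weight", -12, -4),
    "embedding_size": hp.choice("embedding_size", [32, 64, 128]),
    "n_layers": hp.choice("n_layers", [1, 2, 3, 4]),
    "dropout_prob": hp.uniform("dropout_prob", 0.1, 0.5),
}

def objective(params):
    recall_at_20 = train_and_eval(params)
    return -recall_at_20  # fmin minimizes, so negate the metric we want to maximize

best = fmin(objective, space, algo=tpe.suggest, max_evals=50, trials=Trials())
print(best)
```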

DGCF results:

best params:  {'dropout_prob': 0.24266119278104079, 'embedding_size': 128, 'learning_rate': 0.0016153742760160951, 'n_layers': 2, 'reg_weight': 2.031773354290135e-05}

'test_result': {'recall@20': 0.3248, 'mrr@20': 0.5986, 'ndcg@20': 0.3795, 'hit@20': 0.9618, 'precision@20': 0.2608}

I did the same for LightGCN

LightGCN results:

best params:  {'embedding_size': 128, 'learning_rate': 0.002856632032475591, 'n_layers': 2, 'reg_weight': 1.43923729841778e-05}

'test_result': {'recall@20': 0.3336, 'mrr@20': 0.6135, 'ndcg@20': 0.3868, 'hit@20': 0.9724, 'precision@20': 0.2629}

These figures are quite different from those in your paper (the NDCG especially), and notably LightGCN wins on every metric.

Is there anything not written in the paper that I might be missing in my implementation?

And, btw, are you applying node dropout to LightGCN (even though it wasn't part of the original algorithm, afaik)?

Thanks for any help!

@JimLiu96
Owner

Thanks for sharing your results. I didn't run the experiment with HyperOpt. Do your results stay stable under these params? From my experience, the reg weight for the best performance should be around 1e-2.

@deklanw
Author

deklanw commented Dec 22, 2020

I realized that I wasn't disabling the dropout during evaluation. I fixed it (a sketch of the fix is below, after the results) and ran Hyperopt for 100 iterations this time, to be extra sure. The results are about the same:

DGCF

best params:  {'dropout_prob': 0.023892894735004354, 'embedding_size': 64, 'learning_rate': 0.006279775923826556, 'n_layers': 3, 'reg_weight': 6.334030498942448e-05}

'test_result': {'recall@20': 0.3254, 'mrr@20': 0.5948, 'ndcg@20': 0.3778, 'hit@20': 0.965, 'precision@20': 0.259}

LightGCN

best params:  {'embedding_size': 128, 'learning_rate': 0.004966963171170461, 'n_layers': 4, 'reg_weight': 0.0013284118691326246}

'test_result': {'recall@20': 0.3339, 'mrr@20': 0.6074, 'ndcg@20': 0.3866, 'hit@20': 0.9714, 'precision@20': 0.2655}
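
The dropout fix mentioned above is just the standard PyTorch train/eval mode switch; a minimal sketch (`evaluate` and `compute_recall_at_20` are illustrative names):

```python
# Sketch of the fix: dropout is only active in training mode, so the model
# has to be switched to eval mode before computing validation/test metrics.
# compute_recall_at_20 is an illustrative name, not RecBole API.
import torch

def evaluate(model, data_loader):
    model.eval()                  # disables dropout
    with torch.no_grad():         # no gradients needed for metric computation
        recall = compute_recall_at_20(model, data_loader)
    model.train()                 # restore training mode for the next epoch
    return recall
```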

> I didn't run the experiment with HyperOpt. Do your results stay stable under these params? From my experience, the reg weight for the best performance should be around 1e-2.

I'm using Hyperopt (instead of a naive grid search) to speed up the hyperparameter search. A reg weight of around 1e-2 was tested during Hyperopt's search.

It's possible there's some other mistake in my implementation; I'm just not sure what it could be.

@JimLiu96
Owner

Actually, I use early stopping to control the training. Also, from my experience, the results are unstable on the ml100k dataset; I'm not sure how HyperOpt handles that. The reported results are the best ones for all the different models.
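
By early stopping I mean the usual patience counter on validation Recall@20, roughly like this (a sketch with illustrative `train_one_epoch` / `evaluate_recall` helpers and loaders, not the exact code in this repo):

```python
# Sketch of early stopping on validation Recall@20 with a patience counter.
# train_one_epoch, evaluate_recall, and the loaders are illustrative names.
best_recall, best_state, bad_epochs, patience = 0.0, None, 0, 10

for epoch in range(max_epochs):
    train_one_epoch(model, train_loader, optimizer)
    recall = evaluate_recall(model, valid_loader)    # validation Recall@20
    if recall > best_recall:
        best_recall, bad_epochs = recall, 0
        best_state = {k: v.clone() for k, v in model.state_dict().items()}
    else:
        bad_epochs += 1
        if bad_epochs >= patience:                   # no improvement for `patience` epochs
            break

model.load_state_dict(best_state)                    # report metrics from the best checkpoint
```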
