Signed-off-by: David Davó <david@ddavo.me>
Collaborator
Author
The tests fail because of a problem with the dependencies, but I have not touched the dependencies.
Collaborator
Author
This seems to be an open issue in PyTorch. It is simply faster to do the batching "by hand" instead of using the Dataset class. Even the official TensorDataset goes through a collate function built on a Python list comprehension, which is much slower than working directly with tensors and slices.
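For illustration, a minimal sketch of the two batching paths being compared, assuming PyTorch is installed. The `slice_batches` helper and the toy `users`/`items` tensors are hypothetical and are not taken from this PR. The sketch only checks that both paths yield the same batches and makes no timing claim.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy data (hypothetical): ten user ids and ten item ids.
users = torch.arange(10)
items = torch.arange(10) * 2
batch_size = 4

# Path 1: TensorDataset + DataLoader. The loader fetches samples one index
# at a time and then collates them into batch tensors.
loader = DataLoader(TensorDataset(users, items), batch_size=batch_size, shuffle=False)
loader_batches = [(u, i) for u, i in loader]


def slice_batches(*tensors, batch_size):
    """Hypothetical helper: yield batches by slicing each tensor directly."""
    n = tensors[0].shape[0]
    for start in range(0, n, batch_size):
        yield tuple(t[start:start + batch_size] for t in tensors)


# Path 2: batching "by hand" with tensor slices, no per-sample collation.
manual_batches = list(slice_batches(users, items, batch_size=batch_size))

# Both paths should produce identical batches when shuffling is disabled.
same = all(
    torch.equal(lu, mu) and torch.equal(li, mi)
    for (lu, li), (mu, mi) in zip(loader_batches, manual_batches)
)
```

To shuffle with the manual path, one option is to permute the tensors once per epoch (for example with `torch.randperm`) before slicing.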
Collaborator
@daviddavo I'm testing it on my local machine.
"Western",
)

# 1m data occupation index to string mapper. For 100k, the occupation labels are already in the dataset.
Collaborator
@daviddavo is the extra info only available for 100k and 1M?
@@ -47,6 +47,7 @@ def __init__(
has_header=False,
Collaborator
I got these errors:
(recommenders311) miguel@miguel:~/MS/recommenders$ pytest tests/data_validation/recommenders/datasets/test_movielens.py --disable-warnings --durations 0
========================================================================= test session starts =========================================================================
platform linux -- Python 3.11.9, pytest-8.2.2, pluggy-1.5.0
rootdir: /home/miguel/MS/recommenders
configfile: pyproject.toml
plugins: anyio-4.4.0, cov-5.0.0, typeguard-4.3.0, hypothesis-6.104.2, mock-3.14.0
collected 71 items
tests/data_validation/recommenders/datasets/test_movielens.py ...................................FFFFFF..FFFF....................FF.. [100%]
=================================== FAILURES ====================================
___________________ test_download_and_extract_movielens[100k] ___________________
size = '100k', tmp = '/tmp/pytest-of-miguel/pytest-77/tmphd6aiueo'
@pytest.mark.parametrize("size", ["100k", "1m", "10m", "20m"])
def test_download_and_extract_movielens(size, tmp):
"""Test movielens data download and extract"""
zip_path = os.path.join(tmp, "ml.zip")
download_movielens(size, dest_path=zip_path)
assert len(os.listdir(tmp)) == 1
assert os.path.exists(zip_path) is True
rating_path = os.path.join(tmp, "rating.dat")
item_path = os.path.join(tmp, "item.dat")
> extract_movielens(
size, rating_path=rating_path, item_path=item_path, zip_path=zip_path
)
E TypeError: extract_movielens() missing 1 required positional argument: 'user_path'
tests/data_validation/recommenders/datasets/test_movielens.py:125: TypeError
----------------------------- Captured stderr call ------------------------------
100%|██████████| 4.81k/4.81k [00:01<00:00, 3.97kKB/s]
____________________ test_download_and_extract_movielens[1m] ____________________
size = '1m', tmp = '/tmp/pytest-of-miguel/pytest-77/tmpt47xilf3'
@pytest.mark.parametrize("size", ["100k", "1m", "10m", "20m"])
def test_download_and_extract_movielens(size, tmp):
"""Test movielens data download and extract"""
zip_path = os.path.join(tmp, "ml.zip")
download_movielens(size, dest_path=zip_path)
assert len(os.listdir(tmp)) == 1
assert os.path.exists(zip_path) is True
rating_path = os.path.join(tmp, "rating.dat")
item_path = os.path.join(tmp, "item.dat")
> extract_movielens(
size, rating_path=rating_path, item_path=item_path, zip_path=zip_path
)
E TypeError: extract_movielens() missing 1 required positional argument: 'user_path'
tests/data_validation/recommenders/datasets/test_movielens.py:125: TypeError
----------------------------- Captured stderr call ------------------------------
100%|██████████| 5.78k/5.78k [00:01<00:00, 4.59kKB/s]
___________________ test_download_and_extract_movielens[10m] ____________________
size = '10m', tmp = '/tmp/pytest-of-miguel/pytest-77/tmpjrveosih'
@pytest.mark.parametrize("size", ["100k", "1m", "10m", "20m"])
def test_download_and_extract_movielens(size, tmp):
"""Test movielens data download and extract"""
zip_path = os.path.join(tmp, "ml.zip")
download_movielens(size, dest_path=zip_path)
assert len(os.listdir(tmp)) == 1
assert os.path.exists(zip_path) is True
rating_path = os.path.join(tmp, "rating.dat")
item_path = os.path.join(tmp, "item.dat")
> extract_movielens(
size, rating_path=rating_path, item_path=item_path, zip_path=zip_path
)
E TypeError: extract_movielens() missing 1 required positional argument: 'user_path'
tests/data_validation/recommenders/datasets/test_movielens.py:125: TypeError
----------------------------- Captured stderr call ------------------------------
100%|██████████| 64.0k/64.0k [00:03<00:00, 16.4kKB/s]
___________________ test_download_and_extract_movielens[20m] ____________________
size = '20m', tmp = '/tmp/pytest-of-miguel/pytest-77/tmpw2vclyex'
@pytest.mark.parametrize("size", ["100k", "1m", "10m", "20m"])
def test_download_and_extract_movielens(size, tmp):
"""Test movielens data download and extract"""
zip_path = os.path.join(tmp, "ml.zip")
download_movielens(size, dest_path=zip_path)
assert len(os.listdir(tmp)) == 1
assert os.path.exists(zip_path) is True
rating_path = os.path.join(tmp, "rating.dat")
item_path = os.path.join(tmp, "item.dat")
> extract_movielens(
size, rating_path=rating_path, item_path=item_path, zip_path=zip_path
)
E TypeError: extract_movielens() missing 1 required positional argument: 'user_path'
tests/data_validation/recommenders/datasets/test_movielens.py:125: TypeError
----------------------------- Captured stderr call ------------------------------
100%|██████████| 194k/194k [00:10<00:00, 18.7kKB/s]
_ test_load_pandas_df[100k-100000-1682-1-Toy Story (1995)-Animation|Children's|Comedy-1995] _
size = '100k', num_samples = 100000, num_movies = 1682, movie_example = 1
title_example = 'Toy Story (1995)'
genres_example = "Animation|Children's|Comedy", year_example = '1995'
tmp = '/tmp/pytest-of-miguel/pytest-77/tmp0woedy4a'
@pytest.mark.parametrize(
"size, num_samples, num_movies, movie_example, title_example, genres_example, year_example",
[
(
"100k",
100000,
1682,
1,
"Toy Story (1995)",
"Animation|Children's|Comedy",
"1995",
),
(
"1m",
1000209,
3883,
1,
"Toy Story (1995)",
"Animation|Children's|Comedy",
"1995",
),
(
"10m",
10000054,
10681,
1,
"Toy Story (1995)",
"Adventure|Animation|Children|Comedy|Fantasy",
"1995",
),
(
"20m",
20000263,
27278,
1,
"Toy Story (1995)",
"Adventure|Animation|Children|Comedy|Fantasy",
"1995",
),
],
)
def test_load_pandas_df(
size,
num_samples,
num_movies,
movie_example,
title_example,
genres_example,
year_example,
tmp,
):
"""Test MovieLens dataset load as pd.DataFrame"""
# Test if correct data are loaded
header = ["a", "b", "c"]
df = load_pandas_df(size=size, local_cache_path=tmp, header=header)
assert len(df) == num_samples
assert len(df.columns) == len(header)
# Test if raw-zip file, rating file, and item file are cached
> assert len(os.listdir(tmp)) == 3
E AssertionError: assert 4 == 3
E + where 4 = len(['u.data', 'u.user', 'u.item', 'ml-100k.zip'])
E + where ['u.data', 'u.user', 'u.item', 'ml-100k.zip'] = <built-in function listdir>('/tmp/pytest-of-miguel/pytest-77/tmp0woedy4a')
E + where <built-in function listdir> = os.listdir
tests/data_validation/recommenders/datasets/test_movielens.py:192: AssertionError
----------------------------- Captured stderr call ------------------------------
100%|██████████| 4.81k/4.81k [00:01<00:00, 4.20kKB/s]
_ test_load_pandas_df[1m-1000209-3883-1-Toy Story (1995)-Animation|Children's|Comedy-1995] _
size = '1m', num_samples = 1000209, num_movies = 3883, movie_example = 1
title_example = 'Toy Story (1995)'
genres_example = "Animation|Children's|Comedy", year_example = '1995'
tmp = '/tmp/pytest-of-miguel/pytest-77/tmpnjmy_mcj'
@pytest.mark.parametrize(
"size, num_samples, num_movies, movie_example, title_example, genres_example, year_example",
[
(
"100k",
100000,
1682,
1,
"Toy Story (1995)",
"Animation|Children's|Comedy",
"1995",
),
(
"1m",
1000209,
3883,
1,
"Toy Story (1995)",
"Animation|Children's|Comedy",
"1995",
),
(
"10m",
10000054,
10681,
1,
"Toy Story (1995)",
"Adventure|Animation|Children|Comedy|Fantasy",
"1995",
),
(
"20m",
20000263,
27278,
1,
"Toy Story (1995)",
"Adventure|Animation|Children|Comedy|Fantasy",
"1995",
),
],
)
def test_load_pandas_df(
size,
num_samples,
num_movies,
movie_example,
title_example,
genres_example,
year_example,
tmp,
):
"""Test MovieLens dataset load as pd.DataFrame"""
# Test if correct data are loaded
header = ["a", "b", "c"]
df = load_pandas_df(size=size, local_cache_path=tmp, header=header)
assert len(df) == num_samples
assert len(df.columns) == len(header)
# Test if raw-zip file, rating file, and item file are cached
> assert len(os.listdir(tmp)) == 3
E AssertionError: assert 4 == 3
E + where 4 = len(['users.dat', 'ml-1m.zip', 'ratings.dat', 'movies.dat'])
E + where ['users.dat', 'ml-1m.zip', 'ratings.dat', 'movies.dat'] = <built-in function listdir>('/tmp/pytest-of-miguel/pytest-77/tmpnjmy_mcj')
E + where <built-in function listdir> = os.listdir
tests/data_validation/recommenders/datasets/test_movielens.py:192: AssertionError
----------------------------- Captured stderr call ------------------------------
100%|██████████| 5.78k/5.78k [00:01<00:00, 4.63kKB/s]
_ test_load_item_df[100k-1682-1-Toy Story (1995)-Animation|Children's|Comedy-1995] _
size = '100k', num_movies = 1682, movie_example = 1
title_example = 'Toy Story (1995)'
genres_example = "Animation|Children's|Comedy", year_example = '1995'
tmp = '/tmp/pytest-of-miguel/pytest-77/tmpq4bt6q3n'
@pytest.mark.parametrize(
"size, num_movies, movie_example, title_example, genres_example, year_example",
[
("100k", 1682, 1, "Toy Story (1995)", "Animation|Children's|Comedy", "1995"),
("1m", 3883, 1, "Toy Story (1995)", "Animation|Children's|Comedy", "1995"),
(
"10m",
10681,
1,
"Toy Story (1995)",
"Adventure|Animation|Children|Comedy|Fantasy",
"1995",
),
(
"20m",
27278,
1,
"Toy Story (1995)",
"Adventure|Animation|Children|Comedy|Fantasy",
"1995",
),
],
)
def test_load_item_df(
size,
num_movies,
movie_example,
title_example,
genres_example,
year_example,
tmp,
):
"""Test movielens item data load (not rating data)"""
> df = load_item_df(size, local_cache_path=tmp, title_col="title")
tests/data_validation/recommenders/datasets/test_movielens.py:264:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
size = '100k', local_cache_path = '/tmp/pytest-of-miguel/pytest-77/tmpq4bt6q3n'
movie_col = 'itemID', title_col = 'title', genres_col = None, year_col = None
def load_item_df(
size="100k",
local_cache_path=None,
movie_col=DEFAULT_ITEM_COL,
title_col=None,
genres_col=None,
year_col=None,
):
"""Loads Movie info.
Args:
size (str): Size of the data to load. One of ("100k", "1m", "10m", "20m").
local_cache_path (str): Path (directory or a zip file) to cache the downloaded zip file.
If None, all the intermediate files will be stored in a temporary directory and removed after use.
movie_col (str): Movie id column name.
title_col (str): Movie title column name. If None, the column will not be loaded.
genres_col (str): Genres column name. Genres are '|' separated string.
If None, the column will not be loaded.
year_col (str): Movie release year column name. If None, the column will not be loaded.
Returns:
pandas.DataFrame: Movie information data, such as title, genres, and release year.
"""
size = size.lower()
if size not in DATA_FORMAT:
raise ValueError(f"Size: {size}. " + ERROR_MOVIE_LENS_SIZE)
with download_path(local_cache_path) as path:
filepath = os.path.join(path, "ml-{}.zip".format(size))
> _, item_datapath = _maybe_download_and_extract(size, filepath)
E ValueError: too many values to unpack (expected 2)
recommenders/datasets/movielens.py:335: ValueError
----------------------------- Captured stderr call ------------------------------
100%|██████████| 4.81k/4.81k [00:01<00:00, 3.37kKB/s]
_ test_load_item_df[1m-3883-1-Toy Story (1995)-Animation|Children's|Comedy-1995] _
size = '1m', num_movies = 3883, movie_example = 1
title_example = 'Toy Story (1995)'
genres_example = "Animation|Children's|Comedy", year_example = '1995'
tmp = '/tmp/pytest-of-miguel/pytest-77/tmpirqfi1bs'
@pytest.mark.parametrize(
"size, num_movies, movie_example, title_example, genres_example, year_example",
[
("100k", 1682, 1, "Toy Story (1995)", "Animation|Children's|Comedy", "1995"),
("1m", 3883, 1, "Toy Story (1995)", "Animation|Children's|Comedy", "1995"),
(
"10m",
10681,
1,
"Toy Story (1995)",
"Adventure|Animation|Children|Comedy|Fantasy",
"1995",
),
(
"20m",
27278,
1,
"Toy Story (1995)",
"Adventure|Animation|Children|Comedy|Fantasy",
"1995",
),
],
)
def test_load_item_df(
size,
num_movies,
movie_example,
title_example,
genres_example,
year_example,
tmp,
):
"""Test movielens item data load (not rating data)"""
> df = load_item_df(size, local_cache_path=tmp, title_col="title")
tests/data_validation/recommenders/datasets/test_movielens.py:264:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
size = '1m', local_cache_path = '/tmp/pytest-of-miguel/pytest-77/tmpirqfi1bs'
movie_col = 'itemID', title_col = 'title', genres_col = None, year_col = None
def load_item_df(
size="100k",
local_cache_path=None,
movie_col=DEFAULT_ITEM_COL,
title_col=None,
genres_col=None,
year_col=None,
):
"""Loads Movie info.
Args:
size (str): Size of the data to load. One of ("100k", "1m", "10m", "20m").
local_cache_path (str): Path (directory or a zip file) to cache the downloaded zip file.
If None, all the intermediate files will be stored in a temporary directory and removed after use.
movie_col (str): Movie id column name.
title_col (str): Movie title column name. If None, the column will not be loaded.
genres_col (str): Genres column name. Genres are '|' separated string.
If None, the column will not be loaded.
year_col (str): Movie release year column name. If None, the column will not be loaded.
Returns:
pandas.DataFrame: Movie information data, such as title, genres, and release year.
"""
size = size.lower()
if size not in DATA_FORMAT:
raise ValueError(f"Size: {size}. " + ERROR_MOVIE_LENS_SIZE)
with download_path(local_cache_path) as path:
filepath = os.path.join(path, "ml-{}.zip".format(size))
> _, item_datapath = _maybe_download_and_extract(size, filepath)
E ValueError: too many values to unpack (expected 2)
recommenders/datasets/movielens.py:335: ValueError
----------------------------- Captured stderr call ------------------------------
100%|██████████| 5.78k/5.78k [00:01<00:00, 4.72kKB/s]
_ test_load_item_df[10m-10681-1-Toy Story (1995)-Adventure|Animation|Children|Comedy|Fantasy-1995] _
size = '10m', num_movies = 10681, movie_example = 1
title_example = 'Toy Story (1995)'
genres_example = 'Adventure|Animation|Children|Comedy|Fantasy'
year_example = '1995', tmp = '/tmp/pytest-of-miguel/pytest-77/tmp068p6mvt'
@pytest.mark.parametrize(
"size, num_movies, movie_example, title_example, genres_example, year_example",
[
("100k", 1682, 1, "Toy Story (1995)", "Animation|Children's|Comedy", "1995"),
("1m", 3883, 1, "Toy Story (1995)", "Animation|Children's|Comedy", "1995"),
(
"10m",
10681,
1,
"Toy Story (1995)",
"Adventure|Animation|Children|Comedy|Fantasy",
"1995",
),
(
"20m",
27278,
1,
"Toy Story (1995)",
"Adventure|Animation|Children|Comedy|Fantasy",
"1995",
),
],
)
def test_load_item_df(
size,
num_movies,
movie_example,
title_example,
genres_example,
year_example,
tmp,
):
"""Test movielens item data load (not rating data)"""
> df = load_item_df(size, local_cache_path=tmp, title_col="title")
tests/data_validation/recommenders/datasets/test_movielens.py:264:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
size = '10m', local_cache_path = '/tmp/pytest-of-miguel/pytest-77/tmp068p6mvt'
movie_col = 'itemID', title_col = 'title', genres_col = None, year_col = None
def load_item_df(
size="100k",
local_cache_path=None,
movie_col=DEFAULT_ITEM_COL,
title_col=None,
genres_col=None,
year_col=None,
):
"""Loads Movie info.
Args:
size (str): Size of the data to load. One of ("100k", "1m", "10m", "20m").
local_cache_path (str): Path (directory or a zip file) to cache the downloaded zip file.
If None, all the intermediate files will be stored in a temporary directory and removed after use.
movie_col (str): Movie id column name.
title_col (str): Movie title column name. If None, the column will not be loaded.
genres_col (str): Genres column name. Genres are '|' separated string.
If None, the column will not be loaded.
year_col (str): Movie release year column name. If None, the column will not be loaded.
Returns:
pandas.DataFrame: Movie information data, such as title, genres, and release year.
"""
size = size.lower()
if size not in DATA_FORMAT:
raise ValueError(f"Size: {size}. " + ERROR_MOVIE_LENS_SIZE)
with download_path(local_cache_path) as path:
filepath = os.path.join(path, "ml-{}.zip".format(size))
> _, item_datapath = _maybe_download_and_extract(size, filepath)
E ValueError: too many values to unpack (expected 2)
recommenders/datasets/movielens.py:335: ValueError
----------------------------- Captured stderr call ------------------------------
100%|██████████| 64.0k/64.0k [00:03<00:00, 16.5kKB/s]
_ test_load_item_df[20m-27278-1-Toy Story (1995)-Adventure|Animation|Children|Comedy|Fantasy-1995] _
size = '20m', num_movies = 27278, movie_example = 1
title_example = 'Toy Story (1995)'
genres_example = 'Adventure|Animation|Children|Comedy|Fantasy'
year_example = '1995', tmp = '/tmp/pytest-of-miguel/pytest-77/tmp81nnj7fw'
@pytest.mark.parametrize(
"size, num_movies, movie_example, title_example, genres_example, year_example",
[
("100k", 1682, 1, "Toy Story (1995)", "Animation|Children's|Comedy", "1995"),
("1m", 3883, 1, "Toy Story (1995)", "Animation|Children's|Comedy", "1995"),
(
"10m",
10681,
1,
"Toy Story (1995)",
"Adventure|Animation|Children|Comedy|Fantasy",
"1995",
),
(
"20m",
27278,
1,
"Toy Story (1995)",
"Adventure|Animation|Children|Comedy|Fantasy",
"1995",
),
],
)
def test_load_item_df(
size,
num_movies,
movie_example,
title_example,
genres_example,
year_example,
tmp,
):
"""Test movielens item data load (not rating data)"""
> df = load_item_df(size, local_cache_path=tmp, title_col="title")
tests/data_validation/recommenders/datasets/test_movielens.py:264:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
size = '20m', local_cache_path = '/tmp/pytest-of-miguel/pytest-77/tmp81nnj7fw'
movie_col = 'itemID', title_col = 'title', genres_col = None, year_col = None
def load_item_df(
size="100k",
local_cache_path=None,
movie_col=DEFAULT_ITEM_COL,
title_col=None,
genres_col=None,
year_col=None,
):
"""Loads Movie info.
Args:
size (str): Size of the data to load. One of ("100k", "1m", "10m", "20m").
local_cache_path (str): Path (directory or a zip file) to cache the downloaded zip file.
If None, all the intermediate files will be stored in a temporary directory and removed after use.
movie_col (str): Movie id column name.
title_col (str): Movie title column name. If None, the column will not be loaded.
genres_col (str): Genres column name. Genres are '|' separated string.
If None, the column will not be loaded.
year_col (str): Movie release year column name. If None, the column will not be loaded.
Returns:
pandas.DataFrame: Movie information data, such as title, genres, and release year.
"""
size = size.lower()
if size not in DATA_FORMAT:
raise ValueError(f"Size: {size}. " + ERROR_MOVIE_LENS_SIZE)
with download_path(local_cache_path) as path:
filepath = os.path.join(path, "ml-{}.zip".format(size))
> _, item_datapath = _maybe_download_and_extract(size, filepath)
E ValueError: too many values to unpack (expected 2)
recommenders/datasets/movielens.py:335: ValueError
----------------------------- Captured stderr call ------------------------------
100%|██████████| 194k/194k [00:09<00:00, 19.6kKB/s]
_ test_load_spark_df[100k-100000-1682-1-Toy Story (1995)-Animation|Children's|Comedy-1995] _
size = '100k', num_samples = 100000, num_movies = 1682, movie_example = 1
title_example = 'Toy Story (1995)'
genres_example = "Animation|Children's|Comedy", year_example = '1995'
tmp = '/tmp/pytest-of-miguel/pytest-77/tmpi7dpzm72'
spark = <pyspark.sql.session.SparkSession object at 0x7fc65a36bbd0>
@pytest.mark.spark
@pytest.mark.parametrize(
"size, num_samples, num_movies, movie_example, title_example, genres_example, year_example",
[
(
"100k",
100000,
1682,
1,
"Toy Story (1995)",
"Animation|Children's|Comedy",
"1995",
),
(
"1m",
1000209,
3883,
1,
"Toy Story (1995)",
"Animation|Children's|Comedy",
"1995",
),
(
"10m",
10000054,
10681,
1,
"Toy Story (1995)",
"Adventure|Animation|Children|Comedy|Fantasy",
"1995",
),
(
"20m",
20000263,
27278,
1,
"Toy Story (1995)",
"Adventure|Animation|Children|Comedy|Fantasy",
"1995",
),
],
)
def test_load_spark_df(
size,
num_samples,
num_movies,
movie_example,
title_example,
genres_example,
year_example,
tmp,
spark,
):
"""Test MovieLens dataset load into pySpark.DataFrame"""
# Test if correct data are loaded
header = ["1", "2", "3"]
schema = StructType(
[
StructField("u", IntegerType()),
StructField("m", IntegerType()),
]
)
with pytest.warns(Warning):
df = load_spark_df(
spark, size=size, local_cache_path=tmp, header=header, schema=schema
)
assert df.count() == num_samples
# Test if schema is used when both schema and header are provided
assert len(df.columns) == len(schema)
# Test if raw-zip file, rating file, and item file are cached
> assert len(os.listdir(tmp)) == 3
E AssertionError: assert 4 == 3
E + where 4 = len(['u.data', 'u.user', 'u.item', 'ml-100k.zip'])
E + where ['u.data', 'u.user', 'u.item', 'ml-100k.zip'] = <built-in function listdir>('/tmp/pytest-of-miguel/pytest-77/tmpi7dpzm72')
E + where <built-in function listdir> = os.listdir
tests/data_validation/recommenders/datasets/test_movielens.py:488: AssertionError
----------------------------- Captured stderr call ------------------------------
100%|██████████| 4.81k/4.81k [00:01<00:00, 4.35kKB/s]
_ test_load_spark_df[1m-1000209-3883-1-Toy Story (1995)-Animation|Children's|Comedy-1995] _
size = '1m', num_samples = 1000209, num_movies = 3883, movie_example = 1
title_example = 'Toy Story (1995)'
genres_example = "Animation|Children's|Comedy", year_example = '1995'
tmp = '/tmp/pytest-of-miguel/pytest-77/tmpipedkdgk'
spark = <pyspark.sql.session.SparkSession object at 0x7fc65a36bbd0>
@pytest.mark.spark
@pytest.mark.parametrize(
"size, num_samples, num_movies, movie_example, title_example, genres_example, year_example",
[
(
"100k",
100000,
1682,
1,
"Toy Story (1995)",
"Animation|Children's|Comedy",
"1995",
),
(
"1m",
1000209,
3883,
1,
"Toy Story (1995)",
"Animation|Children's|Comedy",
"1995",
),
(
"10m",
10000054,
10681,
1,
"Toy Story (1995)",
"Adventure|Animation|Children|Comedy|Fantasy",
"1995",
),
(
"20m",
20000263,
27278,
1,
"Toy Story (1995)",
"Adventure|Animation|Children|Comedy|Fantasy",
"1995",
),
],
)
def test_load_spark_df(
size,
num_samples,
num_movies,
movie_example,
title_example,
genres_example,
year_example,
tmp,
spark,
):
"""Test MovieLens dataset load into pySpark.DataFrame"""
# Test if correct data are loaded
header = ["1", "2", "3"]
schema = StructType(
[
StructField("u", IntegerType()),
StructField("m", IntegerType()),
]
)
with pytest.warns(Warning):
df = load_spark_df(
spark, size=size, local_cache_path=tmp, header=header, schema=schema
)
assert df.count() == num_samples
# Test if schema is used when both schema and header are provided
assert len(df.columns) == len(schema)
# Test if raw-zip file, rating file, and item file are cached
> assert len(os.listdir(tmp)) == 3
E AssertionError: assert 4 == 3
E + where 4 = len(['users.dat', 'ml-1m.zip', 'ratings.dat', 'movies.dat'])
E + where ['users.dat', 'ml-1m.zip', 'ratings.dat', 'movies.dat'] = <built-in function listdir>('/tmp/pytest-of-miguel/pytest-77/tmpipedkdgk')
E + where <built-in function listdir> = os.listdir
tests/data_validation/recommenders/datasets/test_movielens.py:488: AssertionError
----------------------------- Captured stderr call ------------------------------
100%|██████████| 5.78k/5.78k [00:01<00:00, 4.72kKB/s]
=============================== slowest durations ===============================
229.46s call tests/data_validation/recommenders/datasets/test_movielens.py::test_load_pandas_df[20m-20000263-27278-1-Toy Story (1995)-Adventure|Animation|Children|Comedy|Fantasy-1995]
187.80s call tests/data_validation/recommenders/datasets/test_movielens.py::test_load_spark_df[20m-20000263-27278-1-Toy Story (1995)-Adventure|Animation|Children|Comedy|Fantasy-1995]
108.28s call tests/data_validation/recommenders/datasets/test_movielens.py::test_load_pandas_df[10m-10000054-10681-1-Toy Story (1995)-Adventure|Animation|Children|Comedy|Fantasy-1995]
105.43s call tests/data_validation/recommenders/datasets/test_movielens.py::test_load_spark_df[10m-10000054-10681-1-Toy Story (1995)-Adventure|Animation|Children|Comedy|Fantasy-1995]
79.74s call tests/data_validation/recommenders/datasets/test_movielens.py::test_load_spark_df_mock_100__with_custom_param__succeed
71.77s call tests/data_validation/recommenders/datasets/test_movielens.py::test_load_spark_df_mock_100__with_default_param__succeed
55.13s call tests/data_validation/recommenders/datasets/test_movielens.py::test_load_pandas_df_mock_100__with_default_param__succeed
54.47s call tests/data_validation/recommenders/datasets/test_movielens.py::test_load_pandas_df_mock_100__with_custom_param__succeed
14.17s call tests/data_validation/recommenders/datasets/test_movielens.py::test_mock_movielens_schema__has_default_col_names[100]
13.83s call tests/data_validation/recommenders/datasets/test_movielens.py::test_load_item_df[20m-27278-1-Toy Story (1995)-Adventure|Animation|Children|Comedy|Fantasy-1995]
12.26s call tests/data_validation/recommenders/datasets/test_movielens.py::test_load_user_df_error[20m]
10.89s call tests/data_validation/recommenders/datasets/test_movielens.py::test_download_and_extract_movielens[20m]
6.74s call tests/data_validation/recommenders/datasets/test_movielens.py::test_load_item_df[10m-10681-1-Toy Story (1995)-Adventure|Animation|Children|Comedy|Fantasy-1995]
6.13s call tests/data_validation/recommenders/datasets/test_movielens.py::test_load_user_df_error[10m]
5.83s setup tests/data_validation/recommenders/datasets/test_movielens.py::test_mock_movielens_schema__get_spark_df__return_success[0-101-True-True]
5.76s call tests/data_validation/recommenders/datasets/test_movielens.py::test_mock_movielens_schema__get_spark_df__return_success[0-101-True-True]
5.19s call tests/data_validation/recommenders/datasets/test_movielens.py::test_load_spark_df[1m-1000209-3883-1-Toy Story (1995)-Animation|Children's|Comedy-1995]
4.80s call tests/data_validation/recommenders/datasets/test_movielens.py::test_load_pandas_df[1m-1000209-3883-1-Toy Story (1995)-Animation|Children's|Comedy-1995]
4.46s call tests/data_validation/recommenders/datasets/test_movielens.py::test_download_and_extract_movielens[10m]
3.49s call tests/data_validation/recommenders/datasets/test_movielens.py::test_mock_movielens_schema__get_spark_df__return_success[3-101-True-True]
3.30s call tests/data_validation/recommenders/datasets/test_movielens.py::test_mock_movielens_schema__has_default_col_names[10]
3.24s call tests/data_validation/recommenders/datasets/test_movielens.py::test_mock_movielens_schema__get_spark_df__return_success[10-101-True-True]
3.18s call tests/data_validation/recommenders/datasets/test_movielens.py::test_mock_movielens_schema__get_spark_df__return_success[10-101-False-False]
3.13s call tests/data_validation/recommenders/datasets/test_movielens.py::test_mock_movielens_schema__get_df__return_success[10--1-None-False-True]
3.09s call tests/data_validation/recommenders/datasets/test_movielens.py::test_load_spark_df[100k-100000-1682-1-Toy Story (1995)-Animation|Children's|Comedy-1995]
2.96s call tests/data_validation/recommenders/datasets/test_movielens.py::test_mock_movielens_schema__get_spark_df__return_success[10-101-False-True]
2.86s call tests/data_validation/recommenders/datasets/test_movielens.py::test_mock_movielens_schema__get_spark_df__return_success[10-101-True-False]
2.85s call tests/data_validation/recommenders/datasets/test_movielens.py::test_mock_movielens_schema__get_df__return_success[10--1-None-True-False]
2.78s call tests/data_validation/recommenders/datasets/test_movielens.py::test_mock_movielens_schema__get_df__return_success[10--1-None-True-True]
2.78s call tests/data_validation/recommenders/datasets/test_movielens.py::test_mock_movielens_schema__get_spark_df__return_success[3-101-False-True]
2.65s call tests/data_validation/recommenders/datasets/test_movielens.py::test_mock_movielens_schema__get_spark_df__return_success[3-101-False-False]
2.64s call tests/data_validation/recommenders/datasets/test_movielens.py::test_mock_movielens_schema__get_spark_df__return_success[3-101-True-False]
2.53s call tests/data_validation/recommenders/datasets/test_movielens.py::test_mock_movielens_schema__get_df__return_success[10--1-None-False-False]
2.38s call tests/data_validation/recommenders/datasets/test_movielens.py::test_mock_movielens_schema__get_df_remove_default_col__return_success[4]
2.26s call tests/data_validation/recommenders/datasets/test_movielens.py::test_mock_movielens_schema__get_spark_df__data_serialization_default_param
2.26s call tests/data_validation/recommenders/datasets/test_movielens.py::test_load_pandas_df[100k-100000-1682-1-Toy Story (1995)-Animation|Children's|Comedy-1995]
2.25s call tests/data_validation/recommenders/datasets/test_movielens.py::test_mock_movielens_schema__get_df__return_success[3--1-None-False-False]
2.25s call tests/data_validation/recommenders/datasets/test_movielens.py::test_mock_movielens_schema__get_df__return_success[3--1-None-True-False]
2.15s call tests/data_validation/recommenders/datasets/test_movielens.py::test_mock_movielens_schema__get_df__return_success[3--1-None-False-True]
2.07s call tests/data_validation/recommenders/datasets/test_movielens.py::test_mock_movielens_schema__get_spark_df__store_tmp_file
2.05s call tests/data_validation/recommenders/datasets/test_movielens.py::test_load_user_df[1m-6040-1-1-F-K-12 student-48067]
2.00s call tests/data_validation/recommenders/datasets/test_movielens.py::test_mock_movielens_schema__get_df__return_success[3--1-None-True-True]
1.99s call tests/data_validation/recommenders/datasets/test_movielens.py::test_load_item_df[100k-1682-1-Toy Story (1995)-Animation|Children's|Comedy-1995]
1.98s call tests/data_validation/recommenders/datasets/test_movielens.py::test_load_item_df[1m-3883-1-Toy Story (1995)-Animation|Children's|Comedy-1995]
1.96s call tests/data_validation/recommenders/datasets/test_movielens.py::test_download_and_extract_movielens[1m]
1.95s call tests/data_validation/recommenders/datasets/test_movielens.py::test_download_and_extract_movielens[100k]
1.73s call tests/data_validation/recommenders/datasets/test_movielens.py::test_load_user_df[100k-943-1-24-M-technician-85711]
1.45s call tests/data_validation/recommenders/datasets/test_movielens.py::test_mock_movielens_schema__get_df_remove_default_col__return_success[2]
1.40s call tests/data_validation/recommenders/datasets/test_movielens.py::test_mock_movielens_schema__get_df_remove_default_col__return_success[3]
1.36s call tests/data_validation/recommenders/datasets/test_movielens.py::test_mock_movielens_schema__get_df__return_success[10--1-2-True-True]
1.35s call tests/data_validation/recommenders/datasets/test_movielens.py::test_mock_movielens_schema__get_df__return_success[10--1-2-True-False]
1.28s call tests/data_validation/recommenders/datasets/test_movielens.py::test_mock_movielens_schema__get_df__return_success[3--1-2-False-True]
1.25s call tests/data_validation/recommenders/datasets/test_movielens.py::test_mock_movielens_schema__get_df__return_success[10--1-2-False-True]
1.23s call tests/data_validation/recommenders/datasets/test_movielens.py::test_mock_movielens_schema__get_df__return_success[3--1-2-True-True]
1.21s call tests/data_validation/recommenders/datasets/test_movielens.py::test_mock_movielens_schema__get_df__return_success[3--1-2-True-False]
1.19s teardown tests/data_validation/recommenders/datasets/test_movielens.py::test_load_spark_df[20m-20000263-27278-1-Toy Story (1995)-Adventure|Animation|Children|Comedy|Fantasy-1995]
1.09s call tests/data_validation/recommenders/datasets/test_movielens.py::test_mock_movielens_schema__get_df__return_success[10--1-2-False-False]
0.98s call tests/data_validation/recommenders/datasets/test_movielens.py::test_mock_movielens_schema__get_df__return_success[3--1-2-False-False]
0.47s call tests/data_validation/recommenders/datasets/test_movielens.py::test_mock_movielens_schema__get_spark_df__return_success[0-101-True-False]
0.41s call tests/data_validation/recommenders/datasets/test_movielens.py::test_mock_movielens_schema__get_spark_df__return_success[0-101-False-True]
0.39s call tests/data_validation/recommenders/datasets/test_movielens.py::test_mock_movielens_schema__get_spark_df__return_success[0-101-False-False]
0.17s teardown tests/data_validation/recommenders/datasets/test_movielens.py::test_load_pandas_df[20m-20000263-27278-1-Toy Story (1995)-Adventure|Animation|Children|Comedy|Fantasy-1995]
0.09s teardown tests/data_validation/recommenders/datasets/test_movielens.py::test_load_item_df[20m-27278-1-Toy Story (1995)-Adventure|Animation|Children|Comedy|Fantasy-1995]
0.06s teardown tests/data_validation/recommenders/datasets/test_movielens.py::test_load_user_df_error[20m]
0.05s teardown tests/data_validation/recommenders/datasets/test_movielens.py::test_load_item_df[10m-10681-1-Toy Story (1995)-Adventure|Animation|Children|Comedy|Fantasy-1995]
0.03s teardown tests/data_validation/recommenders/datasets/test_movielens.py::test_download_and_extract_movielens[20m]
0.03s teardown tests/data_validation/recommenders/datasets/test_movielens.py::test_load_user_df_error[10m]
0.03s teardown tests/data_validation/recommenders/datasets/test_movielens.py::test_load_pandas_df[10m-10000054-10681-1-Toy Story (1995)-Adventure|Animation|Children|Comedy|Fantasy-1995]
0.02s teardown tests/data_validation/recommenders/datasets/test_movielens.py::test_load_spark_df[10m-10000054-10681-1-Toy Story (1995)-Adventure|Animation|Children|Comedy|Fantasy-1995]
0.02s call tests/data_validation/recommenders/datasets/test_movielens.py::test_mock_movielens_schema__get_df__return_success[0--1-None-False-True]
0.02s call tests/data_validation/recommenders/datasets/test_movielens.py::test_mock_movielens_schema__get_df__return_success[0--1-2-True-True]
0.02s call tests/data_validation/recommenders/datasets/test_movielens.py::test_mock_movielens_schema__get_df__return_success[0--1-2-True-False]
0.02s call tests/data_validation/recommenders/datasets/test_movielens.py::test_mock_movielens_schema__get_df__return_success[0--1-2-False-False]
0.02s call tests/data_validation/recommenders/datasets/test_movielens.py::test_mock_movielens_schema__get_df__return_success[0--1-None-False-False]
0.02s call tests/data_validation/recommenders/datasets/test_movielens.py::test_mock_movielens_schema__get_df__return_success[0--1-2-False-True]
0.01s call tests/data_validation/recommenders/datasets/test_movielens.py::test_mock_movielens_schema__get_df__return_success[0--1-None-True-False]
0.01s call tests/data_validation/recommenders/datasets/test_movielens.py::test_mock_movielens_schema__get_df__return_success[0--1-None-True-True]
0.01s teardown tests/data_validation/recommenders/datasets/test_movielens.py::test_download_and_extract_movielens[10m]
0.01s teardown tests/data_validation/recommenders/datasets/test_movielens.py::test_load_item_df[100k-1682-1-Toy Story (1995)-Animation|Children's|Comedy-1995]
0.01s teardown tests/data_validation/recommenders/datasets/test_movielens.py::test_load_item_df[1m-3883-1-Toy Story (1995)-Animation|Children's|Comedy-1995]
0.01s teardown tests/data_validation/recommenders/datasets/test_movielens.py::test_load_user_df[1m-6040-1-1-F-K-12 student-48067]
0.01s teardown tests/data_validation/recommenders/datasets/test_movielens.py::test_load_spark_df[1m-1000209-3883-1-Toy Story (1995)-Animation|Children's|Comedy-1995]
(131 durations < 0.005s hidden. Use -vv to show these durations.)
============================ short test summary info ============================
FAILED tests/data_validation/recommenders/datasets/test_movielens.py::test_download_and_extract_movielens[100k] - TypeError: extract_movielens() missing 1 required positional argument: 'user...
FAILED tests/data_validation/recommenders/datasets/test_movielens.py::test_download_and_extract_movielens[1m] - TypeError: extract_movielens() missing 1 required positional argument: 'user...
FAILED tests/data_validation/recommenders/datasets/test_movielens.py::test_download_and_extract_movielens[10m] - TypeError: extract_movielens() missing 1 required positional argument: 'user...
FAILED tests/data_validation/recommenders/datasets/test_movielens.py::test_download_and_extract_movielens[20m] - TypeError: extract_movielens() missing 1 required positional argument: 'user...
FAILED tests/data_validation/recommenders/datasets/test_movielens.py::test_load_pandas_df[100k-100000-1682-1-Toy Story (1995)-Animation|Children's|Comedy-1995] - AssertionError: assert 4 == 3
FAILED tests/data_validation/recommenders/datasets/test_movielens.py::test_load_pandas_df[1m-1000209-3883-1-Toy Story (1995)-Animation|Children's|Comedy-1995] - AssertionError: assert 4 == 3
FAILED tests/data_validation/recommenders/datasets/test_movielens.py::test_load_item_df[100k-1682-1-Toy Story (1995)-Animation|Children's|Comedy-1995] - ValueError: too many values to unpack (expected 2)
FAILED tests/data_validation/recommenders/datasets/test_movielens.py::test_load_item_df[1m-3883-1-Toy Story (1995)-Animation|Children's|Comedy-1995] - ValueError: too many values to unpack (expected 2)
FAILED tests/data_validation/recommenders/datasets/test_movielens.py::test_load_item_df[10m-10681-1-Toy Story (1995)-Adventure|Animation|Children|Comedy|Fantasy-1995] - ValueError: too many values to unpack (expected 2)
FAILED tests/data_validation/recommenders/datasets/test_movielens.py::test_load_item_df[20m-27278-1-Toy Story (1995)-Adventure|Animation|Children|Comedy|Fantasy-1995] - ValueError: too many values to unpack (expected 2)
FAILED tests/data_validation/recommenders/datasets/test_movielens.py::test_load_spark_df[100k-100000-1682-1-Toy Story (1995)-Animation|Children's|Comedy-1995] - AssertionError: assert 4 == 3
FAILED tests/data_validation/recommenders/datasets/test_movielens.py::test_load_spark_df[1m-1000209-3883-1-Toy Story (1995)-Animation|Children's|Comedy-1995] - AssertionError: assert 4 == 3
=========== 12 failed, 59 passed, 6057 warnings in 2526.02s (0:42:06) ===========
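The four `TypeError` failures above suggest that a new required positional argument was added to `extract_movielens()` while existing callers still pass the old argument list. A minimal sketch of one backward-compatible option, assuming the new argument can be made optional; `extract_files` and `user_path` are hypothetical names used for illustration, not the library's actual signature:

```python
# Hypothetical stand-in for the real function; the name, parameters and
# return value are assumptions made only for this illustration.
def extract_files(size, rating_path, item_path, user_path=None):
    """Return the paths that would be extracted; user_path is optional."""
    paths = [rating_path, item_path]
    if user_path is not None:
        # Only handle the user file when the caller asks for it.
        paths.append(user_path)
    return tuple(paths)


# Old-style call with three positional arguments keeps working.
old_call = extract_files("100k", "ratings.dat", "items.dat")

# New-style call opts in to the user file.
new_call = extract_files("100k", "ratings.dat", "items.dat", user_path="users.dat")
```

With a default value, existing tests and callers that unpack two return values would not need to change, while new code can request the extra file explicitly.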
Collaborator
|
@daviddavo we had a weekly meeting with @anargyri. How is this PR going? |
Description
Added the ability to load user fields (gender, age, occupation and zip code) from the MovieLens dataset.
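As a rough illustration of what loading these fields involves, here is a minimal sketch that parses one row in the pipe-separated layout of the 100k user file (user id, age, gender, occupation, zip code). The column names and the `parse_users` helper are hypothetical and are not the API added by this PR; the sample row uses the values that appear in the `test_load_user_df[100k-...]` test ID.

```python
import csv
import io

# Hypothetical column names; the PR may use different ones.
USER_COLUMNS = ("user_id", "age", "gender", "occupation", "zip_code")


def parse_users(text, sep="|"):
    """Parse delimiter-separated user rows into a list of dicts."""
    reader = csv.reader(io.StringIO(text), delimiter=sep)
    # Skip empty lines; every value is kept as a string.
    return [dict(zip(USER_COLUMNS, row)) for row in reader if row]


# One sample row in the 100k layout.
users = parse_users("1|24|M|technician|85711\n")
```

The 1M variant stores the occupation as an integer index rather than a label, so a loader for that size would additionally need an index-to-string mapping.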
Related Issues
References
Checklist:
Commits are signed, e.g. git commit -s -m "your commit message".
This PR is being made to the staging branch AND NOT TO the main branch.