GitHub - mizuamedesu/plamo-2-translate-quantization

https://huggingface.co/pfnet/plamo-2-translate

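8-bit quantization should probably run with about 11 GB of memory, and 4-bit quantization with about 8 GB. Below are the BitsAndBytes results for the full-size model, 8-bit quantization, and 4-bit quantization.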
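For orientation, the three configurations in the log can be approximated with the sketch below. It is not taken from the repository's `quantization_4bit.py` or `quantization_8bit.py`. The helper names `quantization_kwargs` and `load_model` are hypothetical, and `device_map="auto"` is an assumption. The quantization settings mirror the `BitsAndBytesConfig` dumps printed in the log, and `trust_remote_code=True` reflects the custom `modeling_plamo.py` and `tokenization_plamo.py` files that the log shows being downloaded.

```python
from typing import Any, Dict, Optional

MODEL_NAME = "pfnet/plamo-2-translate"


def quantization_kwargs(mode: str) -> Optional[Dict[str, Any]]:
    """Return BitsAndBytesConfig keyword arguments for a benchmark mode.

    Hypothetical helper: the values mirror the configs printed in the log.
    """
    if mode == "none":
        return None  # full-size model, no quantization
    if mode == "8bit":
        return {"load_in_8bit": True}
    if mode == "4bit":
        return {
            "load_in_4bit": True,
            "bnb_4bit_quant_type": "nf4",
            "bnb_4bit_use_double_quant": True,
            "bnb_4bit_compute_dtype": "bfloat16",
        }
    raise ValueError(f"unknown mode: {mode!r}")


def load_model(mode: str = "4bit"):
    """Load tokenizer and model; needs transformers, bitsandbytes, and a CUDA GPU."""
    # Heavy imports are deferred so the config helper works without a GPU stack.
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    kwargs = quantization_kwargs(mode)
    quant_config = BitsAndBytesConfig(**kwargs) if kwargs else None

    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_NAME,
        quantization_config=quant_config,
        trust_remote_code=True,  # the model ships its own modeling code
        device_map="auto",       # assumption: let accelerate place the weights
    )
    return tokenizer, model
```

Calling `load_model("8bit")` or `load_model("4bit")` would then correspond to the two quantized runs, and `load_model("none")` to the full-size run.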

========================================
Device: cuda
Model: pfnet/plamo-2-translate
Test sentences: 10

==================================================
Benchmarking: No Quantization
==================================================
Loading model with quantization: None
tokenizer_config.json: 100%|███████████████| 1.43k/1.43k [00:00<00:00, 12.8MB/s]
tokenization_plamo.py: 100%|███████████████| 16.9k/16.9k [00:00<00:00, 40.4MB/s]
A new version of the following files was downloaded from https://huggingface.co/pfnet/plamo-2-translate:
- tokenization_plamo.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
tokenizer.jsonl: 100%|█████████████████████| 10.6M/10.6M [00:00<00:00, 54.3MB/s]
special_tokens_map.json: 100%|█████████████████| 587/587 [00:00<00:00, 4.86MB/s]
config.json: 100%|█████████████████████████| 1.18k/1.18k [00:00<00:00, 12.3MB/s]
modeling_plamo.py: 100%|███████████████████| 66.8k/66.8k [00:00<00:00, 12.7MB/s]
A new version of the following files was downloaded from https://huggingface.co/pfnet/plamo-2-translate:
- modeling_plamo.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
model.safetensors.index.json: 100%|████████| 37.2k/37.2k [00:00<00:00, 26.0MB/s]
model-00004-of-00004.safetensors: 100%|█████| 4.49G/4.49G [00:27<00:00, 163MB/s]
model-00001-of-00004.safetensors: 100%|█████| 4.77G/4.77G [00:29<00:00, 163MB/s]
model-00003-of-00004.safetensors: 100%|█████| 4.83G/4.83G [00:30<00:00, 161MB/s]
model-00002-of-00004.safetensors: 100%|█████| 4.96G/4.96G [00:30<00:00, 164MB/s]
Fetching 4 files: 100%|███████████████████████████| 4/4 [00:30<00:00,  7.60s/it]
Loading checkpoint shards: 100%|██████████████████| 4/4 [00:03<00:00,  1.08it/s]
generation_config.json: 100%|██████████████████| 132/132 [00:00<00:00, 1.42MB/s]
Model loading time: 44.36s
GPU memory usage: 18430.70MB
Running inference on 10 sentences...
[ 1] The weather is beautiful today -> 33.447s
[ 2] Machine learning is revolution -> 0.552s
[ 3] I love reading books in the li -> 0.489s
[ 4] The cat is sleeping on the sof -> 0.489s
[ 5] Artificial intelligence will c -> 0.549s
[ 6] She enjoys cooking traditional -> 0.489s
[ 7] The mountains look magnificent -> 0.496s
[ 8] Technology connects people aro -> 0.551s
[ 9] Music has the power to heal th -> 0.594s
[10] Education is the key to succes -> 0.561s

Results for No Quantization:
  Average inference time: 3.822s
  Total inference time: 38.220s
  Memory usage: 18430.70MB
  Model loading time: 44.36s

==================================================
Benchmarking: 8-bit Quantization
==================================================
Loading model with quantization: BitsAndBytesConfig {
  "_load_in_4bit": false,
  "_load_in_8bit": true,
  "bnb_4bit_compute_dtype": "float32",
  "bnb_4bit_quant_storage": "uint8",
  "bnb_4bit_quant_type": "fp4",
  "bnb_4bit_use_double_quant": false,
  "llm_int8_enable_fp32_cpu_offload": false,
  "llm_int8_has_fp16_weight": false,
  "llm_int8_skip_modules": null,
  "llm_int8_threshold": 6.0,
  "load_in_4bit": false,
  "load_in_8bit": true,
  "quant_method": "bitsandbytes"
}

Loading checkpoint shards: 100%|██████████████████| 4/4 [00:21<00:00,  5.26s/it]
Model loading time: 26.76s
GPU memory usage: 10145.92MB
Running inference on 10 sentences...
[ 1] The weather is beautiful today -> 0.690s
[ 2] Machine learning is revolution -> 1.146s
[ 3] I love reading books in the li -> 1.029s
[ 4] The cat is sleeping on the sof -> 1.028s
[ 5] Artificial intelligence will c -> 1.150s
[ 6] She enjoys cooking traditional -> 1.030s
[ 7] The mountains look magnificent -> 1.148s
[ 8] Technology connects people aro -> 1.268s
[ 9] Music has the power to heal th -> 1.146s
[10] Education is the key to succes -> 1.153s

Results for 8-bit Quantization:
  Average inference time: 1.079s
  Total inference time: 10.788s
  Memory usage: 10145.92MB
  Model loading time: 26.76s

==================================================
Benchmarking: 4-bit NF4 Quantization
==================================================
Loading model with quantization: BitsAndBytesConfig {
  "_load_in_4bit": true,
  "_load_in_8bit": false,
  "bnb_4bit_compute_dtype": "bfloat16",
  "bnb_4bit_quant_storage": "uint8",
  "bnb_4bit_quant_type": "nf4",
  "bnb_4bit_use_double_quant": true,
  "llm_int8_enable_fp32_cpu_offload": false,
  "llm_int8_has_fp16_weight": false,
  "llm_int8_skip_modules": null,
  "llm_int8_threshold": 6.0,
  "load_in_4bit": true,
  "load_in_8bit": false,
  "quant_method": "bitsandbytes"
}

Loading checkpoint shards: 100%|██████████████████| 4/4 [00:17<00:00,  4.42s/it]
Model loading time: 23.43s
GPU memory usage: 6120.78MB
Running inference on 10 sentences...
[ 1] The weather is beautiful today -> 0.399s
[ 2] Machine learning is revolution -> 0.668s
[ 3] I love reading books in the li -> 0.599s
[ 4] The cat is sleeping on the sof -> 0.598s
[ 5] Artificial intelligence will c -> 0.667s
[ 6] She enjoys cooking traditional -> 0.530s
[ 7] The mountains look magnificent -> 0.661s
[ 8] Technology connects people aro -> 0.660s
[ 9] Music has the power to heal th -> 0.658s
[10] Education is the key to succes -> 0.657s

Results for 4-bit NF4 Quantization:
  Average inference time: 0.610s
  Total inference time: 6.096s
  Memory usage: 6120.78MB
  Model loading time: 23.43s

============================================================
QUANTIZATION PERFORMANCE SUMMARY
============================================================
Memory Reduction (8-bit): 45.0%
Memory Reduction (4-bit): 66.8%
Inference Speedup (8-bit): 3.54x
Inference Speedup (4-bit): 6.27x

================================================================================
DETAILED BENCHMARK RESULTS
================================================================================
         Configuration Avg Inference Time (s) Memory Usage (MB) Model Loading Time (s)
       No Quantization                  3.822           18430.7                  44.36
    8-bit Quantization                  1.079           10145.9                  26.76
4-bit NF4 Quantization                  0.610            6120.8                  23.43

============================================================
SAMPLE TRANSLATIONS
============================================================

Input: The weather is beautiful today.
No Quantization     : 今日は天気が美しい。
8-bit Quantization  : 今日は天気が美しい。
4-bit NF4 Quantization: 今日は天気が美しい。

Input: Machine learning is revolutionizing technology.
No Quantization     : 機械学習はテクノロジーに革命をもたらしている。
8-bit Quantization  : 機械学習はテクノロジーに革命をもたらしている。
4-bit NF4 Quantization: 機械学習はテクノロジーに革命をもたらしている。

Input: I love reading books in the library.
No Quantization     : 私は図書館で本を読むのが好きだ。
8-bit Quantization  : 私は図書館で本を読むのが好きだ。
4-bit NF4 Quantization: 私は図書館で本を読むのが大好きです。
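
The summary figures in the log can be re-derived from the per-sentence timings, which also shows how much the first unquantized call (33.447 s) dominates the reported speedups. The sketch below recomputes the memory reduction and speedup as reported, then recomputes the averages with the first sentence of each run dropped. Treating that first call as one-off warm-up is an assumption; the log does not say what caused it. With the first call excluded, the unquantized run averages about 0.53 s per sentence, against about 1.12 s for 8-bit and 0.63 s for 4-bit, so the 3.54x and 6.27x speedups should be read with caution.

```python
from statistics import mean

# Per-sentence inference times (seconds), copied from the benchmark log.
TIMES = {
    "No Quantization": [33.447, 0.552, 0.489, 0.489, 0.549,
                        0.489, 0.496, 0.551, 0.594, 0.561],
    "8-bit Quantization": [0.690, 1.146, 1.029, 1.028, 1.150,
                           1.030, 1.148, 1.268, 1.146, 1.153],
    "4-bit NF4 Quantization": [0.399, 0.668, 0.599, 0.598, 0.667,
                               0.530, 0.661, 0.660, 0.658, 0.657],
}

# GPU memory usage (MB), copied from the benchmark log.
MEMORY_MB = {
    "No Quantization": 18430.70,
    "8-bit Quantization": 10145.92,
    "4-bit NF4 Quantization": 6120.78,
}


def memory_reduction_pct(base_mb: float, quant_mb: float) -> float:
    """Percentage of memory saved relative to the unquantized model."""
    return (1.0 - quant_mb / base_mb) * 100.0


def speedup(base_avg: float, quant_avg: float) -> float:
    """Ratio of unquantized to quantized average inference time."""
    return base_avg / quant_avg


def steady_state_mean(times: list) -> float:
    """Average with the first call dropped (assumed to be warm-up)."""
    return mean(times[1:])


if __name__ == "__main__":
    base = "No Quantization"
    for name in TIMES:
        line = f"{name}: avg {mean(TIMES[name]):.3f}s, " \
               f"avg without first call {steady_state_mean(TIMES[name]):.3f}s"
        if name != base:
            line += (
                f", memory -{memory_reduction_pct(MEMORY_MB[base], MEMORY_MB[name]):.1f}%"
                f", speedup {speedup(mean(TIMES[base]), mean(TIMES[name])):.2f}x"
            )
        print(line)
```

A fairer comparison would run a few untimed warm-up sentences before each configuration, or report the median rather than the mean.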

Repository files:
.dockerignore
.gitignore
Dockerfile
README.md
api.py
compose.yml
index.html
quantization_4bit.py
quantization_8bit.py
requirements.txt
