{ "nbformat": 4, "nbformat_minor": 0, "metadata": { "colab": { "provenance": [], "collapsed_sections": [ "xJJatsUev-We" ] }, "kernelspec": { "name": "python3", "display_name": "Python 3" }, "language_info": { "name": "python" } }, "cells": [ { "cell_type": "code", "source": [ "import pandas as pd\n", "import re" ], "metadata": { "id": "40uCf910RUdV" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "df = pd.read_excel('/content/data_en_it_tagged.xlsx')" ], "metadata": { "id": "w3Iur8DPVlHe" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "df" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 424 }, "id": "9LaOvu7jVpYL", "outputId": "02e3a47b-b776-4226-e003-56c94fa73709" }, "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " direction id text_type \\\n", "0 en_to_it 0001en_sp_st en_sp_st \n", "1 en_to_it 0002en_sp_st en_sp_st \n", "2 en_to_it 0003en_sp_st en_sp_st \n", "3 en_to_it 0004en_sp_st en_sp_st \n", "4 en_to_it 0005en_sp_st en_sp_st \n", ".. ... ... ... \n", "523 it_to_en 1064en_wr_tt en_wr_tt \n", "524 it_to_en 1065en_wr_tt en_wr_tt \n", "525 it_to_en 1066en_wr_tt en_wr_tt \n", "526 it_to_en 1067en_wr_tt en_wr_tt \n", "527 it_to_en 1068en_wr_tt en_wr_tt \n", "\n", " text \n", "0 Thank/VV you/PP President/NP ./SENT Well/RB... \n", "1 Thank/VV you/PP very/RB much/JJ Mr/NP Pre... \n", "2 Excuse/VV me/PP ./SENT Thank/VV you/PP Pre... \n", "3 President/NP ,/, the/DT upheaval/NN in/IN ... \n", "4 Thank/VV you/PP Mr/NP President/NP ./SENT ... \n", ".. ... \n", "523 Mr/NP President/NP ,/, ladies/NNS and/CC g... \n", "524 Mr/NP President/NP ,/, High/NP Representati... \n", "525 Mr/NP President/NP ,/, ladies/NNS and/CC g... \n", "526 Mr/NP President/NP ,/, ladies/NNS and/CC g... \n", "527 Mr/NP President/NP ,/, ladies/NNS and/CC g... \n", "\n", "[528 rows x 4 columns]" ], "text/html": [ "\n", "
\n", " | direction | id | text_type | text |
---|---|---|---|---|
0 | en_to_it | 0001en_sp_st | en_sp_st | Thank/VV you/PP President/NP ./SENT Well/RB... |
1 | en_to_it | 0002en_sp_st | en_sp_st | Thank/VV you/PP very/RB much/JJ Mr/NP Pre... |
2 | en_to_it | 0003en_sp_st | en_sp_st | Excuse/VV me/PP ./SENT Thank/VV you/PP Pre... |
3 | en_to_it | 0004en_sp_st | en_sp_st | President/NP ,/, the/DT upheaval/NN in/IN ... |
4 | en_to_it | 0005en_sp_st | en_sp_st | Thank/VV you/PP Mr/NP President/NP ./SENT ... |
... | ... | ... | ... | ... |
523 | it_to_en | 1064en_wr_tt | en_wr_tt | Mr/NP President/NP ,/, ladies/NNS and/CC g... |
524 | it_to_en | 1065en_wr_tt | en_wr_tt | Mr/NP President/NP ,/, High/NP Representati... |
525 | it_to_en | 1066en_wr_tt | en_wr_tt | Mr/NP President/NP ,/, ladies/NNS and/CC g... |
526 | it_to_en | 1067en_wr_tt | en_wr_tt | Mr/NP President/NP ,/, ladies/NNS and/CC g... |
527 | it_to_en | 1068en_wr_tt | en_wr_tt | Mr/NP President/NP ,/, ladies/NNS and/CC g... |
528 rows × 4 columns
\n", "\n", " | direction | id | text_type | sttr | text |
---|---|---|---|---|---|
0 | en_to_it | 0001en_sp_st, 0002en_sp_st, 0003en_sp_st | en_sp_st | 0.461924 | Thank you President Well some colleagues took ... |
1 | en_to_it | 0003en_sp_st, 0004en_sp_st, 0005en_sp_st, 0006... | en_sp_st | 0.478478 | refrain from using violence and that there wil... |
2 | en_to_it | 0006en_sp_st, 0007en_sp_st, 0008en_sp_st | en_sp_st | 0.431156 | their hard work their thoughtfulness and commi... |
3 | en_to_it | 0008en_sp_st, 0009en_sp_st, 0010en_sp_st | en_sp_st | 0.424000 | into our Committee to do that I think has been... |
4 | en_to_it | 0010en_sp_st, 0011en_sp_st, 0012en_sp_st, 0013... | en_sp_st | 0.427711 | that these measures when endorsed and adopted ... |
... | ... | ... | ... | ... | ... |
137 | it_to_en | 1045en_wr_tt, 1046en_wr_tt, 1047en_wr_tt, 1048... | en_wr_tt | 0.439759 | regions and between administrative structures ... |
138 | it_to_en | 1049en_wr_tt, 1050en_wr_tt, 1051en_wr_tt, 1052... | en_wr_tt | 0.442000 | by the European Union are those specifically i... |
139 | it_to_en | 1055en_wr_tt, 1056en_wr_tt, 1057en_wr_tt, 1058... | en_wr_tt | 0.463928 | couples around the world who every day face th... |
140 | it_to_en | 1059en_wr_tt, 1060en_wr_tt, 1061en_wr_tt, 1062... | en_wr_tt | 0.483903 | We are now about to adopt the agreement on Ira... |
141 | it_to_en | 1063en_wr_tt, 1064en_wr_tt, 1065en_wr_tt, 1066... | en_wr_tt | 0.458753 | book Premesse della politica Premises of pol... |
142 rows × 5 columns
\n", "