{ "cells": [ { "cell_type": "markdown", "id": "f11a76bf", "metadata": {}, "source": [ "# Exercise on joins and anti-joins: add information from other tables" ] }, { "cell_type": "code", "execution_count": 1, "id": "b6f2742b", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "\n", "# Optional pandas settings: maximum number of rows/columns to display\n", "#pd.set_option('display.max_rows', 1000)\n", "#pd.set_option('display.max_columns', 100)" ] }, { "cell_type": "markdown", "id": "2967c84e", "metadata": {}, "source": [ "# Load data from clinical trial\n", "\n", "The data come in two files. The file `predimed_records.csv` contains the clinical data for each patient, except the diet group they were assigned to. The file `predimed_mapping.csv` records which patient was assigned to which diet group. " ] }, { "cell_type": "code", "execution_count": 2, "id": "ed626ee3", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " patient-id location-id sex age smoke bmi waist wth htn \\\n", "0 436 4 Male 58 Former 33.53 122 0.753086 No \n", "1 1130 4 Male 77 Current 31.05 119 0.730061 Yes \n", "2 1131 4 Female 72 Former 30.86 106 0.654321 No \n", "3 1132 4 Male 71 Former 27.68 118 0.694118 Yes \n", "4 1111 2 Female 79 Never 35.94 129 0.806250 Yes \n", "\n", " diab hyperchol famhist hormo p14 toevent event \n", "0 No Yes No No 10 5.374401 Yes \n", "1 Yes No No No 10 6.097194 No \n", "2 Yes No Yes No 8 5.946612 No \n", "3 No Yes No No 8 2.907598 Yes \n", "4 No Yes No No 9 4.761123 No " ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.read_csv('../../data/predimed_records.csv')\n", "df.head()" ] }, { "cell_type": "code", "execution_count": 3, "id": "48d5375f", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " location-id patient-id group\n", "0 2 885 MedDiet + VOO\n", "1 1 182 MedDiet + Nuts\n", "2 1 971 MedDiet + Nuts\n", "3 2 691 MedDiet + Nuts\n", "4 2 632 Control" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "info = pd.read_csv('../../data/predimed_mapping.csv')\n", "info.head()" ] }, { "cell_type": "markdown", "id": "2b4b98ed-d7ec-4b7c-b983-adc616d2f16f", "metadata": {}, "source": [ "The study was conducted at 5 different locations, and each location assigned its own identification number `patient-id` to its participants. A patient is therefore identified by the pair (`location-id`, `patient-id`)." ] }, { "cell_type": "code", "execution_count": 4, "id": "b9dbc492-1489-4530-96ac-5f33f7389caa", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([2, 1, 3, 4, 5])" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "info['location-id'].unique()" ] }, { "cell_type": "markdown", "id": "2fef4d37", "metadata": {}, "source": [ "# 1. Add diet information to the patients' records\n", "\n", "* For how many patients do we have clinical information? (i.e., rows in `df`)\n", "* For how many patients do we have diet information? (i.e., rows in `info`)" ] }, { "cell_type": "code", "execution_count": 5, "id": "861ac334-14ce-490a-b3c4-877b32789f3e", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "6324" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(df)" ] }, { "cell_type": "code", "execution_count": 6, "id": "14f57842-5722-4953-88d6-d7cf3070400c", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "6287" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(info)" ] }, { "cell_type": "markdown", "id": "3f23fa17-af3e-41c3-883f-3e1279d4820e", "metadata": {}, "source": [ "Perform the merge, keeping in mind that it only makes sense to analyze patients with diet information. \n", "* Which type of merge would you do? \n", "* For how many patients do we have full information (records and the diet they followed)? " ] }, { "cell_type": "code", "execution_count": 7, "id": "35e19a53", "metadata": {}, "outputs": [], "source": [ "# Right merge: keep every patient listed in `info`, i.e. every patient with a diet group\n", "df_with_info = df.merge(info, on=['patient-id', 'location-id'], how='right')" ] }, { "cell_type": "code", "execution_count": 8, "id": "eac1244f", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "patient-id 6287\n", "location-id 6287\n", "sex 6287\n", "age 6287\n", "smoke 6287\n", "bmi 6287\n", "waist 6287\n", "wth 6287\n", "htn 6287\n", "diab 6287\n", "hyperchol 6287\n", "famhist 6287\n", "hormo 5629\n", "p14 6287\n", "toevent 6287\n", "event 6287\n", "group 6287\n", "dtype: int64" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_with_info.count()" ] }, { "cell_type": "code", "execution_count": 9, "id": "0d770d69-7a0a-47c9-bb5d-93a11329e7ad", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " patient-id location-id sex age smoke bmi waist wth \\\n", "0 885 2 Male 74 Former 29.94 107 0.681529 \n", "1 182 1 Female 60 Former 30.76 85 0.555556 \n", "2 971 1 Female 65 Never 23.81 86 0.540881 \n", "3 691 2 Female 64 Never 32.70 102 0.637500 \n", "4 632 2 Female 73 Never 28.32 91 0.594771 \n", "... ... ... ... ... ... ... ... ... \n", "6282 855 1 Male 55 Former 29.77 106 0.612717 \n", "6283 711 4 Female 78 Never 34.72 104 0.712329 \n", "6284 113 4 Female 60 Never 31.48 98 0.640523 \n", "6285 50 5 Female 77 Never 28.92 92 0.625850 \n", "6286 85 4 Female 61 Never 37.50 106 0.662500 \n", "\n", " htn diab hyperchol famhist hormo p14 toevent event group \n", "0 Yes Yes Yes No NaN 8 5.711157 No MedDiet + VOO \n", "1 No No Yes Yes Yes 10 3.274470 No MedDiet + Nuts \n", "2 Yes Yes Yes No No 6 3.088296 No MedDiet + Nuts \n", "3 Yes Yes No No No 8 3.028063 No MedDiet + Nuts \n", "4 Yes Yes Yes Yes No 9 5.919233 No Control \n", "... ... ... ... ... ... ... ... ... ... \n", "6282 Yes No Yes Yes No 9 3.449692 No MedDiet + VOO \n", "6283 Yes No Yes No No 9 1.921971 No Control \n", "6284 No Yes Yes No No 10 6.403833 No MedDiet + VOO \n", "6285 Yes No Yes No No 11 5.360712 No MedDiet + Nuts \n", "6286 Yes No Yes Yes No 9 1.823409 No MedDiet + Nuts \n", "\n", "[6287 rows x 17 columns]" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_with_info" ] }, { "cell_type": "markdown", "id": "946beb08-30a5-4020-8612-360385cdfc1e", "metadata": {}, "source": [ "# 2. Add location information to the patients' records\n", "\n", "There were five locations where the study was conducted. Here is a DataFrame containing the information of each location. \n", "\n", "- Add a new column to the dataset that contains the city where each patient was recorded.\n" ] }, { "cell_type": "code", "execution_count": 10, "id": "36ce0688-d421-4a07-b00e-0e9b3201f0e0", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " location-id City\n", "0 1 Madrid\n", "1 2 Valencia\n", "2 3 Barcelona\n", "3 4 Bilbao\n", "4 5 Malaga" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "locations = pd.DataFrame.from_dict({'location-id': [1, 2, 3, 4, 5], \n", " 'City': ['Madrid', 'Valencia', 'Barcelona', 'Bilbao','Malaga']})\n", "locations" ] }, { "cell_type": "code", "execution_count": 11, "id": "b636dde4-129a-4dd1-8cbf-c539c9c8a5f2", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " patient-id location-id sex age smoke bmi waist wth \\\n", "0 182 1 Female 60 Former 30.76 85 0.555556 \n", "1 971 1 Female 65 Never 23.81 86 0.540881 \n", "2 485 1 Male 71 Former 22.41 92 0.516854 \n", "3 621 1 Male 71 Former 32.70 110 0.662651 \n", "4 954 1 Male 70 Former 29.48 107 0.633136 \n", "... ... ... ... ... ... ... ... ... \n", "6282 1120 5 Male 63 Never 28.83 92 0.534884 \n", "6283 1082 5 Female 63 Never 25.33 92 0.613333 \n", "6284 302 5 Female 59 Never 31.71 80 0.519481 \n", "6285 566 5 Male 58 Former 27.81 98 0.583333 \n", "6286 50 5 Female 77 Never 28.92 92 0.625850 \n", "\n", " htn diab hyperchol famhist hormo p14 toevent event group \\\n", "0 No No Yes Yes Yes 10 3.274470 No MedDiet + Nuts \n", "1 Yes Yes Yes No No 6 3.088296 No MedDiet + Nuts \n", "2 Yes Yes Yes No No 9 0.438056 No Control \n", "3 No Yes Yes No No 11 2.661191 No MedDiet + Nuts \n", "4 No Yes Yes No No 8 4.186174 No MedDiet + Nuts \n", "... ... ... ... ... ... ... ... ... ... \n", "6282 Yes No Yes No NaN 9 3.392197 No MedDiet + VOO \n", "6283 Yes No Yes No No 5 5.886379 No MedDiet + Nuts \n", "6284 No Yes Yes No No 8 5.631759 No MedDiet + VOO \n", "6285 Yes No Yes No No 10 1.913758 No MedDiet + VOO \n", "6286 Yes No Yes No No 11 5.360712 No MedDiet + Nuts \n", "\n", " City \n", "0 Madrid \n", "1 Madrid \n", "2 Madrid \n", "3 Madrid \n", "4 Madrid \n", "... ... \n", "6282 Malaga \n", "6283 Malaga \n", "6284 Malaga \n", "6285 Malaga \n", "6286 Malaga \n", "\n", "[6287 rows x 18 columns]" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_with_info = df_with_info.merge(locations, on='location-id', how='right')\n", "df_with_info" ] }, { "cell_type": "markdown", "id": "44031178", "metadata": {}, "source": [ "# 3. Remove dropouts from the table\n", "\n", "Some patients dropped out of the study early on and should be removed from our analysis. Their IDs are stored in the file `dropped.csv`.\n", "1. Load the list of patients who dropped out from `dropped.csv`\n", "2. Use an anti-join to remove them from the table\n", "3. How many patients (rows) are left in the data?" ] }, { "cell_type": "code", "execution_count": 12, "id": "d1d4cc27", "metadata": {}, "outputs": [], "source": [ "dropped = pd.read_csv('dropped.csv')" ] }, { "cell_type": "code", "execution_count": 13, "id": "fbebbd97", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(42, 2)" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dropped.shape" ] }, { "cell_type": "code", "execution_count": 14, "id": "8a3c7943", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " location-id patient-id\n", "0 1 217\n", "1 1 1147\n", "2 1 1170\n", "3 1 627\n", "4 4 541" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dropped.head()" ] }, { "cell_type": "code", "execution_count": 15, "id": "573687e7", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " patient-id location-id sex age smoke bmi waist wth \\\n", "0 1 1 Female 77 Never 25.92 94 0.657343 \n", "1 2 1 Female 68 Never 34.85 150 0.949367 \n", "2 3 1 Female 66 Never 37.50 120 0.750000 \n", "3 4 1 Female 77 Never 29.26 93 0.628378 \n", "4 5 1 Female 60 Never 30.02 104 0.662420 \n", "... ... ... ... ... ... ... ... ... \n", "6282 1253 5 Male 79 Never 25.28 105 0.640244 \n", "6283 1254 5 Male 62 Former 27.10 104 0.594286 \n", "6284 1255 5 Female 65 Never 35.02 103 0.686667 \n", "6285 1256 5 Male 61 Never 28.42 94 0.576687 \n", "6286 1257 5 Male 58 Former 24.43 93 0.547059 \n", "\n", " htn diab hyperchol famhist hormo p14 toevent event group \\\n", "0 Yes No Yes Yes No 9 5.538672 No MedDiet + VOO \n", "1 Yes No Yes Yes NaN 10 3.063655 No MedDiet + Nuts \n", "2 Yes Yes No No No 6 5.590691 No MedDiet + Nuts \n", "3 Yes Yes No No No 6 5.456537 No MedDiet + VOO \n", "4 Yes No Yes No No 9 2.746064 No Control \n", "... ... ... ... ... ... ... ... ... ... \n", "6282 Yes No Yes No No 8 5.828884 No MedDiet + VOO \n", "6283 Yes No Yes Yes No 9 5.067762 No MedDiet + Nuts \n", "6284 Yes No Yes No No 10 1.993155 No MedDiet + VOO \n", "6285 Yes Yes No No No 9 2.039699 No MedDiet + Nuts \n", "6286 Yes Yes Yes No No 9 2.590007 No MedDiet + Nuts \n", "\n", " City _merge \n", "0 Madrid left_only \n", "1 Madrid left_only \n", "2 Madrid left_only \n", "3 Madrid left_only \n", "4 Madrid left_only \n", "... ... ... 
\n", "6282 Malaga left_only \n", "6283 Malaga left_only \n", "6284 Malaga left_only \n", "6285 Malaga left_only \n", "6286 Malaga left_only \n", "\n", "[6287 rows x 19 columns]" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Anti-join, step 1: merge with indicator=True so the `_merge` column records where each row came from\n", "temp = df_with_info.merge(dropped, on=['location-id', 'patient-id'], how='outer', indicator=True)\n", "temp" ] }, { "cell_type": "code", "execution_count": 16, "id": "a4a6574b", "metadata": {}, "outputs": [], "source": [ "# Anti-join, step 2: keep only the rows absent from `dropped` ('left_only'), then drop the helper column\n", "df_without_dropped = temp[temp['_merge'] == 'left_only'].drop('_merge', axis=1)" ] }, { "cell_type": "code", "execution_count": 17, "id": "8fd89a40", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(6245, 18)" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_without_dropped.shape" ] }, { "cell_type": "code", "execution_count": 18, "id": "07f4776a", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " patient-id location-id sex age smoke bmi waist wth htn \\\n", "0 1 1 Female 77 Never 25.92 94 0.657343 Yes \n", "1 2 1 Female 68 Never 34.85 150 0.949367 Yes \n", "2 3 1 Female 66 Never 37.50 120 0.750000 Yes \n", "3 4 1 Female 77 Never 29.26 93 0.628378 Yes \n", "4 5 1 Female 60 Never 30.02 104 0.662420 Yes \n", "\n", " diab hyperchol famhist hormo p14 toevent event group City \n", "0 No Yes Yes No 9 5.538672 No MedDiet + VOO Madrid \n", "1 No Yes Yes NaN 10 3.063655 No MedDiet + Nuts Madrid \n", "2 Yes No No No 6 5.590691 No MedDiet + Nuts Madrid \n", "3 Yes No No No 6 5.456537 No MedDiet + VOO Madrid \n", "4 No Yes No No 9 2.746064 No Control Madrid " ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_without_dropped.head()" ] }, { "cell_type": "markdown", "id": "84270332", "metadata": {}, "source": [ "# 4. Save final result in `processed_data_predimed.csv`\n", "\n", "1. Use the `.to_csv` method of pandas DataFrames" ] }, { "cell_type": "code", "execution_count": 19, "id": "85902eea", "metadata": {}, "outputs": [], "source": [ "df_without_dropped.to_csv('processed_data_predimed.csv', index=False)" ] }, { "cell_type": "code", "execution_count": null, "id": "c7bcff45", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.5" } }, "nbformat": 4, "nbformat_minor": 5 }