Updates to cec-dataprep #64

Open. Wants to merge 3 commits into base: master

Conversation

aunshx (Contributor) commented Nov 26, 2024

Updates for FRREDSS 2.0

  • Updated the db schema.
  • Added two new files that load the processed data into the db:
  1. split_csv.py - Splits the processed data into county_year.csv files, reprocesses them (removes column discrepancies), and stores them in a split_files folder. This folder has been uploaded to Box. Run this script only once, as follows:
    python split_csv.py path_of_the_processed_data_file.csv

  2. process_uploads.py - Takes each file from the split_files folder, adds it to the treatedclusters db, and then moves it to the upload_completed folder. It also checks whether the county+year already exists. To run it:
    python process_uploads.py path_of_the_split_files_folder
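
The flow described in step 2 can be sketched roughly as below. This is a minimal illustration, not the actual process_uploads.py: `already_loaded` and `load_file` are hypothetical callables standing in for the database check and insert, and two behaviours are assumptions rather than facts from this PR: upload_completed is placed next to the split folder, and files whose county+year already exists are skipped and left in place.

```python
import shutil
from pathlib import Path


def process_uploads(split_dir, already_loaded, load_file,
                    done_dirname="upload_completed"):
    """Load each county_year.csv once, then move it out of the split folder.

    already_loaded(county, year) -> bool and load_file(path) are hypothetical
    stand-ins for the real database check and insert.
    """
    split_dir = Path(split_dir)
    # Assumption: the completed folder sits beside the split folder.
    done_dir = split_dir.parent / done_dirname
    done_dir.mkdir(exist_ok=True)

    uploaded = []
    for path in sorted(split_dir.glob("*.csv")):
        # Assumes names of the form county_year.csv; split on the last underscore.
        county, _, year = path.stem.rpartition("_")
        if already_loaded(county, year):
            # Assumption: existing county+year files are skipped and left in place.
            continue
        load_file(path)
        # Move only after the load call returned without raising.
        shutil.move(str(path), str(done_dir / path.name))
        uploaded.append(path.name)
    return uploaded
```

Moving a file only after `load_file` returns means a failed run can be restarted without re-uploading the files that already succeeded.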

aunshx self-assigned this Nov 26, 2024
aunshx changed the title from "Updated sql tables" to "Updates to cec-dataprep" Nov 26, 2024
);

create index idx_find_clusters
-- Index on the treatedclusters table
CREATE TABLE idx_find_clusters
Member

@aunshx this doesn't look right. I think it should be an index, not a table.

land_use text,
forest_type text,
haz_class int4,
"Stem6to9_tonsAcre" double precision,
Member

We should use the same naming as the rest of the variables, so stem6to9_tonsAcre for this one. Hopefully it wouldn't affect the import, but if it does, I still think it'd be better to handle the difference during import than to have two different naming schemes within one table.
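
One way to handle the difference during import is to normalize the CSV header names before inserting. The following is a minimal sketch, assuming the rows are read with `csv.DictReader`; `normalize_header` and `rows_with_normalized_headers` are hypothetical names, and the rule of lower-casing only the first character is inferred from the single example stem6to9_tonsAcre.

```python
import csv


def normalize_header(name):
    """Lower-case only the first character: Stem6to9_tonsAcre -> stem6to9_tonsAcre."""
    return name[:1].lower() + name[1:]


def rows_with_normalized_headers(csv_file):
    """Yield each CSV row as a dict keyed by the normalized header names."""
    reader = csv.DictReader(csv_file)
    for row in reader:
        yield {normalize_header(key): value for key, value in row.items()}
```

If the database is PostgreSQL (an assumption; the int4 and float4 types suggest it), unquoted identifiers are folded to lower case, so the column would be stored as stem6to9_tonsacre unless the name stays quoted.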

"Stem9Plus_tonsAcre" double precision,
"Branch_tonsAcre" double precision,
"Foliage_tonsAcre" double precision,
wood_density float4
Member

Maybe make this double precision like the others. I'm sure it could technically fit in a float4, but I don't think the difference is consequential, so it's better to pick one type and go with it.

for filename in csv_files:
    file_path = os.path.join(split_dir, filename)

    county, year = filename.replace('.csv', '').split('_')
Member

Probably worth checking that you got back real values for these; otherwise an extra file put into this directory will kill the whole run.
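
As written, the tuple unpacking raises a ValueError for any name that does not split into exactly two parts, which stops the loop. A minimal sketch of a more defensive parse follows; `parse_county_year` and the pattern are hypothetical, and the four-digit year and the allowance for underscores inside the county name are assumptions about the file naming.

```python
import re

# Assumed pattern: county name (which may contain underscores), then a
# four-digit year, then the .csv extension.
FILENAME_PATTERN = re.compile(r"^(?P<county>.+)_(?P<year>\d{4})\.csv$")


def parse_county_year(filename):
    """Return (county, year) for county_year.csv names, or None otherwise."""
    match = FILENAME_PATTERN.match(filename)
    if match is None:
        return None
    return match.group("county"), int(match.group("year"))
```

The loop can then skip (and ideally log) any file for which this returns None instead of aborting.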

row = row[:15] + row[16:] # Remove the extra field

county = row[13]
year = row[2]
Member

With the csv reader, is there not a way to get the year and county by header name, instead of needing to make sure the year is always in column 3?
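
The standard library's `csv.DictReader` does this: it keys each row by the header line. A minimal sketch, assuming the file has a header row; the column names "County" and "Year" and the function `county_year_pairs` are placeholders, since the real header names are not shown in this diff.

```python
import csv


def county_year_pairs(csv_file, county_field="County", year_field="Year"):
    """Yield (county, year) per row, looked up by header name, not position."""
    reader = csv.DictReader(csv_file)
    # Fail early with a clear message if the expected headers are absent.
    missing = {county_field, year_field} - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing expected columns: {sorted(missing)}")
    for row in reader:
        yield row[county_field], row[year_field]
```

With this approach the column order in the processed file can change without silently reading the wrong field.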

2 participants