CNEr - Methods conserved non-coding elements

5. Methods conserved non-coding elements

5.1. CNEr


Our sampling included four species, but CNEr supports only pairwise analysis. We therefore ran six pairwise analyses and combined the three different CNEr output files for each species.
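The arithmetic behind "six analyses, three per species" can be checked with a minimal Python sketch. The species labels A to D are placeholders, since the species are not named in this section.

```python
from itertools import combinations

# Placeholder labels (assumption): the four species are not named here.
species = ["A", "B", "C", "D"]

# Every unordered pair of species corresponds to one pairwise CNEr run.
pairs = list(combinations(species, 2))

# Group the runs by species: each species takes part in three of the six runs.
runs_per_species = {s: [p for p in pairs if s in p] for s in species}

print(len(pairs))             # 6
print(runs_per_species["A"])  # [('A', 'B'), ('A', 'C'), ('A', 'D')]
```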

Figure 5.1.: Graphical overview of the steps in the pipeline used for CNE prediction.

5.2. CNE_gene_neighbourhood.pl

To this end, we developed a custom Perl script, CNE_gene_neighbourhood.pl, to combine the output files. The script needs one reference species, for which the three different analyses were conducted. All CNE candidates were sorted by their position in the reference species and combined if they overlapped by at least one nucleotide. For combined CNEs, the borders of the CNE were adapted to include the longest sequence possible.

Each CNE of the reference species ended up with at least one CNE candidate in another species. The CNEs in the other species were not checked for overlap in this script.
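The merging rule for the reference species can be illustrated with a short Python sketch. It is not the Perl script itself. The function name `merge_overlapping`, the tuple layout and the inclusive coordinates are assumptions made for illustration.

```python
def merge_overlapping(cnes):
    """Merge CNE candidates that share at least one nucleotide.

    cnes: iterable of (scaffold, start, end) tuples with inclusive
    coordinates (an assumption; the original format is not specified).
    Returns a sorted list of merged (scaffold, start, end) tuples.
    """
    merged = []
    for scaffold, start, end in sorted(cnes):
        if merged and merged[-1][0] == scaffold and start <= merged[-1][2]:
            # Overlap: keep the outermost borders (longest possible sequence).
            prev = merged[-1]
            merged[-1] = (scaffold, prev[1], max(prev[2], end))
        else:
            merged.append((scaffold, start, end))
    return merged


candidates = [
    ("scaffold_1", 100, 250),  # candidate from one pairwise run
    ("scaffold_1", 240, 400),  # candidate from another run, overlaps the first
    ("scaffold_1", 401, 500),  # adjacent, but shares no nucleotide
    ("scaffold_2", 100, 250),  # same coordinates, different scaffold
]
print(merge_overlapping(candidates))
```

In this toy input only the first two candidates are combined, giving a CNE from 100 to 400 on scaffold_1; the adjacent candidate and the one on the other scaffold stay separate.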

The CNEs of the reference species were not only checked for overlap but also sorted into clusters. Two CNEs belonged to the same cluster if they were ≤20,000 bp apart. Woolfe et al. (2004) showed that 85 % of CNEs still cluster within a distance of 370 kb; however, we chose this conservative distance to account for our two assemblies that are not at chromosome level.
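A minimal Python sketch of the clustering rule follows. It assumes that "distance" means the gap between the end of the current cluster and the start of the next CNE on the same scaffold; the function name `cluster_cnes` and the data layout are hypothetical.

```python
MAX_GAP = 20_000  # bp, the threshold named in the text


def cluster_cnes(cnes, max_gap=MAX_GAP):
    """Group (scaffold, start, end) CNEs into clusters.

    A CNE joins the current cluster if it lies on the same scaffold and
    starts at most max_gap bp after the current cluster end (assumption).
    """
    clusters = []
    for scaffold, start, end in sorted(cnes):
        last = clusters[-1] if clusters else None
        if (last is not None and last["scaffold"] == scaffold
                and start - last["end"] <= max_gap):
            last["end"] = max(last["end"], end)  # cluster border grows
            last["cnes"].append((start, end))
        else:
            clusters.append({"scaffold": scaffold, "start": start,
                             "end": end, "cnes": [(start, end)]})
    return clusters


cnes = [
    ("scaffold_1", 1_000, 1_200),
    ("scaffold_1", 21_200, 21_300),  # gap of exactly 20,000 bp: same cluster
    ("scaffold_1", 41_301, 41_400),  # gap of 20,001 bp: new cluster
]
for c in cluster_cnes(cnes):
    print(c["scaffold"], c["start"], c["end"], len(c["cnes"]))
```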

This script produces three output files. The first contains the nucleotide sequence of each CNE (cne_sequence_species.fa). The second contains information on each cluster (scaffold, start, stop, count of CNEs, distance to scaffold end) and all genes that were found within the cluster or within ≤500 kb of it (cne_closest_gene_species.tab). The third contains the position information for each CNE in a cluster (cne_in_cluster_species.txt). We used 500 kb as the distance because Woolfe et al. (2004) found that 93 % of the CNE clusters they identified had a trans-dev gene within this distance.

The species part of the file name is a placeholder for the reference species. These files still listed overlapping CNEs separately; these were combined into one in the next step.
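The 500 kb neighbourhood search can be sketched in Python as follows. The function `genes_near_cluster`, the tuple layouts and the gene names are illustrative assumptions, not the format used by CNE_gene_neighbourhood.pl.

```python
WINDOW = 500_000  # bp, the search distance named in the text


def genes_near_cluster(cluster, genes, window=WINDOW):
    """Return IDs of genes inside the cluster or within `window` bp of it.

    cluster: (scaffold, start, end); genes: (gene_id, scaffold, start, end).
    Both layouts are assumptions made for this sketch.
    """
    scaffold, c_start, c_end = cluster
    lo, hi = c_start - window, c_end + window
    return [gene_id
            for gene_id, g_scaffold, g_start, g_end in genes
            # A gene counts if any part of it falls into [lo, hi].
            if g_scaffold == scaffold and g_end >= lo and g_start <= hi]


cluster = ("scaffold_1", 1_000_000, 1_050_000)
genes = [
    ("gene_a", "scaffold_1", 400_000, 499_999),      # 500,001 bp away: excluded
    ("gene_b", "scaffold_1", 480_000, 500_000),      # exactly 500 kb upstream
    ("gene_c", "scaffold_1", 1_010_000, 1_020_000),  # inside the cluster
    ("gene_d", "scaffold_1", 1_550_000, 1_560_000),  # exactly 500 kb downstream
    ("gene_e", "scaffold_2", 1_010_000, 1_020_000),  # different scaffold
]
print(genes_near_cluster(cluster, genes))
```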

5.3. unique_cnes_in_cluster.pl

The next script, unique_cnes_in_cluster.pl, takes the file cne_in_cluster_species.txt as input and merges all overlapping CNEs. It also checks whether two clusters should be merged. This can happen because the borders of each cluster, meaning its outermost CNEs, are expanded with each CNE that is added. In some cases the distance between two clusters was ≤20,000 bp after the analyses were finished, which classifies their CNEs as belonging to the same cluster. For example, if two clusters were 20,030 bp apart and one was extended by 31 bp, they would now count as one cluster. This re-evaluation was not done in the previous steps.
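The worked example above (20,030 bp minus a 31 bp extension gives 19,999 bp) can be reproduced with a small Python sketch. The function `merge_close_clusters` and the coordinates are hypothetical, and the gap is assumed to be measured from the end of one cluster to the start of the next.

```python
MAX_GAP = 20_000  # bp


def merge_close_clusters(clusters, max_gap=MAX_GAP):
    """Merge (scaffold, start, end) clusters whose gap is <= max_gap bp."""
    merged = []
    for scaffold, start, end in sorted(clusters):
        if (merged and merged[-1][0] == scaffold
                and start - merged[-1][2] <= max_gap):
            prev = merged[-1]
            merged[-1] = (scaffold, prev[1], max(prev[2], end))
        else:
            merged.append((scaffold, start, end))
    return merged


# Before extension: the gap is 120,030 - 100,000 = 20,030 bp.
before = [("scaffold_1", 50_000, 100_000), ("scaffold_1", 120_030, 150_000)]
# The second cluster grows by 31 bp towards the first: the gap is 19,999 bp.
after = [("scaffold_1", 50_000, 100_000), ("scaffold_1", 119_999, 150_000)]

print(len(merge_close_clusters(before)))  # 2
print(merge_close_clusters(after))        # [('scaffold_1', 50000, 150000)]
```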

5.4. cne_gene_count.pl

The script cne_gene_count.pl takes the file with all genes neighbouring a cluster, counts how often each gene was present, and saves this to a file (gene_list_cne_species.tab). It also created two files that contained, for each cluster, the number of genes found upstream of, downstream of, or within the cluster. The first one contained all clusters (gene_in_cluster_count_allcne_species.csv) and the second only those clusters that had a minimal distance of 500 kb to each scaffold end (gene_in_cluster_count_500cne_species.csv). We included the second file because the number of genes neighbouring clusters with ≤500 kb distance to the scaffold end might be artificially low, as the search for genes stops at the end of a scaffold even though the actual chromosome might be longer. The length of each individual cluster was added using the script cne_clusterno_length2csv.pl.
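The two kinds of counts and the scaffold-end filter could look like this in Python. The record layout, cluster IDs, gene names and distances are invented for illustration, and clusters at exactly 500 kb from the scaffold end are assumed to pass the filter.

```python
from collections import Counter, defaultdict

MIN_END_DISTANCE = 500_000  # bp

# Hypothetical records: (cluster_id, gene_id, position relative to cluster).
neighbours = [
    ("c1", "gene_a", "upstream"),
    ("c1", "gene_b", "within"),
    ("c1", "gene_c", "downstream"),
    ("c2", "gene_c", "upstream"),
    ("c2", "gene_d", "downstream"),
]
# Hypothetical distance of each cluster to its nearest scaffold end.
dist_to_scaffold_end = {"c1": 750_000, "c2": 120_000}

# How often each gene appears next to any cluster.
gene_counts = Counter(gene for _, gene, _ in neighbours)

# Per cluster: number of genes upstream, downstream or within.
per_cluster = defaultdict(Counter)
for cluster_id, _, position in neighbours:
    per_cluster[cluster_id][position] += 1

# Second table: only clusters far enough from both scaffold ends.
far_from_ends = {c: counts for c, counts in per_cluster.items()
                 if dist_to_scaffold_end[c] >= MIN_END_DISTANCE}

print(gene_counts["gene_c"])   # 2
print(sorted(far_from_ends))   # ['c1']
```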

5.5. cne_get_one_closest_gene.pl

Further analysis focused on just the closest gene in cis on each side of a cluster. This gene was identified using the script cne_get_one_closest_gene.pl. To find it, the script took the list of genes per cluster (cne_closest_gene_species.tab) created by the CNE_gene_neighbourhood.pl script and the annotation of the species. The gene closest to the cluster was selected, and its direction relative to the cluster orientation was checked. If the closest gene was in cis to the cluster, it was added together with the cluster information to the file cne_closest_two_genes_all_species.tab. If the gene was in trans, 'na' was added to this file instead, and the cluster with the gene information was stored in the file cne_closest_two_genes_trans_all_species.tab.
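The distance part of this step, picking the nearest gene on each side of a cluster, can be sketched in Python. The cis/trans orientation check is deliberately left out because its exact criterion is not spelled out in this section. The function `closest_genes` and the tuple layout are assumptions, as is the choice to skip genes overlapping the cluster.

```python
def closest_genes(cluster_start, cluster_end, genes):
    """Return the nearest gene upstream and downstream of a cluster.

    genes: list of (gene_id, start, end) on the same scaffold (assumed
    layout). Genes overlapping the cluster are skipped in this sketch.
    Each result is (gene_id, distance_in_bp) or None if no gene exists.
    """
    upstream = downstream = None
    for gene_id, start, end in genes:
        if end < cluster_start:
            dist = cluster_start - end
            if upstream is None or dist < upstream[1]:
                upstream = (gene_id, dist)
        elif start > cluster_end:
            dist = start - cluster_end
            if downstream is None or dist < downstream[1]:
                downstream = (gene_id, dist)
    return upstream, downstream


genes = [
    ("g1", 1_000, 2_000),
    ("g2", 7_000, 9_000),    # nearest upstream gene
    ("g3", 25_000, 26_000),  # nearest downstream gene
    ("g4", 15_000, 16_000),  # inside the cluster, skipped here
]
print(closest_genes(10_000, 20_000, genes))
```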

5.6. cne_synteny.pl

The next step was to get the information for each gene that was identified as closest to a cluster. We checked its type, meaning protein-coding or lncRNA, and focused further on those clusters that had an lncRNA identified as a possible interaction partner.

For those clusters with at least one lncRNA in cis as interaction partner, we checked the synteny of the single CNEs in the cluster.

The script cne_synteny.pl was used for this. It takes three inputs: a file with the information on the clusters of interest in the reference species (scaffold, start/stop position, closest gene (upstream/downstream), distance to the gene (upstream/downstream), number of CNEs in the cluster, distance to scaffold end), in this case those clusters consisting of ≥10 CNEs with an lncRNA next to them; the CNEr output files; and, for each species, the file containing the final CNE coordinates as provided by the unique_cnes_in_cluster.pl script. The script reports the partner of each CNE in a cluster together with the position of that partner CNE in the other species. The file it produces (cne_closest_two_genes_all_hit-info_lnc_10_no-dups_species.tab) contains, for each cluster provided, a list of CNE matches in the other species. Because it links the CNEs between species, this file allows us to check for synteny.

5.7. cne_diff_species_ident.pl

To visualise and compare the CNEr results, we created Venn diagrams for each species.

For each species we collected the CNEs predicted in all three runs of CNEr and compared them. Predictions were counted as the same if they overlapped by at least 1 nt. Using the script cne_diff_species_ident.pl, we created an ID for each CNE that made them comparable between the three output files. The list created for each output was then passed on to Venny (Oliveros, 2015), which created the Venn diagrams.
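One possible way to build such comparable IDs is sketched below in Python. It is an assumption, not the logic of the Perl script: all predictions from the three runs are merged into union regions, and every prediction receives the ID of the region containing it. Predictions linked only through a chain of overlaps therefore share one ID.

```python
def assign_shared_ids(runs):
    """Map each run to the set of region IDs its predictions fall into.

    runs: dict of run name -> list of (scaffold, start, end) tuples with
    inclusive coordinates (assumed layout).
    """
    # Build union regions over the predictions of all runs.
    regions = []
    for scaffold, start, end in sorted(iv for ivs in runs.values() for iv in ivs):
        if regions and regions[-1][0] == scaffold and start <= regions[-1][2]:
            regions[-1][2] = max(regions[-1][2], end)
        else:
            regions.append([scaffold, start, end])

    # Give every prediction the ID of the region that contains it.
    ids = {name: set() for name in runs}
    for name, ivs in runs.items():
        for scaffold, start, end in ivs:
            for r_scaffold, r_start, r_end in regions:
                if scaffold == r_scaffold and r_start <= start and end <= r_end:
                    ids[name].add(f"{r_scaffold}:{r_start}-{r_end}")
                    break
    return ids


# Hypothetical predictions for one species from its three pairwise runs.
runs = {
    "vs_B": [("s1", 100, 200), ("s1", 500, 600)],
    "vs_C": [("s1", 150, 260)],
    "vs_D": [("s1", 250, 300), ("s1", 900, 950)],
}
ids = assign_shared_ids(runs)
shared_by_all = ids["vs_B"] & ids["vs_C"] & ids["vs_D"]
print(sorted(shared_by_all))
```

Each set in `ids` could be written out one ID per line as input for a Venn diagram tool. Note that the predictions from "vs_B" and "vs_D" in the shared region do not overlap each other directly; a stricter pairwise definition of "same" would need a different ID scheme.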

6. Results conserved non-coding

In document: Non-coding RNAs and Conserved Non-coding Elements in Insect Genomes (pages 85-91)